
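In NLP text classification, it is common to count how often each word occurs in a document; word frequency is one of the key features used for classification. Each entry is a pair: the word is the key and its frequency is the value. What is the best container for counting them? A map, of course. A vector or a list would also work, but keyed lookup in them is less efficient than in a map. A map is usually implemented as a red-black tree, so lookup takes O(log n), whereas searching a list or a vector is linear, O(n).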
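To make the comparison concrete, here is a minimal sketch of the two lookup styles. The helper names `FindInVector` and `FindInMap` and the `WordCount` typedef are made up for illustration and do not appear in the program that follows.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordCount;

// Linear search: entries are checked one by one, O(n) comparisons in the worst case.
// Returns the stored count, or -1 if the word is absent.
int FindInVector(const std::vector<WordCount>& counts, const std::string& word) {
    for (std::size_t i = 0; i < counts.size(); ++i) {
        if (counts[i].first == word) return counts[i].second;
    }
    return -1;
}

// Tree lookup: std::map::find needs O(log n) comparisons.
// Returns the stored count, or -1 if the word is absent.
int FindInMap(const std::map<std::string, int>& counts, const std::string& word) {
    std::map<std::string, int>::const_iterator it = counts.find(word);
    return it == counts.end() ? -1 : it->second;
}
```

For a handful of words the difference hardly matters; it starts to show once the vocabulary grows to thousands of distinct words and every word of the text triggers a lookup.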

Code first.

header

        
                    
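    #ifndef WORD_FREQUENCE_H_
    #define WORD_FREQUENCE_H_

    #include <cstddef>   // NULL
    #include <iostream>
    #include <map>
    #include <string>

    using std::map;

    // Counts how often each whitespace-separated word occurs in a text file.
    class WordFrequence{
    public:
        WordFrequence(): file_name_(NULL), text(NULL){}
        // const char* so that a string literal can be passed.
        WordFrequence(const char* file_name): file_name_(file_name), text(NULL){
            LoadFromFile();   // read the file into text
            ReplaceSymbol();  // turn punctuation into spaces
            parse();          // split into words and count them
        }
    private:
        const char* file_name_;
        char* text;           // NUL-terminated file contents (never freed in this demo)
        map<std::string, int> word_frequence_map_;
        void parse();
        void ReplaceSymbol();
        void LoadFromFile();
        bool IsWhiteChar(const char chr);
        friend std::ostream& operator<<(std::ostream& os, const WordFrequence& wf);
    };

    #endif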

cpp

        
                    
    #include "word_frequence.h"
    #include <cstring>    // strlen, strchr
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>

    // Punctuation characters that ReplaceSymbol() turns into spaces.
    const char* symbols = "~!@#$%^&*()_+-=[]\\{}|:\";',./<>?";
    const int MAX_SIZE = 100000;   // only the first MAX_SIZE-1 bytes are read

    // Word separators. '\0' is included so the end of the text closes the last word.
    bool WordFrequence::IsWhiteChar(const char chr){
        switch (chr){
            case '\t':
            case '\r':
            case '\n':
            case ' ':
            case '\0':
                return true;
            default:
                return false;
        }
    }

    void WordFrequence::LoadFromFile(){
        text = new char[MAX_SIZE]();   // zero-filled, so text is always NUL-terminated
        std::ifstream is(file_name_, std::fstream::in);
        if(!is){
            std::cerr<<"error: can't open file: "<<"["<<file_name_<<"]"<<std::endl;
            return;                    // leave text empty
        }
        is.read(text, MAX_SIZE - 1);   // keep the last byte as the terminator
    }

    void WordFrequence::parse(){
        word_frequence_map_.clear();
        int count = static_cast<int>(strlen(text));
        int index = 0;
        while(index < count){
            // Skip the separators in front of the next word.
            while(index < count && IsWhiteChar(text[index]))
                ++index;
            if(index >= count)
                break;
            // Advance i to the separator (or the terminating '\0') after the word.
            int i = index;
            while(!IsWhiteChar(text[i]))
                ++i;
            ++word_frequence_map_[std::string(text + index, i - index)];
            index = i;
        }
    }

    void WordFrequence::ReplaceSymbol(){
        for(char* p = text; *p != '\0'; ++p){
            if(strchr(symbols, *p) != NULL)
                *p = ' ';
        }
    }

    std::ostream& operator<<(std::ostream& os, const WordFrequence& wf){
        os<<"word\t\tfrequence"<<std::endl;
        os<<"-----------------------"<<std::endl;
        std::map<std::string, int>::const_iterator i_begin = wf.word_frequence_map_.begin();
        std::map<std::string, int>::const_iterator i_end = wf.word_frequence_map_.end();
        while(i_begin != i_end){
            os<<i_begin->first<<"\t\t"<<i_begin->second<<std::endl;
            ++i_begin;
        }
        return os;
    }

      
                
main

    #include <iostream>
    #include "word_frequence.h"

    using namespace std;

    int main(int argc, char* argv[])
    {
        WordFrequence wf("d:\\test.txt");
        cout << wf;   // print the word/frequency table
        return 0;
    }

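The implementation is straightforward: first load the text from the file, then replace the punctuation symbols in it, and finally scan the text once, putting each word that turns up into the map.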
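The three steps just described can also be sketched with standard-library pieces only. This is an alternative illustration, not the author's class: `CountWords` is a hypothetical helper, and `std::ispunct` stands in for the hand-written symbols table, so the set of replaced characters is an assumption and may differ slightly from the original list.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper, not part of the WordFrequence class above.
// Takes text that has already been loaded and returns word -> count.
std::map<std::string, int> CountWords(std::string text) {
    // Step 1: replace punctuation with spaces
    // (assumption: std::ispunct instead of the symbols table).
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (std::ispunct(static_cast<unsigned char>(text[i]))) text[i] = ' ';
    }
    // Step 2: scan once; operator>> skips runs of whitespace between words.
    std::map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) ++counts[word];
    return counts;
}
```

Letting `operator>>` do the splitting avoids the manual index bookkeeping, at the cost of copying the text into a string stream.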

      ++word_freq_map["word"];

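This one statement is remarkably convenient: it performs the whole map update by itself. If the word is already present, its value is incremented; if it is not, operator[] first inserts it with a value of 0, and the increment then makes it 1.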
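To show what that statement does under the hood, the sketch below puts the one-liner next to a spelled-out equivalent. Both function names, `Bump` and `BumpExplicit`, are invented for this illustration.

```cpp
#include <map>
#include <string>
#include <utility>

// Increments the count for `word` and returns the new count.
int Bump(std::map<std::string, int>& counts, const std::string& word) {
    // operator[] inserts (word, 0) if the key is missing, because int is
    // value-initialized, then returns a reference that ++ increments.
    return ++counts[word];
}

// The same behaviour written out with find and insert.
int BumpExplicit(std::map<std::string, int>& counts, const std::string& word) {
    std::map<std::string, int>::iterator it = counts.find(word);
    if (it == counts.end()) {
        it = counts.insert(std::make_pair(word, 0)).first;
    }
    return ++it->second;
}
```

One caveat of operator[] is that merely reading `counts[word]` for an unknown word also inserts it, so use `find` when you only want to check whether a word is present.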

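This program is a small exercise in using the map container, and it also touches on file reading. Note that after opening an ifstream you must always check whether the open succeeded. An input stream can be read in several ways: formatted extraction with >>, line-at-a-time reading with getline, character-at-a-time reading with get, and read, which pulls in a large block of raw bytes in one call. Whichever you choose, you need some prior knowledge of what the file contains.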
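The four reading styles can be compared side by side. The wrappers below are hypothetical helpers written for this comparison; they take any `std::istream`, so they work the same on an `std::ifstream` or an `std::istringstream`.

```cpp
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// Formatted extraction: skips leading whitespace, stops before the next whitespace.
std::string ReadToken(std::istream& in) {
    std::string token;
    in >> token;
    return token;
}

// Line-oriented: reads up to the next '\n' and discards that newline.
std::string ReadLine(std::istream& in) {
    std::string line;
    std::getline(in, line);
    return line;
}

// Character-oriented: reads exactly one character, whitespace included.
char ReadChar(std::istream& in) {
    char c = '\0';
    in.get(c);
    return c;
}

// Block-oriented: asks for n raw bytes; gcount() reports how many actually arrived.
std::string ReadBlock(std::istream& in, std::size_t n) {
    if (n == 0) return std::string();
    std::string buf(n, '\0');
    in.read(&buf[0], static_cast<std::streamsize>(n));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}
```

Mixing the styles on one stream is legal, but each call resumes exactly where the previous one stopped, which is why knowing the file layout in advance matters.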

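If you need to load a large amount of data in one go, read is the recommended choice: a single bulk read is efficient, whereas pulling the data in piece by piece in a loop can be considerably slower.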
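As a rough sketch of a bulk read that is not limited by a fixed buffer size, the hypothetical helper `ReadWholeFile` below asks the stream for the file size and then fetches everything with one `read` call. It assumes the file fits in memory and is opened in binary mode so that the reported size matches the bytes read.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper: loads the whole file at `path` into `out` with a single read().
// Returns false if the file cannot be opened or fully read.
bool ReadWholeFile(const char* path, std::string& out) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) return false;                  // always check that the open succeeded

    in.seekg(0, std::ios::end);             // jump to the end to learn the size
    const std::streamoff size = in.tellg();
    if (size < 0) return false;
    in.seekg(0, std::ios::beg);

    out.assign(static_cast<std::size_t>(size), '\0');
    if (size > 0 && !in.read(&out[0], static_cast<std::streamsize>(size))) {
        return false;
    }
    return true;
}
```

The result is a `std::string` that owns its memory, so there is no buffer to free by hand and no terminator to place manually.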

Counting word frequencies in English text
