Part-of-Speech Tagging with Stop Words Using NLTK in Python

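The Natural Language Toolkit (NLTK) is a platform for building text analysis programs. One of the more powerful aspects of the NLTK module is part-of-speech tagging. NLTK must be installed to run the Python programs below; follow these installation steps: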


Open a terminal and run pip install nltk.
Type python at the command prompt to start the Python interactive shell, which can then execute the code.
Type import nltk
Then type nltk.download()

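A GUI will pop up. Choose "all" to download every package, then click "Download". This gives you all of the tokenizers, chunkers, other algorithms, and corpora, which is why the installation takes a long time. For example: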

import nltk
nltk.download()
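Downloading everything is not strictly necessary for the examples in this article. The following is a minimal sketch that fetches only the tokenizer models, the stop word lists, and the tagger model. The resource ids are an assumption: they depend on the NLTK version, and newer releases may expect `punkt_tab` and `averaged_perceptron_tagger_eng` instead. `RESOURCES` and `download_resources` are illustrative names, not part of NLTK.

```python
# Resource ids are an assumption; they vary between NLTK releases.
RESOURCES = ["punkt", "stopwords", "averaged_perceptron_tagger"]


def download_resources(names=RESOURCES):
    """Download each named NLTK resource and report which ones succeeded."""
    # Imported inside the function so this file loads even before NLTK is installed
    import nltk

    results = {}
    for name in names:
        # nltk.download returns a boolean success flag for a named resource
        results[name] = nltk.download(name, quiet=True)
    return results
```

Calling `download_resources()` once from the interactive shell returns a dictionary that maps each resource id to whether its download succeeded.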

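Let's quickly go over some vocabulary:
Corpus: a body of text (singular). Corpora is the plural.
Lexicon: words and their meanings.
Token: each "entity" that results from splitting text according to rules.
In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging or word-category disambiguation, is the process of marking each word in a text with its part of speech, based on both its definition and its context.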

Input: Everything is all about money.
Output: [('Everything', 'NN'), ('is', 'VBZ'), ('all', 'DT'), ('about', 'IN'), ('money', 'NN'), ('.', '.')]
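The tagger's result is a plain Python list of `(word, tag)` tuples, so it can be processed without NLTK. The sketch below copies the output shown above and picks out the nouns; `words_with_tag_prefix` is a hypothetical helper name.

```python
# Tagged output copied from the example above
tagged = [('Everything', 'NN'), ('is', 'VBZ'), ('all', 'DT'),
          ('about', 'IN'), ('money', 'NN'), ('.', '.')]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


# 'NN' matches every noun tag: NN, NNS, NNP, NNPS
nouns = words_with_tag_prefix(tagged, 'NN')
print(nouns)  # ['Everything', 'money']
```

Matching on a prefix rather than an exact tag is a design choice: it groups singular, plural, and proper nouns in one call.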

Here is a list of the tags, their meanings, and some examples:

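CC: coordinating conjunction
CD: cardinal digit
DT: determiner
EX: existential there (like "there is"; think of it as "there exists")
FW: foreign word
IN: preposition/subordinating conjunction
JJ: adjective, e.g. "big"
JJR: adjective, comparative, e.g. "bigger"
JJS: adjective, superlative, e.g. "biggest"
LS: list marker, e.g. 1)
MD: modal, e.g. could, will
NN: noun, singular, e.g. "desk"
NNS: noun, plural, e.g. "desks"
NNP: proper noun, singular, e.g. "Harrison"
NNPS: proper noun, plural, e.g. "Americans"
PDT: predeterminer, e.g. "all the kids"
POS: possessive ending, e.g. parent's
PRP: personal pronoun, e.g. I, he, she
PRP$: possessive pronoun, e.g. my, his, hers
RB: adverb, e.g. very, silently
RBR: adverb, comparative, e.g. better
RBS: adverb, superlative, e.g. best
RP: particle, e.g. give up
TO: to, e.g. go "to" the store
UH: interjection, e.g. errrm
VB: verb, base form, e.g. take
VBD: verb, past tense, e.g. took
VBG: verb, gerund/present participle, e.g. taking
VBN: verb, past participle, e.g. taken
VBP: verb, singular present, non-3rd person, e.g. take
VBZ: verb, 3rd person singular present, e.g. takes
WDT: wh-determiner, e.g. which
WP: wh-pronoun, e.g. who, what
WP$: possessive wh-pronoun, e.g. whose
WRB: wh-adverb, e.g. where, when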
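A tag is easier to read next to its description. As a minimal sketch, the dictionary below holds a hand-picked subset of the list above (it is not an NLTK data structure), and `describe` is a hypothetical helper that falls back to a placeholder for tags outside the subset.

```python
# Hand-picked subset of the tag list; extend it as needed
TAG_DESCRIPTIONS = {
    'NN': 'noun, singular',
    'NNS': 'noun, plural',
    'VBZ': 'verb, 3rd person singular present',
    'DT': 'determiner',
    'IN': 'preposition/subordinating conjunction',
    'JJ': 'adjective',
}


def describe(tagged_tokens):
    """Attach a human-readable description to each (word, tag) pair."""
    return [(word, tag, TAG_DESCRIPTIONS.get(tag, 'not in subset'))
            for word, tag in tagged_tokens]


for word, tag, meaning in describe([('Everything', 'NN'), ('is', 'VBZ'), ('.', '.')]):
    print(f"{word:<12}{tag:<5}{meaning}")
```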

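Text may contain stop words such as "the", "is", and "are". Stop words can be filtered out of the text to be processed. There is no universal list of stop words in NLP research, but the nltk module contains one. You can add your own stop words: go to your NLTK download directory -> corpora -> stopwords and update the stop word file for the language you are using. Here we use English (stopwords.words('english')).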
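Editing the stop word file is one option; another is to extend the set in code. The sketch below assumes a tiny hand-written base set so that it runs without any downloads; in practice `set(stopwords.words('english'))` would take its place. `remove_stop_words` is a hypothetical helper, and lowercasing each token before the lookup is a design choice so that capitalized words such as "The" are filtered too.

```python
# Stand-in for set(stopwords.words('english')); an assumption for illustration
BASE_STOP_WORDS = {'the', 'is', 'are', 'a', 'of'}


def remove_stop_words(tokens, stop_words, extra=()):
    """Drop tokens found in stop_words or extra, ignoring letter case."""
    all_stop_words = set(stop_words) | {w.lower() for w in extra}
    return [t for t in tokens if t.lower() not in all_stop_words]


tokens = ['The', 'weather', 'is', 'really', 'nice', 'today']
print(remove_stop_words(tokens, BASE_STOP_WORDS))
# ['weather', 'really', 'nice', 'today']
print(remove_stop_words(tokens, BASE_STOP_WORDS, extra=['really']))
# ['weather', 'nice', 'today']
```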

python

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))

# Dummy text. The parentheses let adjacent string literals
# concatenate across lines.
txt = ("Sukanya, Rajib and Naba are my good friends. "
       "Sukanya is getting married next year. "
       "Marriage is a big step in one’s life. "
       "It is both exciting and frightening. "
       "But friendship is a sacred bond between people. "
       "It is a special kind of love between us. "
       "Many of you must have tried searching for a friend "
       "but never found the right one.")

# sent_tokenize uses an instance of PunktSentenceTokenizer
# from the nltk.tokenize.punkt module

tokenized = sent_tokenize(txt)
for i in tokenized:

    # The word tokenizer finds the words
    # and punctuation in a string
    wordsList = word_tokenize(i)

    # Remove stop words from wordsList
    wordsList = [w for w in wordsList if w not in stop_words]

    # Use a part-of-speech tagger (POS tagger)
    tagged = nltk.pos_tag(wordsList)

    print(tagged)

Output:

[('Sukanya', 'NNP'), ('Rajib', 'NNP'), ('Naba', 'NNP'), ('good', 'JJ'), ('friends', 'NNS')]
[('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN')]
[('Marriage', 'NN'), ('big', 'JJ'), ('step', 'NN'), ('one', 'CD'), ('’', 'NN'), ('life', 'NN')]
[('It', 'PRP'), ('exciting', 'VBG'), ('frightening', 'VBG')]
[('But', 'CC'), ('friendship', 'NN'), ('sacred', 'VBD'), ('bond', 'NN'), ('people', 'NNS')]
[('It', 'PRP'), ('special', 'JJ'), ('kind', 'NN'), ('love', 'VB'), ('us', 'PRP')]
[('Many', 'JJ'), ('must', 'MD'), ('tried', 'VB'), ('searching', 'VBG'), ('friend', 'NN'), ('never', 'RB'), ('found', 'VBD'), ('right', 'RB'), ('one', 'CD')]
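Once every sentence has been tagged, the results can be summarized. As a sketch, the snippet below copies the first two tagged sentences from the output above and counts how often each tag occurs with `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# First two tagged sentences, copied from the output above
tagged_sentences = [
    [('Sukanya', 'NNP'), ('Rajib', 'NNP'), ('Naba', 'NNP'),
     ('good', 'JJ'), ('friends', 'NNS')],
    [('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'),
     ('next', 'JJ'), ('year', 'NN')],
]

# Flatten the sentences and count each tag
tag_counts = Counter(tag for sentence in tagged_sentences for _, tag in sentence)
print(tag_counts.most_common(2))  # [('NNP', 4), ('JJ', 2)]
```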

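Basically, the goal of a POS tagger is to assign linguistic (mostly grammatical) information to sub-sentential units. Such units are called tokens and, most of the time, correspond to words and symbols (e.g. punctuation).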
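To make the idea of a token concrete, here is a rough sketch that splits text into words and punctuation marks with a regular expression. It is a deliberate simplification and not how NLTK's `word_tokenize` works; `simple_tokenize` is a hypothetical name.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one non-word, non-space character
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Everything is all about money."))
# ['Everything', 'is', 'all', 'about', 'money', '.']
```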

