I. Tokenization with NLTK

Functions used:
nltk.sent_tokenize(text)  # split a text into sentences
nltk.word_tokenize(sent)  # split a sentence into words

II. Part-of-Speech Tagging with NLTK

Universal Part-of-Speech Tagset

Example (Chinese):
Segmented: “我/ 把/ 书/ 放/ 在/ 冰箱/ 上/” ("I put the book on the refrigerator")
=> After POS tagging: “我/n 把/v 书/n 放/v 在/v 冰箱/n 上/f”

Note: the tags listed in the table below (CC, CD, DT, ...) are Penn Treebank tags, which is what nltk.pos_tag returns for English by default. The universal tagset is the coarser one (NOUN, VERB, ADJ, ...) that is requested with tagset='universal'.

Tag     Meaning
CC      Coordinating conjunction
CD      Cardinal number
DT      Determiner
EX      Existential "there"
FW      Foreign word
IN      Preposition or subordinating conjunction
JJ      Adjective
JJR     Adjective, comparative
JJS     Adjective, superlative
LS      List item marker
MD      Modal
NN      Noun, singular or mass
NNS     Noun, plural
NNP     Proper noun, singular
NNPS    Proper noun, plural
PDT     Predeterminer
POS     Possessive ending
PRP     Personal pronoun
PRP$    Possessive pronoun
RB      Adverb
RBR     Adverb, comparative
RBS     Adverb, superlative
RP      Particle
SYM     Symbol (mathematical or scientific)
TO      "to"
UH      Interjection
VB      Verb, base form
-------------------------
Tagging one specific word by hand:

import nltk

tagged_token = nltk.tag.str2tuple('fly/NN')  # split 'word/TAG' into a (word, tag) tuple
print(tagged_token)

Exercise 1: Tag "can" as a noun.
------------------------------------------------------------
Using a Tagger

A part-of-speech tagger, or POS tagger, processes a sequence of words and attaches a part-of-speech tag to each one.

nltk.pos_tag(tokens)  # tokens is the result of tokenizing one sentence, so tagging is also done sentence by sentence
-----------------------------------------------------------------------
Example 1:

import nltk

sent = "And now for something completely different"
words = nltk.word_tokenize(sent)
sent_tag = nltk.pos_tag(words)
print(sent_tag)

Output:
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

Exercise 2: Write a few English sentences introducing yourself, then tag the words in them.

Tagged Corpora
http://www.cnblogs.com/yuxc/archive/2011/08/26/2155157.html
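A tagged corpus stores each token together with its tag, conventionally written as word/TAG. As a minimal sketch, the snippet below turns such a string into (word, tag) tuples with nltk.tag.str2tuple and back again with nltk.tag.tuple2str. The sample sentence is hand-written for illustration rather than taken from a corpus, so no corpus download is needed.

```python
import nltk

# A hand-written tagged sentence in word/TAG form (illustrative only).
tagged_text = "The/DT fly/NN landed/VBD on/IN the/DT table/NN ./."

# str2tuple splits each token at its last "/" and upper-cases the tag.
tagged_tokens = [nltk.tag.str2tuple(tok) for tok in tagged_text.split()]
print(tagged_tokens)

# tuple2str is the inverse: it joins a (word, tag) pair back into word/TAG.
round_trip = " ".join(nltk.tag.tuple2str(pair) for pair in tagged_tokens)
print(round_trip)
```

The list of tuples has the same shape as the output of nltk.pos_tag, so code written for one generally works on the other.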
-----------------------------------------------------------
A simple application of tagging: statistical analysis

1. Find out which kinds of words are used most in news text

import nltk
from nltk.corpus import brown

# Tagged words from the "news" category, mapped to the universal tagset
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
# Count how often each tag occurs
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
print(tag_fd.most_common())

2. Find the most frequent words for each noun tag

import nltk

def findtags(tag_prefix, tagged_text):
    # For every tag starting with tag_prefix, count the words carrying that tag
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    # Keep the five most common words per tag
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())

tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])

III. Information Extraction
--------------------------------
OrgName                 LocationName
Omnicom                 New York
DDB Needham             New York
Kaplan Thaler Group     New York
BBDO South              Atlanta
Georgia-Pacific         Atlanta
----------------------------------------
Query: companies that operate in Atlanta

OrgName
BBDO South
Georgia-Pacific
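Once the information is in tabular form, answering the Atlanta question is a simple filter. Here is a minimal sketch in plain Python, with the table above held as a list of (OrgName, LocationName) pairs; the names `locations` and `orgs_in` are chosen for this illustration.

```python
# The table above as (OrgName, LocationName) pairs.
locations = [
    ("Omnicom", "New York"),
    ("DDB Needham", "New York"),
    ("Kaplan Thaler Group", "New York"),
    ("BBDO South", "Atlanta"),
    ("Georgia-Pacific", "Atlanta"),
]

def orgs_in(city, table):
    """Return the organization names whose location equals `city`."""
    return [org for org, loc in table if loc == city]

print(orgs_in("Atlanta", locations))  # ['BBDO South', 'Georgia-Pacific']
```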
------------------------------------------------------------------
If the same information is embedded in running text instead:
The fourth Wells account moving to another agency is the packaged paper-products division of Georgia-Pacific Corp., which arrived at Wells only last fall. Like Hertz and the History Channel, it is also leaving for an Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. BBDO South in Atlanta, which handles corporate advertising for Georgia-Pacific, will assume additional duties for brands like Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia-Pacific in Atlanta.
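Getting from a passage like this to (organization, location) pairs is the hard part. As a deliberately naive sketch, and not the NLTK pipeline, the snippet below searches for the literal surface pattern "ORG in LOC" with a regular expression. It assumes two small hand-made name lists (`ORGS` and `LOCS`, hypothetical gazetteers made up for this example) and uses two sentences copied from the passage.

```python
import re

# Two sentences copied from the passage above.
text = (
    "Like Hertz and the History Channel, it is also leaving for an "
    "Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. "
    "BBDO South in Atlanta, which handles corporate advertising for "
    "Georgia-Pacific, will assume additional duties for brands like "
    "Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin, "
    "a spokesman for Georgia-Pacific in Atlanta."
)

# Hypothetical gazetteers: names we assume are already known in advance.
ORGS = ["BBDO South", "Georgia-Pacific", "Omnicom"]
LOCS = ["Atlanta", "New York"]

def find_org_in_loc(text, orgs, locs):
    """Return (org, loc) pairs for every literal 'ORG in LOC' occurrence."""
    org_alt = "|".join(re.escape(o) for o in orgs)
    loc_alt = "|".join(re.escape(l) for l in locs)
    pattern = re.compile(rf"\b({org_alt})\s+in\s+({loc_alt})\b")
    return pattern.findall(text)

print(find_org_in_loc(text, ORGS, LOCS))
```

A pattern this literal only finds names it already knows and misses any rephrasing, which is why a real system first segments, tags, and detects entities before looking for relations.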
Information Extraction Architecture (schematic diagram of the information extraction process)