I. Tokenization with NLTK
Functions used:
nltk.sent_tokenize(text) # split a text into sentences
nltk.word_tokenize(sent) # split a sentence into word tokens
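The two calls compose: a text is first split into sentences, and each sentence is then split into word tokens. The sketch below imitates that two-level split with plain regular expressions so that it runs without NLTK or its tokenizer models. The helper names simple_sent_tokenize and simple_word_tokenize are hypothetical, and the rules are much cruder than NLTK's (an abbreviation such as "Dr." would be split incorrectly).

```python
import re

def simple_sent_tokenize(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def simple_word_tokenize(sent):
    # A token is a word (optionally with internal - or ') or one punctuation mark.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sent)

text = "NLTK is a toolkit. It splits text into sentences! Does it work?"
sentences = simple_sent_tokenize(text)
tokens = [simple_word_tokenize(s) for s in sentences]

print(sentences)
print(tokens[0])
```

With NLTK installed and its tokenizer models downloaded, the same flow would call nltk.sent_tokenize(text) and then nltk.word_tokenize(sent) on each sentence.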
II. Part-of-Speech Tagging with NLTK
Universal Part-of-Speech Tagset
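Example (a Chinese sentence meaning "I put the book on the refrigerator"):
Segmented: "我/ 把/ 书/ 放/ 在/ 冰箱/ 上/" =>
After POS tagging: "我/n 把/v 书/n 放/v 在/v 冰箱/n 上/f" (n = noun, v = verb, f = locative word)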
Penn Treebank tags (the tags that nltk.pos_tag returns for English by default):
Tag     Description
CC      Coordinating conjunction
CD      Cardinal number
DT      Determiner
EX      Existential "there"
FW      Foreign word
IN      Preposition or subordinating conjunction
JJ      Adjective
JJR     Adjective, comparative
JJS     Adjective, superlative
LS      List item marker
MD      Modal
NN      Noun, singular or mass
NNS     Noun, plural
NNP     Proper noun, singular
NNPS    Proper noun, plural
PDT     Predeterminer
POS     Possessive ending
PRP     Personal pronoun
PRP$    Possessive pronoun
RB      Adverb
RBR     Adverb, comparative
RBS     Adverb, superlative
RP      Particle
SYM     Symbol (mathematical or scientific)
TO      "to"
UH      Interjection
VB      Verb, base form
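The fine-grained tags listed above can be collapsed into the coarser universal tagset used later in these notes. The sketch below does this with a small hand-written dictionary. It is illustrative only: PENN_TO_UNIVERSAL and to_universal are hypothetical names, and the mapping covers just a subset of the tags, with everything else falling back to 'X'.

```python
# Partial, hand-written mapping from Penn Treebank tags to universal tags.
# Illustrative only: it covers a subset of the tags listed above.
PENN_TO_UNIVERSAL = {
    'CC': 'CONJ', 'CD': 'NUM', 'DT': 'DET', 'IN': 'ADP',
    'JJ': 'ADJ', 'JJR': 'ADJ', 'JJS': 'ADJ', 'MD': 'VERB',
    'NN': 'NOUN', 'NNS': 'NOUN', 'NNP': 'NOUN', 'NNPS': 'NOUN',
    'PRP': 'PRON', 'PRP$': 'PRON',
    'RB': 'ADV', 'RBR': 'ADV', 'RBS': 'ADV', 'VB': 'VERB',
}

def to_universal(tagged_words):
    # Tags missing from the partial mapping fall back to 'X' (the "other" category).
    return [(word, PENN_TO_UNIVERSAL.get(tag, 'X')) for word, tag in tagged_words]

penn_tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
               ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
universal_tagged = to_universal(penn_tagged)
print(universal_tagged)
```

With NLTK installed and its universal_tagset resource downloaded, nltk.pos_tag(tokens, tagset='universal') performs this kind of conversion using NLTK's own complete mapping.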
-------------------------
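Tagging a specific word by hand:
import nltk
tagged_token = nltk.tag.str2tuple('fly/NN')  # parse a "word/TAG" string into a (word, tag) tuple
print(tagged_token)  # ('fly', 'NN')
Exercise 1: Tag the word "can" as a noun.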
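To show what parsing a "word/TAG" string involves, here is a minimal pure-Python sketch applied to a whole sentence of tagged tokens. parse_tagged_token is a hypothetical helper that mirrors the behavior of str2tuple as I understand it (split at the last separator, upper-case the tag); it is not NLTK's own code.

```python
def parse_tagged_token(s, sep='/'):
    # Split on the LAST separator so that words containing '/' stay intact.
    word, found, tag = s.rpartition(sep)
    if not found:
        return (s, None)  # no separator: the token is untagged
    return (word, tag.upper())

sent = "The/DT grand/JJ jury/NN commented/VBD ./."
tagged = [parse_tagged_token(t) for t in sent.split()]
print(tagged)
```

The same list comprehension works with nltk.tag.str2tuple in place of the helper.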
------------------------------------------------------------
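Using a Tagger
A part-of-speech tagger, or POS tagger, processes a sequence of words and attaches a part-of-speech tag to each word.
nltk.pos_tag(tokens) # tokens is the word-tokenized sentence; tagging is likewise done one sentence at a time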
-----------------------------------------------------------------------
Example 1:
import nltk
sent = "And now for something completely different"
words = nltk.word_tokenize(sent)  # split the sentence into word tokens
sent_tag = nltk.pos_tag(words)    # attach a POS tag to each token
print(sent_tag)
Output:
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
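To make the input and output shape of a tagger concrete, here is a toy lookup tagger: it takes a list of tokens and returns a list of (word, tag) pairs, like the output above. This is only a sketch under strong assumptions. LEXICON and lookup_tag are hypothetical names, the tags are assigned by hand, and nltk.pos_tag uses a trained statistical model rather than a fixed dictionary.

```python
# Toy lexicon: tags are hand-assigned for this example, not learned from data.
LEXICON = {
    'and': 'CC', 'now': 'RB', 'for': 'IN',
    'something': 'NN', 'completely': 'RB', 'different': 'JJ',
}

def lookup_tag(tokens, default='NN'):
    # Look each token up case-insensitively; unknown words get the default tag.
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]

tokens = "And now for something completely different".split()
tagged_sentence = lookup_tag(tokens)
print(tagged_sentence)

# An unknown word falls back to the default tag, which is often wrong.
print(lookup_tag(['something', 'unexpected']))
```

The fallback shows why a pure lookup is not enough: "unexpected" is an adjective, but the toy tagger labels it NN because it has never seen the word.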
Exercise 2: Write a few sentences in English introducing yourself, then tag the words.
Tagged Corpora: http://www.cnblogs.com/yuxc/archive/2011/08/26/2155157.html
-----------------------------------------------------------
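A simple application of tagging: frequency analysis
1. Find which word classes are used most in news text
import nltk
from nltk.corpus import brown
# Requires the 'brown' and 'universal_tagset' resources (see nltk.download).
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
print(tag_fd.most_common())  # (tag, count) pairs, most frequent first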
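The counting step can be tried without downloading the Brown corpus. The sketch below runs the same generator expression over a tiny hand-made list of (word, tag) pairs, using collections.Counter, which nltk.FreqDist subclasses. The sample data is invented for illustration and says nothing about real tag frequencies.

```python
from collections import Counter

# Tiny hand-made sample in (word, universal_tag) form, standing in for the Brown data.
tagged_sample = [
    ('The', 'DET'), ('jury', 'NOUN'), ('praised', 'VERB'), ('the', 'DET'),
    ('city', 'NOUN'), ('election', 'NOUN'), ('.', '.'),
]

# Count tags only, ignoring the words, exactly as in the FreqDist call above.
tag_counts = Counter(tag for (word, tag) in tagged_sample)
print(tag_counts.most_common(2))
```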
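2. Find the most frequent words for each noun tag (tags starting with NN)
import nltk
def findtags(tag_prefix, tagged_text):
    # Condition on the tag: count the words seen with each tag that starts with tag_prefix.
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text if tag.startswith(tag_prefix))
    # For every matching tag, keep its five most common words.
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())
tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])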
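A conditional frequency distribution is, in effect, one word counter per tag. The sketch below reproduces the idea with defaultdict(Counter) on a few invented (word, tag) pairs, so it runs without NLTK or the Brown corpus. find_tags and the sample data are hypothetical stand-ins, not the function or the corpus used above.

```python
from collections import Counter, defaultdict

def find_tags(tag_prefix, tagged_text, n=2):
    # One Counter of words per tag, restricted to tags with the given prefix.
    words_by_tag = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            words_by_tag[tag][word] += 1
    return {tag: counts.most_common(n) for tag, counts in words_by_tag.items()}

sample = [
    ('year', 'NN'), ('years', 'NNS'), ('year', 'NN'), ('time', 'NN'),
    ('Mr.', 'NNP'), ('members', 'NNS'), ('years', 'NNS'), ('said', 'VBD'),
]
tagdict = find_tags('NN', sample)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
```

The verb tagged VBD is filtered out by the prefix test, so only NN, NNP and NNS appear as keys.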
III. Information Extraction
--------------------------------
OrgName               LocationName
Omnicom               New York
DDB Needham           New York
Kaplan Thaler Group   New York
BBDO South            Atlanta
Georgia-Pacific       Atlanta
----------------------------------------
Companies that operate in Atlanta:
OrgName
BBDO South
Georgia-Pacific
------------------------------------------------------------------
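When the data is already structured like the table above, the Atlanta query is a one-line filter. A minimal sketch, assuming the table is held as a list of (organization, location) pairs; the variable names are made up for the example:

```python
# The table above as (organization, location) pairs.
locations = [
    ('Omnicom', 'New York'),
    ('DDB Needham', 'New York'),
    ('Kaplan Thaler Group', 'New York'),
    ('BBDO South', 'Atlanta'),
    ('Georgia-Pacific', 'Atlanta'),
]

# Keep only the organizations whose location is Atlanta.
in_atlanta = [org for (org, loc) in locations if loc == 'Atlanta']
print(in_atlanta)
```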
But what if the information is embedded in free text instead?
The fourth Wells account moving to another agency is the packaged paper-products division of Georgia-Pacific Corp., which arrived at Wells only last fall. Like Hertz and the History Channel, it is also leaving for an Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. BBDO South in Atlanta, which handles corporate advertising for Georgia-Pacific, will assume additional duties for brands like Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia-Pacific in Atlanta.
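As a rough illustration of turning such text back into structured pairs, the sketch below applies one naive regular expression ("capitalized phrase, the word in, capitalized word") to the last sentence of the passage. This is an assumption-laden toy, not the NLTK pipeline: a real system would segment, tokenize, tag and chunk the text before looking for relations, and this pattern would also wrongly match any sentence-initial capitalized word followed by "in".

```python
import re

sentence = ("BBDO South in Atlanta, which handles corporate advertising for "
            "Georgia-Pacific, will assume additional duties for brands like "
            "Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin, "
            "a spokesman for Georgia-Pacific in Atlanta.")

# Naive pattern: one or more capitalized words, then " in ", then a capitalized word.
pattern = re.compile(r"([A-Z][\w-]*(?: [A-Z][\w-]*)*) in ([A-Z][a-z]+)")

# Each match yields an (organization, location) pair.
pairs = pattern.findall(sentence)
print(pairs)

# Answer the earlier query from the extracted pairs.
in_atlanta = sorted({org for org, loc in pairs if loc == 'Atlanta'})
print(in_atlanta)
```

Names such as "Ken Haldin" and "Angel Soft" are skipped only because they are not followed by "in", which shows how fragile a surface pattern is compared with proper entity and relation detection.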
Information Extraction Architecture (diagram of the information extraction pipeline)
