Lesson 18 Named Entity Recognition & Relation Extraction
Source: 陈仕鸿 (Chen Shihong), Guangdong University of Foreign Studies
2018-06-04

1. Named Entity Recognition (NER)

NE Type                       Examples
ORGANIZATION                  Georgia-Pacific Corp., WHO
PERSON                        Eddy Bonte, President Obama
LOCATION                      Murray River, Mount Everest
DATE                          June, 2008-06-29
TIME                          two fifty a m, 1:30 p.m.
MONEY                         175 million Canadian Dollars, GBP 10.40
PERCENT                       twenty pct, 18.75 %
FACILITY                      Washington Monument, Stonehenge
GPE (geo-political entity)    South East Asia, Midlothian

import nltk

# Requires NLTK's tokenizer, POS tagger, NE chunker and word-list data (install via nltk.download()).

s = """The fourth Wells account moving to another agency is the packaged paper-products division of Georgia-Pacific Corp., which arrived at Wells only last fall. Like Hertz and the History Channel, it is also leaving for an Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. BBDO South in Atlanta, which handles corporate advertising for Georgia-Pacific, will assume additional duties for brands like Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia-Pacific in Atlanta."""

s_w = nltk.word_tokenize(s)    # tokenize into words
s_tag = nltk.pos_tag(s_w)      # part-of-speech tagging
print(nltk.ne_chunk(s_tag))    # ne_chunk performs named entity recognition
# With binary=True every entity is labelled NE; otherwise its specific type is shown.
# print(nltk.ne_chunk(s_tag, binary=True))
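The tree printed by ne_chunk is easier to work with once the entities are pulled out as (type, text) pairs. The sketch below is one possible way to do that, not part of the original lesson. It assumes the tree has first been flattened to (word, POS, IOB-tag) triples with nltk.chunk.tree2conlltags(nltk.ne_chunk(s_tag)). The helper name iob_to_entities is hypothetical, and the sample triples are hand-written for illustration rather than real chunker output.

```python
def iob_to_entities(triples):
    """Group (word, pos, iob) triples into (label, text) pairs.

    `triples` is assumed to have the shape produced by
    nltk.chunk.tree2conlltags, e.g. ('Ken', 'NNP', 'B-PERSON').
    """
    entities = []
    label, words = None, []
    for word, _pos, iob in triples:
        starts_new = iob.startswith('B-') or (
            iob.startswith('I-') and label != iob[2:])
        if starts_new:
            # Close the entity in progress, then open a new one.
            if words:
                entities.append((label, ' '.join(words)))
            label, words = iob[2:], [word]
        elif iob.startswith('I-'):
            # Continuation of the current entity.
            words.append(word)
        else:
            # 'O' tag: outside any entity, so close what is open.
            if words:
                entities.append((label, ' '.join(words)))
            label, words = None, []
    if words:
        entities.append((label, ' '.join(words)))
    return entities


# Hand-written sample triples (illustrative only, not real model output).
sample = [
    ('Ken', 'NNP', 'B-PERSON'), ('Haldin', 'NNP', 'I-PERSON'),
    (',', ',', 'O'), ('a', 'DT', 'O'), ('spokesman', 'NN', 'O'),
    ('for', 'IN', 'O'), ('Georgia-Pacific', 'NNP', 'B-ORGANIZATION'),
    ('in', 'IN', 'O'), ('Atlanta', 'NNP', 'B-GPE'),
]
entities = iob_to_entities(sample)
print(entities)
```

With real data, the call would be iob_to_entities(nltk.chunk.tree2conlltags(nltk.ne_chunk(s_tag))). The entity types returned then depend on the chunker model.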


Exercise: Following the example above, perform NER on the text below.

Guangdong University of Foreign Studies (GDUFS) is a major internationalized university in South China for its global-minded faculty/students and its research on international languages, literature, culture, trade and strategic studies. 

Dating back to 1965 when the Guangzhou Institute of Foreign Languages was established and 1980 when the Guangzhou Institute of Foreign Trade was founded, the University had its present form by merging the two in 1995, with the Guangdong College of Finance and Economics incorporated into the University in 2008. The University has three campuses with a total area of 153 hectares: the North Campus at the foot of the Baiyun Mountain, the South Campus in Guangzhou Higher Education Mega Center, and Dalang Campus.


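2. Relation Extraction

Once the named entities in a text have been identified, relation extraction can be applied to pull out information. One approach is to look for all triples of the form (X, a, Y), where X and Y are named entities and a is the string that expresses the relation between them. For example: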


import nltk, re

# Requires the ieer corpus (nltk.download('ieer')).

IN = re.compile(r'.*\bin\b')  # precompiled regular expression: matches text containing the word "in"

for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))
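To make the (X, a, Y) search concrete without downloading the ieer corpus, here is a minimal pure-Python sketch of the same idea. It is a simplification for illustration, not how nltk.sem.extract_rels is implemented. The function name extract_triples and the chunk layout (plain strings for ordinary text, (label, text) tuples for entities) are assumptions made here. It pairs each entity only with the next entity that follows it, and keeps the pair when the types match and the text in between matches the pattern.

```python
import re

# Same pattern as in the lesson: the text between X and Y must contain "in".
IN = re.compile(r'.*\bin\b')


def extract_triples(chunks, subj_type, obj_type, pattern):
    """Return (X, a, Y) triples from a list of chunks.

    Each chunk is either a plain string (ordinary text) or a
    (label, text) tuple (a named entity). This layout is a
    hypothetical simplification, not NLTK's own data structure.
    """
    triples = []
    prev_entity = None   # most recent entity seen so far
    filler = []          # text seen since prev_entity
    for chunk in chunks:
        if isinstance(chunk, tuple):
            if prev_entity is not None:
                label_x, text_x = prev_entity
                label_y, text_y = chunk
                between = ' '.join(filler)
                if (label_x == subj_type and label_y == obj_type
                        and pattern.match(between)):
                    triples.append((text_x, between, text_y))
            prev_entity = chunk
            filler = []
        else:
            filler.append(chunk)
    return triples


# Hand-labelled fragment of the sentence from Section 1 (illustrative only).
chunks = [
    ('ORG', 'BBDO South'), 'in', ('LOC', 'Atlanta'),
    ', which handles corporate advertising for',
    ('ORG', 'Georgia-Pacific'), ', will assume additional duties',
]
triples = extract_triples(chunks, 'ORG', 'LOC', IN)
print(triples)
```

Only the BBDO South / Atlanta pair is kept here: it is the only ORG followed by a LOC, and the text between them is the word "in".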


3. BosonNLP
https://bosonnlp.com/

An open platform for Chinese semantic analysis

