1. Named Entity Recognition (NER)
NE Type | Examples |
---|---|
ORGANIZATION | Georgia-Pacific Corp., WHO |
PERSON | Eddy Bonte, President Obama |
LOCATION | Murray River, Mount Everest |
DATE | June, 2008-06-29 |
TIME | two fifty a m, 1:30 p.m. |
MONEY | 175 million Canadian Dollars, GBP 10.40 |
PERCENT | twenty pct, 18.75 % |
FACILITY | Washington Monument, Stonehenge |
GPE (geo-political entity) | South East Asia, Midlothian |
import nltk  # the tokenizer, POS tagger and NE chunker data must be installed first via nltk.download()

s = """The fourth Wells account moving to another agency is the packaged paper-products division of Georgia-Pacific Corp., which arrived at Wells only last fall. Like Hertz and the History Channel, it is also leaving for an Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. BBDO South in Atlanta, which handles corporate advertising for Georgia-Pacific, will assume additional duties for brands like Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia-Pacific in Atlanta."""
s_w = nltk.word_tokenize(s)    # tokenize the text into words
s_tag = nltk.pos_tag(s_w)      # tag each token with its part of speech
print(nltk.ne_chunk(s_tag))    # ne_chunk performs named entity recognition on the tagged tokens
# print(nltk.ne_chunk(s_tag, binary=True))  # with binary=True every entity is labeled NE; otherwise the specific type is shown
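The printed result of ne_chunk is a tree in which each recognized entity is a labeled chunk grouping its (word, tag) leaves. The sketch below shows one way to turn such a structure into a flat list of (type, text) pairs. To keep it runnable without NLTK's models, it uses a hand-built stand-in made of plain tuples and lists; the `chunked` data, its hand-assigned labels, and the `collect_entities` helper are all hypothetical and are not actual ne_chunk output.

```python
# Stand-in for a chunk tree: each item is either a plain (word, tag) pair
# or a labeled chunk of the form (entity_label, [(word, tag), ...]).
# This shape is an assumption for illustration, not NLTK's own Tree class,
# and the labels below were assigned by hand.
chunked = [
    ("ORGANIZATION", [("BBDO", "NNP"), ("South", "NNP")]),
    ("in", "IN"),
    ("GPE", [("Atlanta", "NNP")]),
    (",", ","),
    ("which", "WDT"),
    ("handles", "VBZ"),
]


def collect_entities(items):
    """Return a (label, text) pair for every labeled chunk, in order."""
    entities = []
    for head, rest in items:
        if isinstance(rest, list):  # labeled chunk, not a plain (word, tag) pair
            text = " ".join(word for word, _tag in rest)
            entities.append((head, text))
    return entities


entities = collect_entities(chunked)
print(entities)
```

With a real nltk.Tree the same idea would use the tree's own methods (subtrees(), label(), leaves()) instead of the isinstance check.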
Exercise: following the example above, perform NER on the text below.
Guangdong University of Foreign Studies (GDUFS) is a major internationalized university in South China for its global-minded faculty/students and its research on international languages, literature, culture, trade and strategic studies.
Dating back to 1965 when the Guangzhou Institute of Foreign Languages was established and 1980 when the Guangzhou Institute of Foreign Trade was founded, the University had its present form by merging the two in 1995, with the Guangdong College of Finance and Economics incorporated into the University in 2008. The University has three campuses with a total area of 153 hectares: the North Campus at the foot of the Baiyun Mountain, the South Campus in Guangzhou Higher Education Mega Center, and Dalang Campus.
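2. Relation Extraction
Once the named entities in a text have been identified, relations between them can be extracted. One approach is to look for all triples of the form (X, a, Y), where X and Y are named entities and a is the string that expresses the relation between them. For example: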
import nltk, re

IN = re.compile(r'.*\bin\b')  # precompiled regular expression that matches the word "in"
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    # find ORG ... LOC pairs whose connecting text matches the pattern
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))
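The (X, a, Y) search can also be sketched in plain Python, which makes the matching step explicit. This is a simplified illustration under stated assumptions, not how nltk.sem.extract_rels is implemented: the `segments` list is hand-annotated sample data, and `extract_triples` is a hypothetical helper that only looks at an entity, the filler text right after it, and the next entity.

```python
import re

# Hypothetical hand-annotated input: each segment is either an entity
# (entity_type, text) or filler text between entities (None, text).
segments = [
    ("ORG", "BBDO South"),
    (None, "in"),
    ("LOC", "Atlanta"),
    (None, ", which handles corporate advertising for"),
    ("ORG", "Georgia-Pacific"),
    (None, "in"),
    ("LOC", "Atlanta"),
]

IN = re.compile(r'.*\bin\b')  # same pattern as above: filler containing the word "in"


def extract_triples(segs, subj_type, obj_type, pattern):
    """Collect (X, a, Y) where X and Y have the requested types and the
    filler a directly between them matches the pattern."""
    triples = []
    for i in range(len(segs) - 2):
        (x_type, x), (a_type, a), (y_type, y) = segs[i:i + 3]
        if (x_type == subj_type and a_type is None
                and y_type == obj_type and pattern.match(a)):
            triples.append((x, a, y))
    return triples


triples = extract_triples(segments, "ORG", "LOC", IN)
print(triples)
```

Changing the pattern or the two entity types selects a different relation; for instance, passing "LOC" and "ORG" with the same data yields no triples, because the filler between Atlanta and Georgia-Pacific does not contain the word "in".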
3. BosonNLP
https://bosonnlp.com/
An open platform for Chinese semantic analysis.