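Problems:
Unprecedented amount, diverse sources, heterogeneous formats
Acquire data from different sources through different methods
Apply multiple techniques to mine useful information from the data
Further integrate the mined information with existing structured data (entity linking)
Natural language interface: TR Discover; natural language questions are translated into executable queries for answer retrieval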
1) How to process and mine useful information from large amounts of unstructured and structured data
2) How to integrate the mined information for the same entity across disconnected data sources and store it in a manner that allows easy and efficient access
3) How to quickly find the entities that satisfy the information needs of today’s knowledge workers
Ingest and consume the data in a scalable manner. The data ingestion process needs to be robust enough to process all types of data
Add structure to free-text documents (patent filings, financial reports, academic publications, etc.)
Cannot leave this data sitting in separate “silos”; the data must be integrated
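Entity-Relationship (ER) model: mature technology, but hard to update quickly and supports only keyword queries
RDF model: flexible; represents data as triples with no fixed schema; RDF allows modeling more expressive semantics of the data and can be used for knowledge inference
Keyword queries: cannot accurately express the user's query intent, especially for questions involving relationships or other restrictions such as time constraints
Specialized query languages (SQL and SPARQL): require a technical background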
Structured data: link each entity in the data to the relevant nodes in our graph and update the information of the linked nodes
Unstructured data: first perform information extraction to extract the entities and their relationships with other entities; the extracted structured data is then integrated into our knowledge graph
Named Entity Recognition: uses natural language processing techniques that include both rule-based and machine learning algorithms
Relation Extraction: a machine learning classifier predicts the probability of a possible relationship for a given pair of identified entities in a given sentence
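The classifier interface described above (a sentence plus a pair of identified entities in, a probability out) can be sketched as follows. This is only an illustration: the feature set (words between the two entities), the `WEIGHTS` table, `BIAS`, and the function names are all hypothetical, and the weights are set by hand where a real classifier would learn them from labeled sentences.

```python
import math
import re
from typing import List

# Hand-set weights for illustration only (assumption); a real classifier
# would learn these from labeled training sentences.
WEIGHTS = {"acquired": 2.5, "bought": 2.0, "rumored": -1.5, "denied": -2.0}
BIAS = -1.0


def words_between(sentence: str, e1: str, e2: str) -> List[str]:
    """Return the lowercased words located between the two entity mentions."""
    start1, start2 = sentence.find(e1), sentence.find(e2)
    if start1 == -1 or start2 == -1:
        return []
    first, second = sorted([(start1, start1 + len(e1)), (start2, start2 + len(e2))])
    return re.findall(r"[a-z]+", sentence[first[1]:second[0]].lower())


def acquisition_probability(sentence: str, e1: str, e2: str) -> float:
    """Score an 'acquisition' relation for (e1, e2) with a logistic function."""
    score = BIAS + sum(WEIGHTS.get(w, 0.0) for w in words_between(sentence, e1, e2))
    return 1.0 / (1.0 + math.exp(-score))


p_positive = acquisition_probability("Acme acquired Apex in 2015.", "Acme", "Apex")
p_negative = acquisition_probability("Acme and Apex are rivals.", "Acme", "Apex")
```

A downstream step would typically keep only the pairs whose probability exceeds a chosen threshold before adding the relation to the graph.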
Entity linking: match the attribute values of the nodes in the graph against those of a new entity
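A minimal sketch of attribute-based matching, assuming each node and each new entity is a flat dictionary of string attributes. The similarity measure (`difflib.SequenceMatcher`), the averaging over shared attributes, and the `threshold` value are assumptions chosen for illustration, not the method used in the paper.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional


def attribute_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_entity(
    new_entity: Dict[str, str],
    graph_nodes: Dict[str, Dict[str, str]],
    threshold: float = 0.8,  # hypothetical cut-off
) -> Optional[str]:
    """Return the id of the best-matching node, or None if nothing is close enough."""
    best_id, best_score = None, 0.0
    for node_id, attrs in graph_nodes.items():
        shared = set(attrs) & set(new_entity)
        if not shared:
            continue  # nothing to compare on
        score = sum(attribute_similarity(attrs[k], new_entity[k]) for k in shared) / len(shared)
        if score > best_score:
            best_id, best_score = node_id, score
    return best_id if best_score >= threshold else None


nodes = {
    "org:1": {"name": "Acme Corporation", "city": "London"},
    "org:2": {"name": "Apex Corp", "city": "Paris"},
}
linked = link_entity({"name": "ACME Corporation", "city": "London"}, nodes)
unlinked = link_entity({"name": "Zenith Ltd", "city": "Tokyo"}, nodes)
```

When no node clears the threshold, the new entity would be added as a new node instead of being merged into an existing one.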
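RDF is commonly described as a directed, labeled graph, but it is also a set of triples, each consisting of a subject, a predicate, and an object. Triples are stored in a triple store and queried with the SPARQL query language. Representing data as triples still requires a model (as with relational databases), but RDF supports the expression of rich semantics and knowledge inference. Another major advantage of the RDF model is that data can be deleted and updated more easily.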
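To make the triple representation concrete, here is a minimal in-memory sketch. The `TripleStore` class and the `ex:` identifiers are hypothetical; a production system would use a real triple store queried with SPARQL.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleStore:
    """A tiny in-memory set of triples with wildcard pattern matching."""

    def __init__(self, triples: Iterable[Triple] = ()) -> None:
        self._triples: Set[Triple] = set(triples)

    def add(self, s: str, p: str, o: str) -> None:
        self._triples.add((s, p, o))

    def remove(self, s: str, p: str, o: str) -> None:
        self._triples.discard((s, p, o))

    def match(
        self, s: Optional[str] = None, p: Optional[str] = None, o: Optional[str] = None
    ) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        for t in self._triples:
            if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
                yield t


store = TripleStore()
store.add("ex:Acme", "rdf:type", "ex:Company")
store.add("ex:Acme", "ex:headquarteredIn", "ex:London")
store.add("ex:Alice", "ex:worksFor", "ex:Acme")

# Updating a fact is a delete plus an insert; no schema change is involved.
store.remove("ex:Acme", "ex:headquarteredIn", "ex:London")
store.add("ex:Acme", "ex:headquarteredIn", "ex:Toronto")
```

Calling `store.match(s="ex:Acme")` yields every triple about `ex:Acme`, which mirrors a SPARQL pattern with a fixed subject and variable predicate and object.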
Index the triples on their subject, predicate and object respectively with Elasticsearch
build a full-text search index on objects that are literal values, where such literal values are tokenized and treated as terms in the index.
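The idea of a full-text index over literal objects can be sketched with a plain inverted index. This is not the Elasticsearch API: the tokenizer, the convention that literals are written in double quotes, and all function names are assumptions made for this example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_literal(obj: str) -> bool:
    # Assumption for this sketch: literal values are wrapped in double quotes.
    return len(obj) >= 2 and obj.startswith('"') and obj.endswith('"')


def build_literal_index(triples: Iterable[Triple]) -> Dict[str, Set[Triple]]:
    """Map each term of a literal object to the triples that contain it."""
    index: Dict[str, Set[Triple]] = defaultdict(set)
    for triple in triples:
        if is_literal(triple[2]):
            for term in tokenize(triple[2]):
                index[term].add(triple)
    return index


def search(index: Dict[str, Set[Triple]], query: str) -> Set[Triple]:
    """Return the triples whose literal object contains every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


triples = [
    ("ex:Acme", "rdfs:label", '"Acme Pharmaceuticals Inc"'),
    ("ex:Apex", "rdfs:label", '"Apex Pharmaceuticals"'),
    ("ex:Acme", "ex:competitor", "ex:Apex"),  # object is a node, so it is not indexed
]
literal_index = build_literal_index(triples)
```

Here `search(literal_index, "acme pharmaceuticals")` returns only the first triple, because both terms must appear in the same literal.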
auto-suggest mechanism (help users to complete their questions)
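One simple way to help users complete a question is prefix completion over known phrases. The sketch below assumes suggestions come from a fixed phrase list; the `PrefixSuggester` class and the sample phrases are hypothetical, and the mechanism in TR Discover may work differently.

```python
from bisect import bisect_left
from typing import Iterable, List


class PrefixSuggester:
    """Suggest known phrases that start with what the user has typed so far."""

    def __init__(self, phrases: Iterable[str]) -> None:
        # Sorting lets us jump to the first candidate with a binary search.
        self._phrases = sorted({p.lower() for p in phrases})

    def suggest(self, prefix: str, limit: int = 5) -> List[str]:
        if limit <= 0:
            return []
        prefix = prefix.lower()
        start = bisect_left(self._phrases, prefix)
        results: List[str] = []
        for phrase in self._phrases[start:]:
            if not phrase.startswith(prefix):
                break  # sorted order: no later phrase can match
            results.append(phrase)
            if len(results) == limit:
                break
        return results


suggester = PrefixSuggester([
    "drugs developed by",
    "drugs targeting",
    "companies headquartered in",
    "companies developing drugs",
])
```

Typing `drugs` would surface both `drugs developed by` and `drugs targeting`, in alphabetical order.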
Map the user's natural language question to an intermediate language, then translate the intermediate language into a standard query language
Steps: Question Understanding --> Enabling Question Completion with Auto-suggest --> Question Translation and Execution
The FOL representation of a natural language question is further translated to an executable query
1. Parse the FOL representation into a parse tree using an FOL parser
2. Perform an in-order traversal of the FOL parse tree and translate it to an executable query
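The second step can be sketched as follows, assuming the parser has already produced a tree of conjunctions and predicates. The node classes, the variable convention, the `http://example.org/` namespace, and the mapping of unary predicates to `rdf:type` are assumptions for illustration; the FOL parser itself is omitted and the tree is built by hand.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Pred:
    name: str
    args: Tuple[str, ...]  # one argument = type predicate, two = relation


@dataclass
class And:
    left: "Node"
    right: "Node"


Node = Union[Pred, And]


def term(arg: str) -> str:
    # Assumption: single lowercase letters are FOL variables, anything else a constant.
    return f"?{arg}" if len(arg) == 1 and arg.islower() else f":{arg}"


def collect_patterns(node: Node, out: List[str]) -> None:
    """In-order traversal: left subtree, then the node, then the right subtree."""
    if isinstance(node, And):
        collect_patterns(node.left, out)
        # The conjunction itself adds no triple; SPARQL joins patterns with '.'.
        collect_patterns(node.right, out)
    elif len(node.args) == 1:
        out.append(f"{term(node.args[0])} rdf:type :{node.name}")
    elif len(node.args) == 2:
        out.append(f"{term(node.args[0])} :{node.name} {term(node.args[1])}")
    else:
        raise ValueError(f"unsupported arity for predicate {node.name}")


def to_sparql(tree: Node, select_var: str, namespace: str = "http://example.org/") -> str:
    patterns: List[str] = []
    collect_patterns(tree, patterns)
    body = " .\n  ".join(patterns)
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        f"PREFIX : <{namespace}>\n"
        f"SELECT DISTINCT ?{select_var} WHERE {{\n  {body} .\n}}"
    )


# Hand-built tree for the simplified formula: Drug(x) & developedBy(x, Acme)
tree = And(Pred("Drug", ("x",)), Pred("developedBy", ("x", "Acme")))
query = to_sparql(tree, "x")
```

Printing `query` shows a `SELECT DISTINCT ?x` query whose `WHERE` clause lists one triple pattern per predicate, in the order the traversal visited them.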
In this paper, we present our effort in building and querying Thomson Reuters’ knowledge graph. Data in heterogeneous formats is first acquired from various sources. We then develop named entity recognition, relation extraction and entity linking techniques for mining information from the data and integrating the mined data across different sources. We model and store our data in RDF triples, and present TR Discover that enables users to search for information with natural language questions. We evaluate and demonstrate the practicability of our knowledge graph. In future work, we would like to enhance our NLP algorithms in order to cover more domains. Also, rather than relying on a pre-defined grammar for understanding natural language questions, we will explore the possibility of developing a more flexible question parser. Finally, we will deploy our knowledge graph to more products and improve our various services according to customer feedback.