Knowledge Base Population from External Data Sources

PhD Thesis Proposal Defence


Title: "Knowledge Base Population from External Data Sources"

by

Miss Xueling LIN


Abstract:

Many openly available knowledge bases (KBs) have been constructed to support 
knowledge-centric applications such as search engines and online 
recommendation. However, most of these KBs are incomplete, since they are not 
kept synchronized with facts emerging in the real world. Knowledge base 
population (KBP) from external data sources, which extracts knowledge from 
unstructured text to populate KBs, has therefore become a vital task. Recent 
research proposes two types of solutions that partially address this problem, 
but their performance is limited. The first, dynamic KB construction from 
unstructured text, requires specifying in advance which predicates are of 
interest to the KB; this preliminary setup makes it unsuitable for timely 
population. The second, Open Information Extraction (Open IE) from 
unstructured text, is limited in its ability to produce facts that can be 
linked directly to the target KB without redundancy or ambiguity.

In this proposal, we investigate the end-to-end KBP task from unstructured text 
in external data sources with the support of Open IE. The task comprises two 
major research problems. First, we study the knowledge fusion problem, which 
aims to determine the most complete and accurate aggregated facts from diverse 
and conflicting data sources. We propose DART, an integrated Bayesian approach 
that incorporates the domain expertise of the data sources to infer the 
multiple possible truths of a fact. Second, we investigate the 
knowledge linking problem, which disambiguates the entities and relations 
extracted in the facts jointly, and links them to the existing concepts in the 
current KBs. We propose KBPearl as a solution under the global coherence 
assumption that all the entities and predicates mentioned in the same 
short-text document are densely related to each other. Specifically, we employ 
a semantic graph-based approach to capture the knowledge in the source 
document, and to determine the best linking results by finding the densest 
subgraph effectively and efficiently. We further propose TENET as a 
solution under the sparse coherence assumption that not every pair of entities 
or predicates in a long-text document is strongly related to each other. 
Specifically, we formulate the joint entity and relation linking task as a 
minimum-cost coherence rooted tree cover problem, and propose approximation 
algorithms with pruning strategies to address this problem.

We demonstrate the effectiveness and efficiency of the proposed solutions to 
each of the above problems against state-of-the-art techniques through 
extensive experiments on real-world datasets. Finally, we conclude the 
thesis proposal with future research directions and challenges related to the 
KBP task.


Date:			Wednesday, 31 March 2021

Time:			4:00pm - 6:00pm

Zoom Meeting:		https://hkust.zoom.us/j/5266517832

Committee Members:	Prof. Lei Chen (Supervisor)
  			Dr. Yangqiu Song (Chairperson)
 			Dr. Qiong Luo
 			Prof. Raymond Wong


**** ALL are Welcome ****