Knowledge Base Population from External Data Sources

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Knowledge Base Population from External Data Sources"

By

Miss Xueling LIN


Abstract

Many openly available knowledge bases (KBs) have been constructed to
support knowledge-centric applications such as search engines and online
recommendations. However, most openly available KBs are incomplete because
they are not kept in sync with facts emerging in the real world.
Knowledge base population (KBP) from external data sources, which extracts
knowledge from unstructured text to populate KBs, has therefore become a
vital task. Recent research proposes two types of solutions that partially
address this problem, but their performance is limited. The first
solution, dynamic KB construction from unstructured text, requires
specifying in advance which predicates are of interest to the KB; this
preliminary setup makes it unsuitable for timely population. The second
solution, Open Information Extraction (Open IE) from unstructured text,
struggles to produce facts that can be linked directly to the target KB
without redundancy and ambiguity.

In this thesis, we investigate the end-to-end KBP task from unstructured
text in external data sources with the support of Open IE, which comprises
three major research problems. First, we address the knowledge
canonicalization problem, which jointly canonicalizes the noun phrases and
relational phrases in Open IE triples to remove redundant and ambiguous
facts. We propose SIST, an efficient canonicalization model that leverages
side information from the context of the original data sources. Second, we
study the knowledge fusion problem, which aims to determine the most
complete and accurate aggregated facts from diverse and conflicting data
sources. We propose DART, an integrated Bayesian approach that
incorporates the domain expertise of the data sources to infer the
multiple possible truths of a fact. Third, we investigate the knowledge
linking problem, which jointly disambiguates the entities and relations in
the extracted facts and links them to existing concepts in the current
KBs. We propose
KBPearl as a solution under the global coherence assumption that all the 
entities and predicates mentioned in the same short-text document are 
densely related to each other. Specifically, we employ a semantic
graph-based approach to capture the knowledge in the source document and
determine the best linking results by efficiently finding the densest
subgraph. We also propose TENET as a solution
under the sparse coherence assumption that not every pair of entities or 
predicates in a long-text document is strongly related to each other. 
Specifically, we formulate the joint entity and relation linking task as a 
minimum-cost rooted tree cover problem on the knowledge coherence graph 
constructed based on the document, and propose approximation algorithms 
with pruning strategies to address this problem and derive the linking 
results.
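
As a rough illustration of the densest-subgraph idea behind the global
coherence assumption, the sketch below applies the classical greedy
peeling heuristic to a small coherence graph. It is not the KBPearl
algorithm itself, whose details are not given in this abstract; the
function name, the candidate names, and the edge weights are all
hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def densest_subgraph(edges):
    """Greedy peeling heuristic (an assumption, not the thesis's method).

    Repeatedly removes the node with the smallest weighted degree and
    returns the node set with the highest density seen along the way,
    where density = total edge weight / number of nodes.
    Assumes each undirected edge appears once in `edges`.
    """
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    nodes = set(adj)
    degree = {u: sum(adj[u].values()) for u in nodes}
    total = sum(w for _, _, w in edges)
    best_density, best_nodes = 0.0, set()
    while nodes:
        density = total / len(nodes)
        if density > best_density:
            best_density, best_nodes = density, set(nodes)
        # Peel the weakest node; ties are broken by name for determinism.
        u = min(nodes, key=lambda n: (degree[n], n))
        nodes.remove(u)
        for v, w in adj[u].items():
            if v in nodes:
                degree[v] -= w
                total -= w
    return best_nodes, best_density

# Hypothetical candidates for a sentence such as "Jordan plays for the Bulls":
# two candidate entities for "Jordan", one for "Bulls", one for "plays for".
toy_edges = [
    ("Michael_Jordan", "Chicago_Bulls", 1.0),
    ("Michael_Jordan", "member_of_sports_team", 1.0),
    ("Chicago_Bulls", "member_of_sports_team", 1.0),
    ("Michael_I_Jordan", "Chicago_Bulls", 0.5),  # weakly related candidate
]
linked, density = densest_subgraph(toy_edges)
print(sorted(linked), density)
```

In this toy graph the weakly connected candidate is peeled off first, and
the three mutually coherent candidates remain as the densest subgraph.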

We demonstrate the effectiveness and efficiency of the proposed solutions
to each of the above problems against state-of-the-art techniques through
extensive experiments on real-world datasets. Finally, we conclude the
thesis with future research directions and challenges related to the KBP
task.


Date:			Monday, 31 May 2021

Time:			10:00am - 12:00 noon

Zoom Meeting: 
https://hkust.zoom.us/j/91685764320?pwd=OThhYXFNTUVSQlcvekdZeEdzaUE2UT09

Chairperson:		Prof. Tiezheng QIAN (MATH)

Committee Members:	Prof. Lei CHEN (Supervisor)
 			Prof. Yangqiu SONG
 			Prof. Xiaofang ZHOU
 			Prof. Can YANG (MATH)
 			Prof. Jianliang XU (HKBU)


**** ALL are Welcome ****