RESEARCH INTERESTS OF MY GROUP
"We are like dwarfs on the shoulders of giants, so
that we can see more than they, and things at a greater distance." (John
of Salisbury)
Disclaimer: To facilitate better understanding of our work, we make some of our papers embedded in hyperlinks. However, it is no guarantee that they are the final version or the published version. Copyright and all rights are retained by authors or by other copyright holders.
My academic background is in the
field of databases and my research generally aims to explore big data analytics
in applications and their underlying implementation techniques. In particular,
the large distributed graph computation techniques make substantial impacts
[1-2] in the field and they support the cutting-edge technologies on big data
(big graph data) applications. The application scope of big graph data includes
social media [8-9] that supports collaborative web search and social
network-based question answering. We also have interests to study the
fundamental hashing techniques related to big data analytics [10-11].
Distributed Large Graph Computation:
Large graph processing in recommendation systems and knowledge learning are
common in real-life big data applications today. However, the challenges are that
large graphs are computationally expensive and the query evaluation on large
graphs is not practically efficient. The work [1] significantly improves a
well-known distributed computation system called Pregel by using the messages
reduction techniques. We develop a new open source system, called Quegel, for
querying big graphs [2]. Quegel treats queries as first-class citizens in its
design but its computation is light-weight, in the sense that only a small
portion of the input graph needs to be accessed. We incorporate new
graph-based strategy which mines order-preserving co-expressed microarray gene
information as evidenced in [5,6]. The mentioned work is related to graph
fundamentals in handling big bioinformatic data. We also study efficient hashing schemes for data in high dimensional
space [10-11]. The work tackles two problems as follows. First, the problem of
c-Approximate Nearest Neighbor (c-ANN) search in high-dimensional space is
fundamentally important in many big data applications, such as large image
databases. Another problem of the c-Approximate Furthest Neighbor (c-AFN)
search is fundamental to process big data analytic applications.
Social Media Applications:
We envision that large graph
techniques can be applied to support a wide range of social media applications.
Thus, we further exploit the knowledge graph to study meta-learning that solves
the hypernymy detection problem [4]. We tackle an interesting big data
application concerning how to capture users' precise preferences modeled as a
large graph, which is a fundamental problem in network search. We further exploit the social network
to study a unified system that solves the team formation problem in [9]. We
also incorporate new network strategy which maintains semantic relation network
derived from ambiguous web content in [8]. In [7] we employ users’ social networks for inferring user model, and
thus improve the performance of expert finding in Community-based Question
Answering systems.
Data Mining (Mainly done before 2010):
Data Mining can be viewed as querying on a database in order to identify relationship of data. Advanced query languages that support effective mining and searching are also our interests.
I Mining
Frequent Itemsets and Association Rules
It is well-recognized that the main factor
that hinders the applications of Frequent Itemsets (FIs for short) and
Association Rules (ARs for short) is the huge number of FIs and ARs returned by
the mining process (see journal paper \ref{kais06}).
We proposed effective solutions that present concise mining results by
eliminating the redundancy in both the set of FIs and the set of ARs (see
journal papers \ref{dami07}, \ref{jiis07} and conference papers \ref{icdm06}, \ref{pakdd06}).
First, we studied how to eliminate the redundancy in the set of FIs. Two major
works in this direction are Closed FIs (CFIs for short) and Maximal FIs (MFIs
for short). The set of CFIs is a lossless representation of FIs but the number
of CFIs is still too large, while the number of MFIs is small but MFIs lose the
frequency information of FIs. We proposed δ-Tolerance CFIs (δ-TCFIs)
to bridge the gap between the lossless CFIs and the lossy MFIs. The notion of
δ-tolerance relaxes the restrictive definition of CFIs to remove a
significantly greater amount of redundancy, while it still retains the
frequency information of FIs.
Second, we developed an effective solution that eliminates the redundancy in
the set of ARs based on the set of δ-TCFIs, proposed δ-Tolerance
ARs (δ-TARs for short), and devised a set of inference rules. We
established the interesting results that the set of δ-TARs is a
non-redundant representation of ARs, while the set of ARs that is derived from
the δ-TARs by the inference rules is sound and complete.
II Mining in Quantitative Databases and Graph Databases
We are interested in adopting
information-theoretic and statistical approaches to tackling mining problems in
quantitative and graph databases (see journal papers \ref{tods08},
\ref{kais07b}, conference papers \ref{kdd07}, \ref{kdd06}
and the synthetic graph data generator GrapGen).
Mining associations in quantitative databases is known to be too expensive to
be practical. We tackled this long-standing problem in the field by mining
correlations from quantitative databases and showed that it is a more effective
approach than mining associations. We proposed a new notion of Quantitative
Correlated Patterns (QCPs), which is founded on two formal concepts, mutual
information and all-confidence (see journal paper \ref{tods08a}).
First, we devised a normalized form of mutual information and applied it to QCP
mining to capture the dependency between the attributes. The notion of
all-confidence is employed as a quality measure to control, at a finer
granularity, the dependency between the attributes with specific quantitative
intervals. Second, we adopted a supervised method to combine the consecutive
intervals of the quantitative attributes based on mutual information, such that
the interval combining is guided by the dependency between the attributes.
In addition to tackling the mining problem in quantitative databases, we
proposed a new problem of correlation mining in the context of graph databases,
called Correlated Graph Search (CGS) (see journal paper \ref{tkde08a}).
CGS adopts Pearson's correlation coefficient as a correlation measure to take
into consideration the occurrence distributions of graphs. However, the problem
poses significant challenges, since every subgraph of a graph in the database
is a candidate but the number of subgraphs is exponential. We derived necessary
conditions that set bounds on the occurrence probability of a candidate in the
database. With this result, an efficient algorithm that operates on a much
smaller projected database was devised and thus a significantly smaller set of
candidates was obtained. To further improve the efficiency, we developed
heuristic rules and applied them to the candidate set to further reduce the
search space.
References:
[1] Da Yan, James Cheng, Lu Yi, Wilfred Ng: Effective Techniques for Message Reduction
and Load Balancing in Distributed Graph Computation. WWW 2015: 1307-1317
(2015).
[2] Da Yan, James Cheng, M.
Tamer Özsu, Fan Yang, Yi Lu, John C. S. Lui, Qizhen Zhang, Wilfred Ng: A General-Purpose Query-Centric Framework for Querying
Big Graphs. PVLDB 9(7): 564-575 (2016).
[3] Ji Cheng, Da Yan, Xiaotian
Hao, Wilfred Ng: Mining
Order-Preserving Submatrices Under Data Uncertainty: A Possible-World Approach.
ICDE 2019: 1154-1165 (2019).
[4] Changlong Yu, Jialong Han,
Haisong Zhang, Wilfred Ng: Hypernymy
Detection for Low-Resource Languages via Meta Learning. ACL 2020: 3651-3656
(2020).
[5] Q.
Fang, D. Su, Wilfred Ng and J. Feng.
An Effective Biclustering-Based Framework for Identifying Cell Subpopulations
From scRNA-seq Data. IEEE ACM Trans. Comput. Biol. Bioinform. 18(6): 2249-2260
(2021).
[6] J.
Cheng, D. Yan, W. Qu, X. Hao, C. Long and Wilfred
Ng. Mining OPSMs Under Data Uncertainty: A Possible-World Approach. ACM
Trans. Database Syst. 47(2): 7:1-7:57 (2022).
[7] Zhou Zhao, Lijun Zhang, Xiaofe Hei, Wilfred
Ng: Expert Finding for Question Answering via Graph Regularized Matrix
Completion. IEEE TKDE · 27(4):
993-1004 (2015).
[8] Kenneth Wai-Ting Leung, Di
Jiang, Dik Lun Lee, Wilfred Ng: Constructing
Maintainable Semantic Relation Network from Ambiguous Concepts in Web Content.
ACM Trans. Internet Tech. 16(1): 6:1-6:23 (2016).
[9] Xinyu Wang, Zhou Zhao, Wilfred Ng: USTF: A Unified System of
Team Formation. IEEE Trans. Big Data 2(1): 70-84 (2016).
[10] Qiang Huang, Jianlin Feng,
Qiong Fang, Wilfred Ng: Two
Efficient Hashing Schemes for High-Dimensional Furthest Neighbor Search. IEEE TKDE · 29(12): 2772-2785 (2017).
[11] Qiang Huang, Jianlin Feng,
Qiong Fang, Wilfred Ng, Wei Wang,
Query-aware locality-sensitive hashing scheme for lp norm. VLDB J. 26(5): 683-708 (2017).