RESEARCH INTERESTS OF MY GROUP

"We are like dwarfs on the shoulders of giants, so that we can see more than they, and things at a greater distance." (John of Salisbury)
 

Disclaimer: To facilitate better understanding of our work, we make some of our papers embedded in hyperlinks. However, it is no guarantee that they are the final version or the published version. Copyright and all rights are retained by authors or by other copyright holders.

 

My academic background is in the field of databases and my research generally aims to explore big data analytics in applications and their underlying implementation techniques. In particular, the large distributed graph computation techniques make substantial impacts [1-2] in the field and they support the cutting-edge technologies on big data (big graph data) applications. The application scope of big graph data includes social media [8-9] that supports collaborative web search and social network-based question answering. We also have interests to study the fundamental hashing techniques related to big data analytics [10-11].

 

Distributed Large Graph Computation:

 

Large graph processing in recommendation systems and knowledge learning are common in real-life big data applications today. However, the challenges are that large graphs are computationally expensive and the query evaluation on large graphs is not practically efficient. The work [1] significantly improves a well-known distributed computation system called Pregel by using the messages reduction techniques. We develop a new open source system, called Quegel, for querying big graphs [2]. Quegel treats queries as first-class citizens in its design but its computation is light-weight, in the sense that only a small portion of the input graph needs to be accessed. We incorporate new graph-based strategy which mines order-preserving co-expressed microarray gene information as evidenced in [5,6]. The mentioned work is related to graph fundamentals in handling big bioinformatic data. We also study efficient hashing schemes for data in high dimensional space [10-11]. The work tackles two problems as follows. First, the problem of c-Approximate Nearest Neighbor (c-ANN) search in high-dimensional space is fundamentally important in many big data applications, such as large image databases. Another problem of the c-Approximate Furthest Neighbor (c-AFN) search is fundamental to process big data analytic applications.

 

Social Media Applications:

 

We envision that large graph techniques can be applied to support a wide range of social media applications. Thus, we further exploit the knowledge graph to study meta-learning that solves the hypernymy detection problem [4]. We tackle an interesting big data application concerning how to capture users' precise preferences modeled as a large graph, which is a fundamental problem in network search. We further exploit the social network to study a unified system that solves the team formation problem in [9]. We also incorporate new network strategy which maintains semantic relation network derived from ambiguous web content in [8]. In [7] we employ users’ social networks for inferring user model, and thus improve the performance of expert finding in Community-based Question Answering systems.

 

Data Mining (Mainly done before 2010):

 

Data Mining can be viewed as querying on a database in order to identify relationship of data. Advanced query languages that support effective mining and searching are also our interests.

 

I   Mining Frequent Itemsets and Association Rules

It is well-recognized that the main factor that hinders the applications of Frequent Itemsets (FIs for short) and Association Rules (ARs for short) is the huge number of FIs and ARs returned by the mining process (see journal paper \ref{kais06}).

We proposed effective solutions that present concise mining results by eliminating the redundancy in both the set of FIs and the set of ARs (see journal papers \ref{dami07}, \ref{jiis07} and conference papers \ref{icdm06}, \ref{pakdd06}).

First, we studied how to eliminate the redundancy in the set of FIs. Two major works in this direction are Closed FIs (CFIs for short) and Maximal FIs (MFIs for short). The set of CFIs is a lossless representation of FIs but the number of CFIs is still too large, while the number of MFIs is small but MFIs lose the frequency information of FIs. We proposed δ-Tolerance CFIs (δ-TCFIs) to bridge the gap between the lossless CFIs and the lossy MFIs. The notion of δ-tolerance relaxes the restrictive definition of CFIs to remove a significantly greater amount of redundancy, while it still retains the frequency information of FIs.

Second, we developed an effective solution that eliminates the redundancy in the set of ARs based on the set of δ-TCFIs, proposed δ-Tolerance ARs (δ-TARs for short), and devised a set of inference rules. We established the interesting results that the set of δ-TARs is a non-redundant representation of ARs, while the set of ARs that is derived from the δ-TARs by the inference rules is sound and complete.


II  Mining in Quantitative Databases and Graph Databases

We are interested in adopting information-theoretic and statistical approaches to tackling mining problems in quantitative and graph databases (see journal papers \ref{tods08}, \ref{kais07b}, conference papers \ref{kdd07}, \ref{kdd06} and the synthetic graph data generator GrapGen).

Mining associations in quantitative databases is known to be too expensive to be practical. We tackled this long-standing problem in the field by mining correlations from quantitative databases and showed that it is a more effective approach than mining associations. We proposed a new notion of Quantitative Correlated Patterns (QCPs), which is founded on two formal concepts, mutual information and all-confidence (see journal paper \ref{tods08a}).

First, we devised a normalized form of mutual information and applied it to QCP mining to capture the dependency between the attributes. The notion of all-confidence is employed as a quality measure to control, at a finer granularity, the dependency between the attributes with specific quantitative intervals. Second, we adopted a supervised method to combine the consecutive intervals of the quantitative attributes based on mutual information, such that the interval combining is guided by the dependency between the attributes.

In addition to tackling the mining problem in quantitative databases, we proposed a new problem of correlation mining in the context of graph databases, called Correlated Graph Search (CGS) (see journal paper \ref{tkde08a}).

CGS adopts Pearson's correlation coefficient as a correlation measure to take into consideration the occurrence distributions of graphs. However, the problem poses significant challenges, since every subgraph of a graph in the database is a candidate but the number of subgraphs is exponential. We derived necessary conditions that set bounds on the occurrence probability of a candidate in the database. With this result, an efficient algorithm that operates on a much smaller projected database was devised and thus a significantly smaller set of candidates was obtained. To further improve the efficiency, we developed heuristic rules and applied them to the candidate set to further reduce the search space.

 

References:

 

[1] Da Yan, James Cheng, Lu Yi, Wilfred Ng: Effective Techniques for Message Reduction and Load Balancing in Distributed Graph Computation.  WWW  2015: 1307-1317 (2015).

[2] Da Yan, James Cheng, M. Tamer Özsu, Fan Yang, Yi Lu, John C. S. Lui, Qizhen Zhang, Wilfred Ng: A General-Purpose Query-Centric Framework for Querying Big Graphs. PVLDB 9(7): 564-575 (2016).

[3] Ji Cheng, Da Yan, Xiaotian Hao, Wilfred Ng: Mining Order-Preserving Submatrices Under Data Uncertainty: A Possible-World Approach. ICDE 2019: 1154-1165 (2019).

[4] Changlong Yu, Jialong Han, Haisong Zhang, Wilfred Ng: Hypernymy Detection for Low-Resource Languages via Meta Learning. ACL 2020: 3651-3656 (2020).

[5] Q. Fang, D. Su, Wilfred Ng and J. Feng. An Effective Biclustering-Based Framework for Identifying Cell Subpopulations From scRNA-seq Data. IEEE ACM Trans. Comput. Biol. Bioinform. 18(6): 2249-2260 (2021).

[6] J. Cheng, D. Yan, W. Qu, X. Hao, C. Long and Wilfred Ng. Mining OPSMs Under Data Uncertainty: A Possible-World Approach. ACM Trans. Database Syst. 47(2): 7:1-7:57 (2022).

[7] Zhou Zhao, Lijun Zhang, Xiaofe Hei, Wilfred Ng: Expert Finding for Question Answering via Graph Regularized Matrix Completion.  IEEE TKDE · 27(4): 993-1004 (2015).

[8] Kenneth Wai-Ting Leung, Di Jiang, Dik Lun Lee, Wilfred Ng: Constructing Maintainable Semantic Relation Network from Ambiguous Concepts in Web Content. ACM Trans. Internet Tech. 16(1): 6:1-6:23 (2016).

[9] Xinyu Wang, Zhou Zhao, Wilfred Ng: USTF: A Unified System of Team Formation. IEEE Trans. Big Data 2(1): 70-84 (2016).

[10] Qiang Huang, Jianlin Feng, Qiong Fang, Wilfred Ng: Two Efficient Hashing Schemes for High-Dimensional Furthest Neighbor Search.  IEEE TKDE · 29(12): 2772-2785 (2017).

[11] Qiang Huang, Jianlin Feng, Qiong Fang, Wilfred Ng, Wei Wang, Query-aware locality-sensitive hashing scheme for lp norm.  VLDB J. 26(5): 683-708 (2017).