Tag Prediction for Posts on StackExchange Sites

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

Title: "Tag Prediction for Posts on StackExchange Sites"

by

Mr. LI, Tao

Abstract:

In this project, we tackle the problem of automatically predicting
the tags of user posts on the StackExchange website according to
their topics. By making use of the training data, namely, user posts
with correct tags, we aim to predict the appropriate tags for new
user posts. Although it is natural to model this task as a multi-label
classification problem by treating the tags as class labels,
most existing methods suffer from performance and over-fitting
problems when dealing with large-scale data sets, especially
with a large number of candidate tags. This report proposes a
K-Nearest-Neighbor (KNN) like text auto-tagging approach for
multi-label classification. By extracting similar documents to
form a candidate set and scoring the tags that occur in the candidate
set based on the similarity between documents, the proposed method
scales well to very large data sets. In essence, the similar documents
in the candidate set vote for the tags in a way somewhat similar to
the KNN classifier. To efficiently and effectively generate the
candidates from the training data for each test instance, an
inverted index is built using an n-gram model for document
representation. Experiments conducted on the StackExchange posts
show that this approach can handle millions of text documents
with modest memory usage on modern computer hardware while also
achieving reasonably good prediction quality.
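
The pipeline described in the abstract (n-gram representation, inverted
index, candidate set, similarity-weighted tag voting) can be illustrated
with a minimal sketch. This is not the report's implementation. The class
name KnnTagger, the helper ngrams, the use of word unigrams and bigrams,
Jaccard similarity, and the parameters k and top are all assumptions made
for illustration only.

```python
from collections import defaultdict


def ngrams(text, n=2):
    """Return the set of word 1..n-grams of a text (assumed representation)."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + size])
        for size in range(1, n + 1)
        for i in range(len(words) - size + 1)
    }


class KnnTagger:
    """Hypothetical KNN-like tagger backed by an inverted index."""

    def __init__(self, n=2):
        self.n = n
        self.index = defaultdict(set)  # n-gram -> ids of posts containing it
        self.grams = []                # post id -> n-gram set of the post
        self.tags = []                 # post id -> tags of the post

    def add(self, text, tags):
        """Index one tagged training post."""
        doc_id = len(self.grams)
        grams = ngrams(text, self.n)
        self.grams.append(grams)
        self.tags.append(list(tags))
        for gram in grams:
            self.index[gram].add(doc_id)

    def predict(self, text, k=10, top=3):
        """Return up to `top` tags voted for by the k most similar posts."""
        query = ngrams(text, self.n)
        # Candidate generation: only posts sharing at least one n-gram
        # with the query are ever touched, via the inverted index.
        overlap = defaultdict(int)
        for gram in query:
            for doc_id in self.index.get(gram, ()):
                overlap[doc_id] += 1
        # Jaccard similarity (an assumed measure) for each candidate.
        sims = {
            doc_id: shared / (len(query) + len(self.grams[doc_id]) - shared)
            for doc_id, shared in overlap.items()
        }
        neighbours = sorted(sims, key=lambda d: (-sims[d], d))[:k]
        # Each neighbour votes for its tags, weighted by its similarity.
        scores = defaultdict(float)
        for doc_id in neighbours:
            for tag in self.tags[doc_id]:
                scores[tag] += sims[doc_id]
        return sorted(scores, key=lambda t: (-scores[t], t))[:top]


tagger = KnnTagger()
tagger.add("how to sort a list in python", ["python", "sorting"])
tagger.add("python list comprehension syntax", ["python", "list"])
tagger.add("how to bake sourdough bread", ["baking", "bread"])
print(tagger.predict("sort a python list", k=2, top=2))
```

Because candidates are collected through the index, the cost of a
prediction in this sketch depends on the number of posts that share
n-grams with the query rather than on the size of the whole training set.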

Date            :       2 May 2014 (Fri)

Time            :       10:30am to 11:30pm

Venue           :       5561 (lift 27)

Advisor         :       Prof. Dit-Yan YEUNG

2nd Reader      :       Prof. Nevin L. ZHANG