Tag Prediction for Posts on StackExchange Sites
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

Title: "Tag Prediction for Posts on StackExchange Sites"

by

Mr. LI, Tao

Abstract:

In this project, we tackle the problem of automatically predicting the tags of user posts on the StackExchange website according to their topics. Using training data, namely user posts with correct tags, we aim to predict the appropriate tags for new user posts. Although it is natural to model this as a multi-label classification problem by treating the tags as class labels, most existing methods suffer from performance and over-fitting problems when dealing with large-scale data sets, especially those with a large number of candidate tags.

This report proposes a K-Nearest-Neighbor (KNN)-like text auto-tagging approach for multi-label classification. By extracting similar documents to form a candidate set and scoring the tags that occur in the candidate set based on the similarity between documents, the proposed method scales well to very large data sets. In essence, the similar documents in the candidate set vote for the tags in a way somewhat similar to the KNN classifier. To generate the candidates from the training data for each test instance efficiently and effectively, an inverted index is built, with the n-gram model used for document representation.

Experiments conducted on the StackExchange posts show that this approach is not only computationally capable of dealing with millions of text documents with modest memory usage on modern computer hardware, but also achieves reasonably good prediction quality.

Date: 2 May 2014 (Fri)
Time: 10:30am to 11:30pm
Venue: 5561 (lift 27)
Advisor: Prof. Dit-Yan YEUNG
2nd Reader: Prof. Nevin L. ZHANG
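The candidate-and-vote idea described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not specify the similarity measure, the n-gram type, or the scoring rule, so the word-level n-grams, the Jaccard similarity, the parameters `k` and `top`, and the names `ngrams` and `KnnTagger` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ngrams(text, n=2):
    """Return the set of word 1-grams up to n-grams of a lowercased text."""
    words = text.lower().split()
    grams = set()
    for size in range(1, n + 1):
        for i in range(len(words) - size + 1):
            grams.add(" ".join(words[i:i + size]))
    return grams


class KnnTagger:
    """Hypothetical KNN-like tagger backed by an n-gram inverted index."""

    def __init__(self, n=2):
        self.n = n
        self.index = defaultdict(set)  # n-gram -> ids of training posts containing it
        self.doc_grams = []            # post id -> set of n-grams
        self.doc_tags = []             # post id -> list of tags

    def fit(self, posts):
        """Index an iterable of (text, tags) training pairs."""
        for text, tags in posts:
            doc_id = len(self.doc_grams)
            grams = ngrams(text, self.n)
            self.doc_grams.append(grams)
            self.doc_tags.append(list(tags))
            for gram in grams:
                self.index[gram].add(doc_id)
        return self

    def predict(self, text, k=3, top=2):
        """Return up to `top` tags voted for by the k most similar posts."""
        query = ngrams(text, self.n)
        # Candidate generation: only posts sharing at least one n-gram
        # with the query are ever touched, via the inverted index.
        overlap = Counter()
        for gram in query:
            for doc_id in self.index.get(gram, ()):
                overlap[doc_id] += 1
        # Assumed similarity: Jaccard overlap of the n-gram sets.
        sims = {d: c / len(query | self.doc_grams[d]) for d, c in overlap.items()}
        neighbours = sorted(sims, key=lambda d: (-sims[d], d))[:k]
        # Voting: each neighbour adds its similarity to each of its tags.
        scores = defaultdict(float)
        for d in neighbours:
            for tag in self.doc_tags[d]:
                scores[tag] += sims[d]
        return sorted(scores, key=lambda t: (-scores[t], t))[:top]


# Toy training posts, invented for this example.
posts = [
    ("how to sort a list in python", ["python", "sorting"]),
    ("python list comprehension syntax", ["python", "list"]),
    ("segmentation fault in c pointer code", ["c", "pointers"]),
]
tagger = KnnTagger().fit(posts)
print(tagger.predict("sort a python list", k=2, top=2))
```

Because candidates come only from the index entries of the query's own n-grams, the cost of a prediction depends on how many posts share terms with the query rather than on the total size of the training set, which is the scaling property the abstract emphasizes.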