Machine Learning for Spam Detection

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


FYT Presentation and Demonstration


Title: "Machine Learning for Spam Detection"

by

Mr. Ho Wai Pang, Tony


Abstract

Spam not only annoys users but also wastes Internet bandwidth, and it
poses a serious threat to email service. Although many content-based spam
filters have been developed, they alone cannot solve the problem, especially
now that spammers use various tricks to modify the content of spam mails
and fool the filters.
In this study, we propose a server-side spam filter which plays a
complementary role to a content-based filter by filtering out some spam
mails at the server level based on complementary features other than those
extracted from the mail content. By using a naïve Bayes classifier with
the reject option, we show that utilizing 18 features based on URL and
mail header information enables half of the emails to be classified with
a low false positive rate (<1%). We also address the URL information hiding
problem in our study. Moreover, an online survey has been conducted to
understand the user preferences regarding the use of spam filters. The
survey results show that the proportion of legitimate emails missed
should never exceed 5%. The implications of the survey results for
our future research will also be discussed.
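
As a rough illustration of what a naïve Bayes classifier with a reject
option can look like, the sketch below may help. It is not the
implementation used in this study. It assumes binary (0/1) features,
Laplace smoothing and two posterior thresholds; the function names, the
threshold values 0.99 and 0.01, and the feature encoding are all
hypothetical choices made for the example.

```python
import math


def train_bernoulli_nb(X, y, alpha=1.0):
    """Fit a Bernoulli naive Bayes model on binary feature vectors.

    X: list of equal-length lists of 0/1 values.
    y: list of labels (1 = spam, 0 = legitimate). Both classes must occur.
    alpha: Laplace smoothing constant (an assumption of this sketch).
    """
    n_features = len(X[0])
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior = math.log(len(rows) / len(X))
        # Smoothed estimate of P(feature_j = 1 | class c); never exactly 0 or 1.
        theta = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (log_prior, theta)
    return model


def spam_posterior(model, x):
    """Return P(spam | x) under the naive independence assumption."""
    log_joint = {}
    for c, (log_prior, theta) in model.items():
        total = log_prior
        for xj, t in zip(x, theta):
            total += math.log(t if xj else 1.0 - t)
        log_joint[c] = total
    # Normalise in log space to avoid underflow with many features.
    m = max(log_joint.values())
    z = sum(math.exp(v - m) for v in log_joint.values())
    return math.exp(log_joint[1] - m) / z


def classify_with_reject(model, x, spam_threshold=0.99, ham_threshold=0.01):
    """Decide only when the posterior is confident; otherwise reject.

    A rejected mail is left undecided, e.g. passed on to a content-based filter.
    """
    p = spam_posterior(model, x)
    if p >= spam_threshold:
        return "spam"
    if p <= ham_threshold:
        return "ham"
    return "reject"
```

In such a design, raising spam_threshold lowers the false positive rate
at the cost of rejecting (leaving undecided) a larger share of the mails.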


Date		:	28 April 2008, Monday

Time		:	10am to 11am

Venue		:	Room 3304

Advisor		: 	Dr. D.Y. Yeung

2nd Reader	:	Dr. Brian Mak