Financial Market Predictions using Web Mining Approaches

MPhil Thesis Defence


Title: "Financial Market Predictions using Web Mining Approaches"

By

Mr. Yao Ma


Abstract

There has been a lot of research on applying data mining and knowledge 
discovery technologies to financial market prediction. However, most 
existing research has focused on mining structured or numeric data such as 
financial reports and historical quotes. Another kind of data source, 
unstructured data such as financial news articles and expert commentary on 
financial markets, is usually far more readily available but seems to have 
been neglected because it is difficult to represent as numeric feature 
vectors to which data mining algorithms can be applied. Using text 
preprocessing (document representation) technologies, this thesis makes 
use of this kind of data, specifically financial news articles, to apply 
data mining to financial market prediction.

A web-based system has been developed for this purpose. It periodically 
retrieves financial news articles from the internet and uses text mining 
techniques to categorize them according to their expected effects on 
market behavior. The results are then compared with real market data. The 
system allows users to select a different algorithm for each phase of the 
text mining process, so that the results of different combinations of 
algorithms can be compared and the best combination selected. That 
combination can then be applied to financial market prediction in the 
future.

The text mining process has three phases in total: keyword extraction, 
keyword weighting, and classification. Keyword extraction derives a 
keyword list from a corpus according to each word's ability to distinguish 
the category of a document from the others. The system implements the 
following keyword extraction algorithms: document frequency threshold, 
entropy method, information gain, gain ratio, chi-square statistic, and 
mutual information. Keyword weighting transforms a document into a numeric 
feature vector based on the previously generated keyword list. Each word 
in the keyword list is assigned a weight according to the number of 
occurrences of that keyword in the document. The system implements Boolean 
weighting, term frequency (TF) weighting, term frequency times inverse 
document frequency (TFxIDF) weighting, LTC weighting, and TFC weighting. 
For classification, the system implements the Naive Bayes classifier and 
Support Vector Machines, but the experiments focus on the former.
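
As a rough illustration of the first two phases, the sketch below scores 
words with the chi-square statistic and then builds a TFxIDF feature 
vector over the selected keywords. It is not the thesis implementation: 
the toy corpus, the "up"/"down" labels, the function names, and the 
choice of idf = log(N / df) are all assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical toy corpus of (tokens, label) pairs. The labels "up" and
# "down" stand in for the expected market effect of each article.
CORPUS = [
    ("profit rises shares rally".split(), "up"),
    ("strong profit lifts shares".split(), "up"),
    ("loss widens shares slump".split(), "down"),
    ("weak sales loss deepens".split(), "down"),
]


def chi_square(term, label, corpus):
    """Chi-square statistic of a term for one category (2x2 table)."""
    n = len(corpus)
    a = sum(1 for toks, lab in corpus if term in toks and lab == label)
    b = sum(1 for toks, lab in corpus if term in toks and lab != label)
    c = sum(1 for toks, lab in corpus if term not in toks and lab == label)
    d = n - a - b - c
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def extract_keywords(corpus, k):
    """Keep the k terms with the highest chi-square score over all labels."""
    labels = {lab for _, lab in corpus}
    vocab = sorted({t for toks, _ in corpus for t in toks})
    scores = {t: max(chi_square(t, lab, corpus) for lab in labels)
              for t in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


def tfidf_vector(tokens, keywords, corpus):
    """Weight each keyword by term frequency times inverse document frequency."""
    n = len(corpus)
    tf = Counter(tokens)
    vec = []
    for kw in keywords:
        df = sum(1 for toks, _ in corpus if kw in toks)
        idf = math.log(n / df) if df else 0.0  # assumed idf variant
        vec.append(tf[kw] * idf)
    return vec
```

A new article would be tokenized, passed to tfidf_vector together with 
the output of extract_keywords, and the resulting vector handed to the 
classifier. Replacing tf[kw] * idf with 1 if tf[kw] else 0 would give 
Boolean weighting instead.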

The system collected news articles and market data and was tested to 
compare different algorithms for each phase. As there is a huge number of 
possible combinations of algorithms, we adopted a greedy approach to 
search for the best combination. Specifically, we vary the algorithm or 
parameters for one phase of the text mining process and fix all the 
others. Then, by observing the results, we select the best algorithm or 
parameter setting and assume that it is also the globally optimal choice 
for this phase no matter how the algorithms or parameters of the other 
phases vary. The results are presented and analyzed in this thesis in 
order to select the best combination of algorithms for text mining in 
financial market prediction.
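
The greedy search described above might be sketched as follows. This is 
only an illustration under stated assumptions: greedy_search, the phase 
and algorithm names, and the evaluate callback are hypothetical, and the 
scores in TOY_SCORE are arbitrary placeholders, not experimental results.

```python
def greedy_search(phases, evaluate):
    """Optimize one phase at a time, holding every other phase fixed.

    phases:   dict mapping a phase name to its candidate algorithm names.
    evaluate: hypothetical callable that takes a {phase: algorithm} dict
              and returns a score, where higher is better.
    """
    # Start from the first candidate of every phase.
    current = {phase: options[0] for phase, options in phases.items()}
    for phase, options in phases.items():
        best = max(options, key=lambda opt: evaluate({**current, phase: opt}))
        # The winner is assumed to stay optimal however later phases vary.
        current[phase] = best
    return current


# Hypothetical search space, named after methods listed in the abstract.
PHASES = {
    "extraction": ["document_frequency", "information_gain", "chi_square"],
    "weighting": ["boolean", "tf", "tfidf"],
    "classifier": ["naive_bayes"],
}

# Arbitrary placeholder scores used only to make the example runnable.
TOY_SCORE = {
    "document_frequency": 1, "information_gain": 3, "chi_square": 2,
    "boolean": 1, "tf": 2, "tfidf": 3, "naive_bayes": 0,
}


def toy_evaluate(config):
    """Stand-in for running the full pipeline and measuring its accuracy."""
    return sum(TOY_SCORE[name] for name in config.values())


best_config = greedy_search(PHASES, toy_evaluate)
```

With this search space the greedy pass evaluates 3 + 3 + 1 = 7 
configurations rather than all 3 x 3 x 1 = 9. The saving grows quickly 
as phases and candidates are added, at the cost of possibly missing a 
better combination when the phases interact.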


Date:			Thursday, 20 August 2009

Time:			2:00pm – 4:00pm

Venue:			Room 3501
 			Lifts 25-26

Committee Members:	Dr. David Rossiter (Supervisor)
 			Dr. Jogesh Muppala (Chairperson)
 			Dr. Lei Chen


**** ALL are Welcome ****