Financial Market Predictions using Web Mining Approaches
MPhil Thesis Defence

Title: "Financial Market Predictions using Web Mining Approaches"

By Mr. Yao Ma

Abstract

There has been a lot of research on applying data mining and knowledge discovery technologies to financial market prediction. However, most existing research has focused on mining structured or numeric data such as financial reports and historical quotes. Unstructured data, such as financial news articles and expert commentary on financial markets, is usually far more readily available, but it seems to have been neglected because it is hard to represent as the numeric feature vectors that data mining algorithms require. Using text preprocessing (document representation) technologies, this thesis applies data mining to this kind of data, specifically financial news articles, for financial market prediction.

A web-based system has been developed for this purpose. It periodically retrieves financial news articles from the internet and uses text mining techniques to classify them into categories according to their expected effect on market behavior. The results are then compared with real market data. The system lets users select a different algorithm for each phase of the text mining process, so that the results of different combinations of algorithms can be compared and the best combination selected. That combination can then be applied to financial market prediction in the future.

The text mining process has three phases: keyword extraction, keyword weighting and classification. Keyword extraction builds a keyword list from a corpus, choosing words according to their ability to distinguish the category of a document from the others.
The system implements the following keyword extraction algorithms: document frequency threshold, entropy method, information gain, gain ratio, chi-square statistic and mutual information. Keyword weighting transforms a document into a numeric feature vector according to the previously generated keyword list. Each word in the keyword list is assigned a weight according to the number of occurrences of that keyword in the document. The system implements Boolean weighting, term frequency (TF) weighting, term frequency times inverse document frequency (TFxIDF) weighting, LTC weighting and TFC weighting. For classification, the system implements the Naive Bayes classifier and Support Vector Machines, but the experiments focus on the former.

The system collected news articles and market data and was tested to compare different algorithms for each phase. As there is a huge number of possible combinations of algorithms, we adopted a greedy approach to find the best combination. Specifically, we vary the algorithm or parameters for one phase of the text mining process while fixing all the others. Then, by observing the results, we select the best algorithm or parameter setting and assume that it is also the globally best choice for this phase, no matter how the algorithms or parameters of the other phases vary. The results are presented and analyzed in this thesis in order to select the best combination of algorithms for text mining in financial market prediction.

Date: Thursday, 20 August 2009
Time: 2:00pm – 4:00pm
Venue: Room 3501, Lifts 25-26

Committee Members: Dr. David Rossiter (Supervisor)
Dr. Jogesh Muppala (Chairperson)
Dr. Lei Chen

**** ALL are Welcome ****
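
As an illustration only, the greedy selection described in the abstract can be sketched in Python as follows. The names `greedy_select`, `options` and `evaluate`, and the placeholder scoring function, are assumptions made for this sketch and are not taken from the thesis. A real `evaluate` would train and test the classifier on the collected news and market data.

```python
from typing import Callable, Dict, List

Config = Dict[str, str]


def greedy_select(options: Dict[str, List[str]],
                  evaluate: Callable[[Config], float]) -> Config:
    """Pick one option per phase by varying a single phase at a time.

    All other phases stay fixed while one phase is varied, and the
    winner for that phase is then assumed to be best overall.
    """
    # Start from the first listed option in every phase.
    current: Config = {phase: choices[0] for phase, choices in options.items()}
    for phase, choices in options.items():
        best_choice, best_score = current[phase], float("-inf")
        for choice in choices:
            trial = {**current, phase: choice}  # change only this phase
            score = evaluate(trial)
            if score > best_score:
                best_choice, best_score = choice, score
        current[phase] = best_choice  # freeze the winner for this phase
    return current


# Hypothetical option lists; only a subset of the algorithms named above.
options = {
    "extraction": ["document_frequency", "information_gain", "chi_square"],
    "weighting": ["boolean", "tf", "tfxidf"],
    "classifier": ["naive_bayes", "svm"],
}

calls = {"n": 0}


def evaluate(config: Config) -> float:
    """Placeholder score (total name length), NOT a real accuracy measure."""
    calls["n"] += 1
    return float(sum(len(name) for name in config.values()))


best = greedy_select(options, evaluate)
print(best, calls["n"])
```

With these option lists the greedy search evaluates 3 + 3 + 2 = 8 configurations, whereas an exhaustive search would need 3 x 3 x 2 = 18. The saving comes at the cost of the independence assumption stated in the abstract: the result is only guaranteed to be the overall best combination if the best choice for one phase does not depend on the choices made for the others.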