Data Mining in Finance (Multiple Features & Leading Stocks based Prediction)

MPhil Thesis Defence


Title: "Data Mining in Finance (Multiple Features & Leading Stocks based 
Prediction)"

By

Mr. Pong-Ching Wong


Abstract

With the rapid digitalization of mass media and the steady evolution of
computational modeling, applying data mining techniques to stock
forecasting has become a prevailing topic explored by numerous computer
scientists. At present, researchers focus on using more powerful modeling
techniques to improve forecasting accuracy, while relatively few consider
the importance of correct feature selection (identification of effective
factors).

In the financial market, whether a stock price goes up or down is mainly
driven by the expectations of investors and speculators. Formulating the
problem correctly by considering effective factors is superior to
increasing model complexity. In the real financial world, most investment
decisions rely on textual information (for instance, news articles) and
human perceptions (trends of stock price) to predict market price
movement. Most of these features, especially textual information, are
unstructured and fuzzy in nature and cannot be readily processed by
computational models; this leaves many technical challenges open in the
literature. In this thesis, we propose feasible solutions to process these
features and incorporate them into our forecasting models. A support
vector machine (SVM) based prediction system, the multiple feature based
forecasting framework (MFF), has been developed to process news, stock
trend components, and technical indicators as well as the Volatility Index
(VIX) in order to improve prediction performance. At the same time, based
on the shortcomings of current approaches in the literature to text mining
and trend component extraction, some possible methods are also discussed
and verified.

Apart from processing and utilizing additional features, we also attempt
to improve stock forecasting accuracy by studying the inter-relationships
among multiple financial time series, which are traditionally believed to
help investors optimize portfolios and speculators capture opportunities
for statistical arbitrage (typically, pair trading). Existing research
advocates using statistical correlations (co-movement in prices) between
different stocks as a factor mining method. However, high correlation
between financial time series does not imply causality. Therefore, we
should address the latter rather than the former. Notwithstanding the
foreseeable improvement from modeling causality, relatively few works
study it and thereby explore the potential of lagging effects to boost the
accuracy of stock prediction. In this thesis, we propose a novel leading
stock based prediction framework (LSPF), dedicated to mining leading
stocks. In this study, a stock is considered a leader if its rises or
falls precede those of other stocks. In other words, the predictive power
of any model of a led stock can be increased by including these leading
stocks as factors in the modeling process. LSPF tracks the leading and
lagging relationships between stocks by investigating three feasible
leading stock mining models: the linear Granger causality test, the
non-linear Granger causality test, and lagged correlation measurement. A
leadership ranking approach is suggested to weight the importance of the
leading and lagging stocks found by the mining process.
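
As a rough illustration of the lagged correlation idea mentioned above, the
sketch below correlates one stock's returns at time t - k with another
stock's returns at time t and picks the lag k with the strongest
relationship. The abstract does not say how the thesis computes this
measure, so the use of Pearson correlation on return series, the function
names (pearson, lagged_correlation, best_lag) and the toy data are all
assumptions made for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no co-movement signal
    return cov / sqrt(var_x * var_y)


def lagged_correlation(leader, follower, lag):
    """Correlate leader[t - lag] with follower[t] for a positive lag."""
    if len(leader) != len(follower):
        raise ValueError("series must have the same length")
    if lag < 1 or lag >= len(leader) - 1:
        raise ValueError("lag must leave at least two overlapping points")
    return pearson(leader[:-lag], follower[lag:])


def best_lag(leader, follower, max_lag):
    """Return (lag, correlation) with the largest absolute correlation."""
    scores = {
        k: lagged_correlation(leader, follower, k)
        for k in range(1, max_lag + 1)
    }
    k = max(scores, key=lambda lag: abs(scores[lag]))
    return k, scores[k]


# Synthetic toy returns (not data from the thesis): stock_b simply repeats
# stock_a's returns two steps later, so stock_a "leads" stock_b by 2.
stock_a = [0.01, -0.02, 0.03, 0.00, -0.01, 0.02, -0.03, 0.01, 0.02, -0.01]
stock_b = [0.00, 0.00] + stock_a[:-2]

lag, score = best_lag(stock_a, stock_b, max_lag=3)
print(f"strongest lag: {lag}, correlation: {score:.3f}")
```

A high lagged correlation alone is still only a co-movement measure, which
is presumably why it is listed alongside the two Granger causality tests
rather than in place of them.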

In our study of multiple features, extensive experiments on daily data
for Dow Jones constituent stocks on the New York Stock Exchange (NYSE)
show that our approaches with additional features clearly outperform
those using price and volume only. More importantly, simulated trading is
profitable (reaching over 200% annual return on several stocks, compared
with a Dow Jones Index performance of -25% over the same period) during
the sub-prime mortgage crisis, supporting the effectiveness and robustness
of our system during an economic downturn.

On the other hand, LSPF is evaluated in terms of the accuracy gains it
brings to different prediction models, including neural networks (NN) and
support vector regression (SVR). Evaluated on high-frequency market
microstructure data from the Hong Kong Stock Exchange (HKEX), LSPF is
shown to be robust to volatile stock markets, with a promising improvement
in prediction accuracy, which confirms the presence and significance of
leading stocks.


Date:			Friday, 18 June 2010

Time:			2:00pm – 4:00pm

Venue:			Room 3501
 			Lifts 25/26

Committee Members:	Dr. Lei Chen (Supervisor)
 			Prof. Frederick Lochovsky (Chairperson)
 			Dr. Raymond Wong


**** ALL are Welcome ****