Data Mining in Finance (Multiple Features & Leading Stocks based Prediction)
MPhil Thesis Defence
Title: "Data Mining in Finance (Multiple Features & Leading Stocks based Prediction)"
By Mr. Pong-Ching Wong
Abstract
With the rapid digitalization of mass media and the steady evolution of computational modeling, applying data mining techniques to stock forecasting has become a prevailing topic explored by numerous computer scientists. At present, researchers focus on using more powerful modeling techniques to improve forecasting accuracy, while relatively few consider the importance of correct feature selection (identifying effective factors). In the financial market, whether a stock price rises or falls is driven mainly by the expectations of investors and speculators. A correct problem formulation that considers effective factors is superior to increasing model complexity. In the real financial world, most investment decisions rely on textual information (for instance, news articles) and human perceptions (trends in stock prices) to predict market price movements. Most of these features, especially textual information, are unstructured and fuzzy in nature and cannot be readily processed by computational models, which leaves many technical challenges open in the literature. In this thesis, we propose feasible solutions for processing these features and incorporating them into our cutting-edge forecasting models. A support vector machine (SVM) based prediction system, the multiple feature based forecasting framework (MFF), has been developed to process news, stock trend components, and technical indicators, as well as the Volatility Index (VIX), in order to improve prediction performance. In addition, to address the shortcomings of existing approaches to text mining and trend component extraction, some possible methods are discussed and verified.
Apart from processing and utilizing additional features, we also attempt to improve stock forecasting accuracy by studying the inter-relationships among multiple financial time series, which are traditionally believed to help investors optimize portfolios and speculators capture opportunities for statistical arbitrage (most commonly, pair trading). Existing research advocates using statistical correlations (co-movement in prices) between different stocks as a factor mining method. However, high correlation between financial time series does not imply causality, so we should address the latter rather than the former. Despite the foreseeable improvement from modeling causality, relatively few works study it or explore the potential of lagging effects to boost the accuracy of stock prediction. In this thesis, we propose a novel leading stock based prediction framework (LSPF) dedicated to mining leading stocks. In this study, a stock is considered a leader if its rises and falls precede those of other stocks. In other words, the predictive power of a model for a led stock can be increased by including its leading stocks as factors in the modeling process. LSPF tracks the leading and lagging relationships between stocks by investigating three leading stock mining models: the linear Granger causality test, the non-linear Granger causality test, and lagged correlation measurement. A leadership ranking approach is suggested to weight the importance of the leading and lagging stocks found by the mining process. In our studies of multiple features, extensive experiments on daily data for the Dow Jones constituent stocks on the New York Stock Exchange (NYSE) show that our approaches with additional features clearly outperform those that use price and volume only.
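As an illustration only, the lagged correlation measurement named among the three LSPF mining models could be sketched as follows. This is not the implementation used in the thesis: the abstract does not specify the correlation measure, so Pearson correlation is assumed here, and the function names (pearson, lagged_correlation, best_lead) and the return series are hypothetical and synthetic.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return cov / sqrt(var_x * var_y)


def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag] for a lag >= 0."""
    if len(leader) != len(follower):
        raise ValueError("series must have the same length")
    if lag < 0 or len(leader) - lag < 2:
        raise ValueError("lag must leave at least two overlapping points")
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])


def best_lead(leader, follower, max_lag):
    """Return (lag, correlation) with the largest absolute correlation."""
    scores = {
        lag: lagged_correlation(leader, follower, lag)
        for lag in range(1, max_lag + 1)
    }
    best = max(scores, key=lambda lag: abs(scores[lag]))
    return best, scores[best]


# Synthetic daily returns: the follower repeats the leader two steps later.
leader = [0.01, -0.02, 0.03, -0.01, 0.02, -0.03, 0.01, 0.02, -0.01, 0.03]
follower = [0.0, 0.0] + leader[:-2]
lag, score = best_lead(leader, follower, max_lag=3)
print(f"strongest lead at lag {lag}, correlation {score:.3f}")
```

Because the synthetic follower is built as a two-step-delayed copy of the leader, lag 2 scores highest here. With real price data, a high lagged correlation would only flag a candidate leader and would not by itself establish causality.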
More importantly, simulated trading with the multiple-feature system was profitable during the sub-prime mortgage crisis (over 200% annual return on several stocks, compared with -25% for the Dow Jones Index over the same period), supporting the effectiveness and robustness of our system in an economic downturn. On the other hand, LSPF is evaluated in terms of the accuracy gains it brings to different prediction models, including neural networks (NN) and support vector regression (SVR). Tested on high-frequency market microstructure data from the Hong Kong Stock Exchange (HKEX), LSPF is shown to be robust to volatile stock markets, with promising improvements in prediction accuracy, which confirms the presence and significance of leading stocks.
Date: Friday, 18 June 2010
Time: 2:00pm – 4:00pm
Venue: Room 3501, Lifts 25/26
Committee Members: Dr. Lei Chen (Supervisor), Prof. Frederick Lochovsky (Chairperson), Dr. Raymond Wong
**** ALL are Welcome ****