Sunday, October 10, 2010

Data mining and artificial intelligence update

Longtime readers of this blog know that I haven't found data mining or artificial intelligence techniques to be very useful for my own trading, for they typically overfit to non-recurring past patterns. (Not surprisingly, they are much more useful for driverless cars.) Nevertheless, one must keep an open mind and continue to keep tabs on new developments in this field.

To this end, here is a new paper written by an engineering student at UC Berkeley which uses a "support vector machine" together with 10 simple technical indicators to predict the SPX index, purportedly with 60% accuracy. If one includes an additional indicator which measures the number of news articles on a stock in the previous day, then the accuracy supposedly goes up to 70%.

I have not yet had the chance to reproduce and verify this result, but I invite you to try it out and share your findings here. If you do so, you may find this new data mining product called 11Ants Analytics useful. It is Excel-based software that includes 11 machine learning algorithms, among them the aforementioned support vector machines. It also includes decision trees, which are sometimes quite useful in automatically generating a small set of trading rules from an input set of technical indicators. (Whether those rules remain profitable in the future is another question!) If you have tried this product, I would also appreciate your comments here.

(If you are a die-hard MATLAB fan, support vector machines are available in their Bioinformatics Toolbox, and classification and decision trees in their Statistics Toolbox.)

54 comments:

Glyphard said...

If one has the mathematics background to understand SVM and other machine learning tools, here is a link to some time saving results from a machine learning class at Stanford:

http://www.stanford.edu/class/cs229/projects2008.html

rara avis said...

Ernie, in MATLAB there is Spider, which has excellent SVM libraries.

Jez Liberty said...

To check the effect of data mining and whether the results are statistically significant (rather than the product of curve-fitting or data-mining bias), the bootstrap test implementing White's Reality Check is a very powerful tool.

Thanks for the links...

Jez
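
For readers who have not seen the Reality Check before, here is a minimal Python sketch of the idea: compare the best strategy's mean return against the bootstrap distribution of the best mean one would get by luck alone when choosing among all the strategies tried. It is only an illustration under stated assumptions. White's test uses the stationary bootstrap; this sketch swaps in plain i.i.d. resampling of dates to stay short, and the function name, parameters, and input layout are hypothetical.

```python
import random

def reality_check_p_value(strategy_returns, n_boot=1000, seed=0):
    """Simplified Reality Check p-value for the best of several strategies.

    strategy_returns: list of equal-length lists, one per strategy tried,
    each holding per-period returns in excess of the benchmark.
    Assumption: i.i.d. resampling of dates stands in for the stationary
    bootstrap used in White's original test.
    """
    rng = random.Random(seed)
    n = len(strategy_returns[0])
    means = [sum(r) / n for r in strategy_returns]
    observed_best = max(means)
    exceed = 0
    for _ in range(n_boot):
        # Use the same resampled dates for every strategy so that the
        # correlation between strategies is preserved.
        idx = [rng.randrange(n) for _ in range(n)]
        # Center each strategy at its own sample mean to impose the null
        # hypothesis that no strategy has a positive expected return.
        best = max(
            sum(r[i] for i in idx) / n - m
            for r, m in zip(strategy_returns, means)
        )
        if best >= observed_best:
            exceed += 1
    return exceed / n_boot
```

A small p-value suggests the best strategy's performance is unlikely to be explained by having searched over many variants. Every variant tried, including the discarded ones, should be included in `strategy_returns`.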

Ernie Chan said...

Thanks all ... very good resources all around so far.
Ernie

smc said...

These are also freely available in R. My favorite package for SVM is kernlab, but you can find more resources here:

http://cran.r-project.org/web/views/MachineLearning.html

Anonymous said...

If those results were true, why would anyone share them? Surely he must know he could either use the algorithm himself, or sell it, to make a fortune.

Ernie Chan said...

Anon,
As I explained in my book,
the utility of any published strategy is not that one can immediately profit from it without modifications, but that it may become profitable after your own improvements.
Ernie

Will Nelson said...

Would not 60/70% accuracy be enough to guarantee profits without any modifications?

Nizar said...

Will, it depends on the average size of profits/losses.

Nizar.

Nizar said...

Just to add, not sure if you guys are aware, but Rebellion Research run their fund purely using AI-based systems.

http://rebellionresearch.com/main.html

(Interesting Quiz under "Job Opportunities")

They can hardly compete with DE Shaw and Renaissance in terms of their size ($7 million in assets), but the fund was set up and is being run by four twentysomethings. They were up 41% last year.

I don't know of any other hedge funds that use AI.

Ernie Chan said...

Will,
There are lots of pitfalls in backtesting a strategy, even without using machine learning techniques, that may render "60% accuracy" quite meaningless. For example, are we certain that there is no look-ahead bias in this research? The only way to find out is to replicate the results.
Ernie
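
To make the look-ahead concern concrete, here is a small Python sketch of how a training set can be built so that each day's indicator uses only prices up to that day, while the label is the next day's direction. The single indicator (close relative to its 7-day EMA) and the function names are illustrative assumptions, not the paper's actual feature set.

```python
def ema(prices, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out

def build_dataset(closes, span=7):
    """Pair each day's feature with the NEXT day's direction.

    feature[t] = close[t] / EMA[t] - 1 uses prices up to day t only.
    label[t]   = 1 if close[t+1] > close[t], else 0.
    The final day is dropped because its label is not yet known.
    """
    smoothed = ema(closes, span)
    rows = []
    for t in range(len(closes) - 1):
        feature = closes[t] / smoothed[t] - 1.0
        label = 1 if closes[t + 1] > closes[t] else 0
        rows.append((feature, label))
    return rows
```

A quick sanity check for look-ahead is to change the last price in the series and confirm that none of the earlier features move.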

Alexandre Rubesam said...

I personally have tried using many different machine learning concepts (including SVMs) for trading, and never found a consistent technique, nor have I met someone who can make a consistent profit that way (which by itself means nothing about the possibility that such a technique might exist).

Their results sound encouraging, although I find it hard to believe that they would withstand a rigorous replication. Even if it were replicated and had 60% accuracy, from what I understood they are using contemporaneous information, i.e. it's not 60% out-of-sample prediction accuracy, which would be more than enough to make a trading strategy profitable.

Nedzad said...

Good morning Ernie,

I am reading your book and I have some questions I hope you don’t mind answering.

1. In your book and your blog, you mention that advanced statistical techniques may not be that useful in quantitative trading. Yet, such techniques are taught to the MFE students at NYU. Here is the link: http://cims.nyu.edu/~almgren/timeseries/

Where do you think these techniques can be utilized? Or are they really just an academic exercise?

2. Will you be giving your seminars next year too?

3. For individual traders, do you suggest investing in one of the backtesting platforms such as Alphacet? What is the price range these software packages fall in?

4. Do you still do pair trading strategies? If yes, are they profitable for your business?

Many thanks for your great help!

Ernie Chan said...

Nedzad,
1) By "advanced statistical techniques" I meant "advanced data mining/machine learning techniques". Time-series analysis that uses linear models is quite suitable for trading applications.
2) I hope to give seminars every year.
3) Alphacet is designed for institutional traders. For individual traders, platforms such as TradeStation, Matlab, or R would be more affordable.
4) We continue to pair-trade. Our fund has been profitable for the last 3 years, but we do trade more than just pairs.

Best,
Ernie

Anonymous said...

Hi Ernie,
Do you think that Principal component analysis or random matrix theory would be suitable to find clusters and trade overperforming/ underperforming components against the remaining group?
As a former physicist you will probably be able to answer my question.
Great blog. I haven't read the book yet but it is just a matter of time.
Thanks

Ernie Chan said...

Hi Anon,
Thanks for your kind comments.
I have not been able to apply PCA in any profitable pursuit. However, that doesn't mean you can't make it work!
Ernie

Anonymous said...

Hello Ernie,
You say in your book that while working in banks or hedge funds you suffered big losses which probably came from making the models overly complex. Did you at that point use the backtesting methodology you advocate nowadays? If not, do you think it would have shown the flaws in the models you were building at that moment?
I'm quite new in the industry and still wondering how much one can trust backtesting results.
Thanks

Ernie Chan said...

Hi Anon,
Careful out-of-sample backtesting can reduce, but not eliminate, the overfitting problem associated with overly complex models. The basic constraint is the limited amount of recent data suitable for out-of-sample testing. So I don't believe an improved backtesting method would have avoided the loss we suffered.
Ernie

Anonymous said...

Well thanks for the answer.
About the overfitting problem of complex models: you often advocate linear time series models instead of nonlinear ones. It seems to me that a GARCH model would not have many more parameters than an ARMA. Yet the fit would certainly be better. Do you think we would really increase the overfitting risk?

Ernie Chan said...

Hi Anon,
If a nonlinear model is justified theoretically apart from the fact that it fits the data better, then I agree that it should be used instead of a linear model.
Ernie

Anonymous said...

Hello Mr Chan,
I was reading one of your old posts (December 2006) about DNA, speech recognition, etc., and their applications to financial forecasting. That reminded me that I have wanted to learn more about HMM for quite a long time. Do you still have the same view about these models, i.e., that they most probably work (at Renaissance) but you couldn't make them work at your frequency?
I'm actually looking for models that would work at higher frequencies than the ones I use at the moment.
Do you know if there is a reference book to buy for someone who doesn't know much about HMM?
Thanks in advance.
Zarbouzou
ps : Amazing blog

Ernie Chan said...

Hi Anonymous,
Yes, I still have the view that HMMs probably work best in higher frequency data, though I still have not tried them -- maybe I should!
Ernie

Anonymous said...

Thanks Ernie,
Any reading you would recommend?
Zarbouzou

Unknown said...

Question from a trading novice: What is a good source for historical equity data that I can use to train a neural net?

Ernie Chan said...

Jordan,
Yahoo Finance and csidata.com both offer good historical equity data.
See Table 3.1 of my book for other sources.
Ernie

Ernie Chan said...

Anon,
The paper I cited in the main post also talks about HMM and gives other references.
Ernie

Dan Rico said...

I was able to partially reproduce the results from the paper in Matlab - thank you Ernie for sharing the link with us.
The SVM performance in predicting the stock and index prices (within the American and Canadian markets) matches that published by the authors.
I wasn’t able to check the performance improvement as a result of adding the additional feature - number of news articles on a stock in the previous day - because I don’t have the tool to gather data.
Does anyone know about a free engine that searches the web and parses out daily news articles about a stock?
On the same note, instead of using only the number of news articles on a stock I was thinking to include as a feature individual view on news articles and the weight-on-views similar to Black-Litterman portfolio management approach. Any comments?
I’ll appreciate any feedback and recommendations regarding stock/index/security features that can be added to the SVM engine to boost the performance.
I’ve used SVM before in the image processing field and I can see the broad application in the quantitative trading field.
Pros:
- reduce the risk of overfitting: small number of parameters to tune (only 2) compared to the back propagation neural network
- less sensitivity to noise
- dimensionality is not an issue
Cons:
- the process of kernel function selection can be cumbersome
- require a lot of memory and CPU time

Overall I think that SVM is a powerful toy to play with and I strongly recommend it.

Ernie Chan said...

Dan,
Thanks for the detailed report! It is good news indeed.

I don't know of any free tool for counting news on a particular stock, but if you are willing to spend about $160 a month, you can subscribe to the Newsware service, then you can download all the news related to a company on a day to a text file for your analysis.

I too have heard many good things about SVM, and have downloaded the free Matlab package Spider which rara avis suggested.

Ernie

Jason Victor said...

I have had wonderful success with support vector machines, in fact I have created a software product specifically for people looking to create strategies with them:

http://www.awwthor.com/finatica

Anonymous said...

Is anyone other than me having a problem with the fact that they retrain the SVM at each new data point?
I doubt that practical implementation can be done on a large number of stocks.
Are we not fitting noise if retraining at each step?

Will Nelson said...

Ernie,
This is a bit off topic but I'm wondering if you're aware of any research investigating statistical differences between financial price charts and their upside-down versions. In other words, do uptrends/downtrends and tops/bottoms differ in any meaningful way in, for example, EUR/USD?
If they did it wouldn't necessarily help trading, but it seems interesting. To me, visually, tops seem more spiky than bottoms. But that means nothing.

Ernie Chan said...

Will,
Indeed quite a few people have noticed this difference, but I am not aware of any published research on this observation. In equity price series, I suspect that sudden drops are more frequent and happen more rapidly than sudden rises. But in FX, which in general should not have a trend, and where it makes no sense to talk about top vs. bottom since we can just reverse the currency symbols, this should not happen.
Ernie

Dave Redpath said...

I've worked with classifiers for 4 years and found SVMs to be the easiest to use. They still have the issue of parameter tuning. I've successfully applied them to one of my own swing trading systems. Like you say, machine learning does like to overfit certain time periods and stocks. I've tried to get around this using different stock batches. I hope to get some info up on my own site (http://www.thealgorithmictrader.com) soon. Keep up the blog - it's good to see it generating some discussion.

Ernie Chan said...

Hi Dave,
Yes, I believe if you aggregate normalized training data from different stocks before you train the classifier, the results may be less subject to over-fitting. However, I find that to be too computationally intensive for my computer for a large number of stocks over many years. Maybe I need a better computer!
Ernie

Dave Redpath said...

Ernie, even from reading and experiments I still can't decide what the best training method is. I don't like normalizing the stock price anymore since I believe a $10 stock behaves differently from a $100 stock. I ended up splitting by price and using a classifier cascade/ensemble for each price band.

Chris Sutherland said...

Dan Rico was looking for the web site that parses news feeds and assigns a rating.

The company was called StockMood.com

Here's the link to a company reference:

http://www.pr-inside.com/techcrunch50-honors-stockmood-com-with-r834456.htm

They were in TechCrunch's top 50 of 2008.

I think they were gobbled up by a larger company and now included as a feature in a pay for service. (Thomson Reuters?)

Ernie Chan said...

Hi Anon,
Retraining at each step is not necessarily fitting to noise. Financial time series are unfortunately not stationary in a general statistical sense, so sometimes we do need to adapt the model parameters continuously. It is indeed quite computationally intensive, but maybe you can use Amazon's EC2 parallel core processors together with Matlab's parallel computation toolbox to help.
Ernie
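
As a rough illustration of what "retraining at each step" means, here is a Python sketch of walk-forward fitting: before each one-step forecast, the model is refit on a trailing window of the most recent data only. A least-squares AR(1) model is used purely as a cheap stand-in for the SVM (that substitution, the window length, and the function names are all assumptions made for the sake of a short example).

```python
def fit_ar1(returns):
    """Least-squares slope and intercept of r[t] regressed on r[t-1]."""
    x = returns[:-1]
    y = returns[1:]
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return 0.0, my  # constant input: fall back to the mean
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def walk_forward_predictions(returns, window=50):
    """Refit on the trailing `window` returns before each one-step forecast.

    preds[k] is the forecast of returns[window + k], made using only
    returns[k : window + k], i.e. data strictly before the forecast day.
    """
    preds = []
    for t in range(window, len(returns)):
        slope, intercept = fit_ar1(returns[t - window:t])
        preds.append(intercept + slope * returns[t - 1])
    return preds
```

The cost grows with the number of forecast days times the cost of one fit, which is why this becomes expensive with an SVM on many stocks.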

maverick said...

Hi Ernie,

I am a beginner in data mining and AI. What kind of platforms would you recommend for the backtesting of trading strategies ? Any particular books you recommend for backtesting in general besides yours ?

Ernie Chan said...

Hi maverick,
I use Matlab for my backtesting, so I would recommend that too. But you can use R too if you prefer a free platform.

You can find a list of recommended books on the right sidebar of my blog. However, I have not read any book that focuses on backtesting, since I learned all the relevant techniques from various colleagues and mentors in investment banks and hedge funds.
Best,
Ernie

Anonymous said...

Hello Ernie and everyone.
Has anyone tried to change the technical indicators or their parameters? Certainly, given enough technical indicators and choosing them carefully, one can in the end find a good accuracy (I'm not sure if SVM is really needed to do so). It would be interesting to know if the algorithm is not too sensitive.

Ernie Chan said...

Hi Anon,
I have tried to add and subtract some indicators, and indeed it improves the predictive accuracy somewhat.

However, the main problem is that even with the improved accuracy, the program is far from able to generate a good Sharpe ratio. It is possible to have a program that predicts with better than 50% accuracy whether the next day's return is positive or negative, and yet still have a negative Sharpe ratio, simply because the positive days are of smaller magnitude than the negative days.

Ernie
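
A tiny worked example in Python may make this concrete. The return series below is made up for illustration (it is not output from the SVM program), and the Sharpe ratio is computed with a zero risk-free rate and 252 trading days per year, both of which are simplifying assumptions.

```python
import math
import statistics

def hit_rate_and_sharpe(returns, periods_per_year=252):
    """Return (fraction of positive periods, annualized Sharpe ratio).

    Assumes a zero risk-free rate.
    """
    hit_rate = sum(1 for r in returns if r > 0) / len(returns)
    mean = statistics.mean(returns)
    stdev = statistics.stdev(returns)
    sharpe = math.sqrt(periods_per_year) * mean / stdev
    return hit_rate, sharpe

# Hypothetical strategy: right on 6 of every 10 days (+0.5% each),
# wrong on the other 4 days (-1.0% each).
daily_returns = ([0.005] * 6 + [-0.010] * 4) * 25

hit_rate, sharpe = hit_rate_and_sharpe(daily_returns)
print(f"hit rate = {hit_rate:.0%}, annualized Sharpe = {sharpe:.2f}")
```

Here the average day is 0.6 x 0.5% - 0.4 x 1.0% = -0.1%, so the Sharpe ratio is negative even though the direction is predicted correctly 60% of the time.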

Unknown said...

Just wanted to report that I did indeed try the 11antsanalytics free trial. I tried to replicate the construction of a model on the GOOG using the 10 mentioned inputs (EMA7, EMA50 etc), and 11ants was able to construct a model that was 90%+ accurate.

However, since it was optimizing on the test data, I held back some data, kept it out of the optimization and tried it with the predict mode. The results were far less spectacular; and more importantly, guessing the wrong direction on the majority of days.

So I also tried forcing it to classify patterns into bullish, bearish or sideways results, and it wasn't able to go better than 5% accuracy.

Lastly, after reading some literature about Vantage Point's claimed success with NN, I strung together inter-market data from:
^TNX, ^TYX, ^USDX, ^CRB,^G09 (Crude oil spot index),$EURUSD,JPYUSD, ^SP500,^G53 (Gold spot index)

Currently I'm running a prediction test for the next day close. The bullish, bearish, sideways classifier test results were terrible.

Unknown said...

According to the author of "Inside the Black Box", NN have been successfully employed for determining position sizing and risk management, as well as order book deconstruction.

The latter is where I want to eventually focus my efforts with machine learning. From what I understand these NN hunt out large iceberg orders by pattern matching. Typically the buying/selling algorithms tend to leave a fairly consistent footprint (although I'm sure this won't be the case forever).

To that end, I've been recording L2 market data daily for about a month now on the ES, ZB, ZN markets.

Ernie Chan said...

Jeremy,
I agree with your observations on SVM. I have yet to find any good performance measured by Sharpe ratio out-of-sample, and changing the future categories to up/down/sideways makes it worse.

But I am intrigued by your comments about using NN to detect iceberg orders.

Ernie

Anonymous said...

Has anyone tried using NinjaTrader as a backtesting and execution platform? It integrates very easily with my IB account.
I am not sure how accurate the backtesting is, but it provides all the bells and whistles for a modest cost.

One of the most dynamic features is the optimizing tool. It allows you to run 1000s of iterations with different periods and timeframes and choose the optimal parameters.

I'd love to get comments from others on the PROS and CONS.

Thanks.

Unknown said...

I use it for live trading and backtesting all my automated strategies. It's not without its quirks, and there are some limitations if you are really trying to manipulate your own order management, but otherwise it's fairly convenient when it comes to acquiring and putting real data to work.

If you're a real developer you might find a few things they did a little odd, since it was designed for "non-developers" and so you give up clean object oriented programming for simplicity in some cases. It uses C#, which is quite powerful.

As for suiting your needs, I'd suggest downloading it and trying it out. It's free to use indefinitely until you want to trade with real money. You don't have anything to lose by checking it out.

Anonymous said...

Hi. I read the paper by Rao and Hong that was referenced in the second paragraph of this thread. However, being entirely new to this area, I was confused as to how the authors specified the 5 states they used as defined in section 2.1.1: big price move up, small price move up, no move, small price move down, and big price move down. It seems that even if you use a clustering algorithm, you still have to define what a large price move is (size and time frame).

Ernie Chan said...

Anon,
You can classify the moves based on a moving percentile ranking of past returns.
Ernie

John Carse said...

Ernie,

In Table 3.1 of your book, you have provided a list of historical databases for backtesting. While I was looking for a data provider, I came across a company called Norgate based in Australia that claims to have survivorship-bias free data. Do you have any knowledge of or experience with Norgate? A link to their offering can be found here:

http://www.premiumdata.net/products/premiumdata/ushistorical.php

Great book!

John

Ernie Chan said...

John,
I am not familiar with the data vendor you mentioned, but I know that csidata.com provides delisted stock historical prices for a very reasonable price.
Ernie

Anonymous said...

Hi Ernie

Just off the topic a bit: I'm a newbie at developing trading strategies. Let's say I have 3 years of historical daily close data for a stock. For example, I apply a simple moving average crossing strategy and find that some specific moving average period is quite profitable over the long run. In this case, how can I apply a "bootstrapping test" to see whether the result is due to over-fitting?

Ernie Chan said...

Hi Anon,
See David Ruppert's book (http://tinyurl.com/mb88nby), section 10.5, "Bootstrapping Time Series".

One method is to simulate the data using an ARIMA model. The other method is "block bootstrap" by picking blocks of data at a time to preserve correlations.

Ernie
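
Here is a minimal Python sketch of the block bootstrap idea, for illustration only. It applies a moving block bootstrap to a strategy's daily return series to get a one-sided p-value for the mean return; the function names, block size, and the choice to test the mean are assumptions of this sketch, not a procedure taken from Ruppert's book.

```python
import random

def moving_block_bootstrap(series, block_size, rng):
    """Resample `series` by concatenating randomly chosen contiguous blocks.

    Keeping blocks intact preserves short-range serial correlation.
    """
    n = len(series)
    if not 1 <= block_size <= n:
        raise ValueError("block_size must be between 1 and len(series)")
    last_start = n - block_size
    sample = []
    while len(sample) < n:
        start = rng.randint(0, last_start)  # inclusive on both ends
        sample.extend(series[start:start + block_size])
    return sample[:n]

def bootstrap_p_value(returns, block_size=10, n_boot=2000, seed=0):
    """One-sided p-value for the null hypothesis that the mean return is zero."""
    rng = random.Random(seed)
    n = len(returns)
    observed = sum(returns) / n
    centered = [r - observed for r in returns]  # impose a zero mean
    exceed = 0
    for _ in range(n_boot):
        sample = moving_block_bootstrap(centered, block_size, rng)
        if sum(sample) / n >= observed:
            exceed += 1
    return exceed / n_boot
```

Note that this test on its own does not account for having picked the best of many moving average periods; the more periods were tried, the more optimistic the p-value for the winner will be.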