Genetic algorithm for intraday trading

Post by **kurthulse** » Fri Jan 01, 2016 11:41 pm

Development of an evolutionary genetic process to find intraday trading setups

This is the first of what will be a series of posts about my attempts to develop a trading system using an evolutionary genetic process. I'm sharing these notes in the Wave59 forum for several reasons:

1) They might be useful to others out there who are trying to develop "machine learning" and "data mining" solutions to trading.
2) Some of you might have suggestions about how I can improve the development of THIS learning machine. Those suggestions might be rooted in your experience using certain trading strategies, indicators, or forecasting tools. Alternately, suggestions might come from your own experiments in machine learning.
3) We might discover ways to collaborate.
4) Some readers might simply find the notes interesting.

As this thread progresses, I welcome ideas and suggestions about future directions for the research. (I might not always know how to apply your suggestions, but they are still welcome.)

The learning machine that I'm creating is not the same as the one Earik has built into recent versions of Wave59. Even so, there might be areas of similarity. I specifically wanted to start the development of THIS machine without having prior knowledge of how a genetic algorithm "should" work. It will be interesting to see whether the two approaches went in the same directions. (I might have taken some wrong turns.) Anyway, now that I have brought the machine to a certain point, I no longer feel a need to work on it in isolation.

Overview of the prototype machine

Price data and pre-calculated indicator values are housed in Microsoft Access database tables – a data repository – which currently includes information from the period of January 2010 to October 2015.

The evolution program was written in Microsoft's Visual Basic for Applications, and resulting data about the evolution of trading strategies are housed in a separate database that links to the tables in the data repository. Theoretically, this would allow for multiple versions of the evolution program to work simultaneously while drawing from the same data repository, but right now I only have enough computing resources to allow for one evolution program to run at a time.

The programming and the data architecture were constructed with the following objectives:

* Discover trading strategies that work intraday, without allowing positions to be held overnight.
* Initially work with ES price data, but be adaptable to other markets.
* Standardize all trading indicators in a way that allows them to be subjected to the same types of evaluation. This approach allows each indicator to be replaced by any other eligible indicator as dictated by the fitness tests and the randomized events of the evolutionary process.
* To reduce computation time, calculate all indicator values before the evolutionary process begins. These values are stored in the data repository, along with actual price data. Nearly all of the indicator values were originally calculated in Wave59 and then exported. There are advantages and disadvantages to this approach.
* To allow for multiple perspectives on the market environment, use indicators from multiple time frames simultaneously. This might be the largest area of difference between this machine and Wave59's genetic algorithm.

As an aside, I do not think of myself as a programmer. I'm just sort of muddling through with Visual Basic, but the program finally runs without crashing. I do have some background in biochemistry and genetics, which probably has informed the architecture a little bit.

The primordial soup - base data and indicators

In addition to standard OHLC values for price bars, most of the indicators used in the prototype were created so they can be calculated in the same way for different time frames. The time frames used in this intraday system include 60-minute, 15-minute, 4-minute and 1-minute bars, each stored in separate tables. The evolutionary process can "select" specific indicators to include in a trading strategy from among the entire set of indicators. Thus, indicators and their values are the components from which a trading strategy is built.

The choice of which indicators to include in the data set must balance the desirability of having more permutations of an indicator to represent market conditions, versus having fewer permutations in order to make the computation process efficient. Even without considering the calculation settings for each indicator, the use of multiple time frames already introduces multiple permutations. On the other hand, having only four time frames constrains the solution space in a way that I hope makes the problem more manageable.

The guiding principle I have used in selecting indicators is that each one must provide a perspective on market conditions that is substantially different from what other indicators in the system provide.

The aspects of market conditions that I can think to capture (so far) include:

Momentum, strength of trend, direction, etc. – These are expressed with various derivatives of the Adaptive CCI. An advantage of using the Adaptive CCI is that its output is normalized with respect to its zero line. Values rarely exceed ± 300.

Cycles, periodicity – This is described with a version of the Lomb periodogram.

Wave structure (1) – For this, I am using the Wave59 9-5 indicator, with special attention to "terminal" values of 8*, 9, 14* and 15.

Wave structure (2) – the system includes indicators describing the number of consecutive higher highs, lower highs, etc. that have preceded the bar that is being tested.

Market impetus, energy, etc. – Currently, I use various calculations based on the intraday NYSE TICK Index to stand in for the force that's behind a price move. I would like to include more indicators that are based on the new Wave59 Energy Bars and the bid-ask difference, but my data set for those things only goes back to the beginning of 2015. (This suggests an area of future exploration – acquire historical tick-by-tick data and translate it into Wave59 format.)

Specific events – An event, such as price poking through the Bollinger band, can sometimes foretell a change in trend. However, the event usually occurs earlier than the ideal trade entry time. To codify events such as this, the system notes the specific price bar on which the event occurs, and then it begins a decay count that goes from zero (the event bar) upward. Thus, the indicator value that gets incorporated into a trading strategy might be a decay value that lies within a certain range. For example, a trading strategy might look for an upper Bollinger band poke that is no more than four bars old. Other types of events include various momentum divergences, and they also work with decay times.
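The decay count described above can be sketched roughly as follows. This is Python rather than the VBA the system actually uses, and the function names and the sample flags are made up for illustration; the four-bar limit comes from the Bollinger band example.

```python
from typing import List, Optional


def decay_counts(event_flags: List[bool]) -> List[Optional[int]]:
    """Return, for each bar, the number of bars since the most recent event.

    The event bar itself gets 0. Bars before the first event get None.
    """
    counts: List[Optional[int]] = []
    bars_since: Optional[int] = None
    for flag in event_flags:
        if flag:
            bars_since = 0          # the event bar restarts the count
        elif bars_since is not None:
            bars_since += 1         # one more bar has passed since the event
        counts.append(bars_since)
    return counts


def event_is_fresh(decay: Optional[int], max_age: int) -> bool:
    """True if an event has occurred and is no more than max_age bars old."""
    return decay is not None and decay <= max_age


# Hypothetical flags: True where price poked above the upper Bollinger band.
pokes = [False, True, False, False, False, False, False, True, False]
decay = decay_counts(pokes)
fresh = [event_is_fresh(d, max_age=4) for d in decay]
```

Storing the decay value as a plain integer means a strategy can test it against a range in the same way it tests any other indicator.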

Data series for all of the indicators described above are included in the system for each of the four time frames. However, not all of the data series are "turned on" for a given experimental run. I have tentatively concluded that there is some benefit in choosing which indicators to add during successive generations of the evolution to guide and sculpt the population of trading strategies.

Inexplicable stuff – The system allows incorporation of esoteric functions, such as astrology calculations, as long as they can be expressed as having output that is inside or outside of specific ranges of values. I don't know much about astro trading, but Earik's Buttonwood thread inspired me to include the progressive and spherical indices in the current run of the evolution program. I'm sure there is more that could go here. Astrology represents the only type of indicator currently in the system that is completely independent from actual price data, making it (hopefully) predictive rather than derivative.

Other things – The system includes a small number of odd data series that don't fit into the above categories, including time of day, NYSE Net Issues, and a few others.

Market data that is not included -- One important aspect of market conditions that is not included in the system is chart geometry. As Earik mentioned in the Buttonwood thread, it is difficult to tell a computer how to interpret lines and curves in relation to price bars. I don't think it's impossible, but my earlier attempts at working on this problem didn't make much headway. As a framework for handling this kind of thing in some future version, maybe it would make sense to describe geometric objects on the chart (e.g., a line) as a field radiating out from the object. The system could evaluate the proximity of the highs and lows of price bars (or some other way of measuring price) in relation to that field. Perhaps the fields from different objects would interact. (I'm just speculating here.) This does not seem like the most urgent direction for future research.

In the next post, I'll describe how the candidate trading strategies are constructed.

Happy new year, everybody!

Post by **kurthulse** » Sun Jan 03, 2016 4:07 pm

The structure of trading setups in this system

The evolution program breeds trading setups. In keeping with the biological metaphor, I have taken to calling individual trading setups cells. A cell is characterized by a collection of trading rules based on the indicators described in the previous post.

Also as mentioned earlier, the design of the system allows all indicators/genes to be evaluated and handled in a consistent way, such that they are interchangeable during the process of cell pairing, mutations, and other types of replication. An indicator can be thought of as a gene, and different permutations of the evaluation criteria for a specific indicator are alleles of that gene.

A cell's own (possibly unique) set of indicators/genes and their specific alleles is the cell's genotype, and the evolution program keeps a record of the genotype of each cell that it produces throughout the evolution process.

The entirety of genes and their alleles that are available to be used in the whole system is the genome – like a library of indicators and evaluation criteria. As the evolution experiment proceeds, it is possible to "turn on" additional genes to make them available for cells to include in their genotypes, based on randomized criteria during the replication process. It is also possible to build new genes into the system without having to reset the whole process.

In order to enter a trade on behalf of the cell, the program must first determine that all of the cell's gene/allele combinations match the market conditions for the price bar being tested. (Trade exits are a different matter, and I'll cover those later.)

In addition to its genotype, a cell also carries information defining it as a long-trading or a short-trading cell. Thus, a trading setup in this system can only point in one direction. If you have tried to develop trading strategies in a formalized way, then you probably have noticed that the ES market is not "symmetrical" with respect to upward and downward moves. A setup that works well for long trades is likely to be less successful with short trades, and vice versa.

Encoding the genes and alleles

In order to make all the genes interchangeable and able to be evaluated using the same mathematical grammar, I rewrote indicator scripts in Wave59 so they could express the output of each indicator in terms of integer values. These values were then exported and imported to the data repository mentioned in the previous post.

For example, a cell might have a gene that requires the 15-minute Adaptive CCI to have a value greater than -200 before initiating a trade. If the cell has multiple copies of the same gene, with different alleles, it might require that the value of the Adaptive CCI be greater than -200 and also less than +100 before initiating the trade. Through the competitive selection process, cells can develop very specific criteria. On the other hand, unhelpful genes or alleles can be taken out of the population over successive generations if their host cells don't trade well.
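One way to picture the gene/allele encoding is the Python sketch below. It is only an illustration of the Adaptive CCI example above, not the actual VBA code: the class names, the `acci_15m` key, and the choice of a single `>` or `<` comparison per gene are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Gene:
    indicator: str   # hypothetical key for an indicator series, e.g. "acci_15m"
    comparison: str  # ">" or "<"
    threshold: int   # the allele: an integer cutoff for this indicator

    def matches(self, bar: Dict[str, int]) -> bool:
        """Check this gene against the indicator values for one bar."""
        value = bar[self.indicator]
        if self.comparison == ">":
            return value > self.threshold
        if self.comparison == "<":
            return value < self.threshold
        raise ValueError(f"unknown comparison: {self.comparison!r}")


@dataclass
class Cell:
    direction: str    # "long" or "short"
    genes: List[Gene]

    def should_enter(self, bar: Dict[str, int]) -> bool:
        """A trade is entered only if every gene matches the bar."""
        return all(gene.matches(bar) for gene in self.genes)


# Two copies of the same gene with different alleles define a range.
cell = Cell("long", [Gene("acci_15m", ">", -200), Gene("acci_15m", "<", 100)])
inside = cell.should_enter({"acci_15m": -50})    # within (-200, +100)
too_high = cell.should_enter({"acci_15m": 150})  # above +100
too_low = cell.should_enter({"acci_15m": -250})  # below -200
```

Because every gene answers the same yes/no question about an integer value, any gene can be swapped for any other during replication without special handling.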

Trading performance and the fitness test

In order to evaluate a cell as a trading setup, the evolution program tests the cell's trade entry criteria (as encoded in its genes and alleles) against the actual indicator values that were present in the market on a minute-by-minute basis for selected trading days. If all the criteria line up, then the cell enters a trade in whatever direction matches the cell's orientation. Other criteria are used to determine trade exits. Also, any open positions must be closed at the end of each trading day. The evolution program keeps track of how many winning trades and losing trades each cell produces, as well as the net profit or loss after commission and slippage. Cells that come out of the trading tests with higher scores are more likely to get invited to the replication party that produces the next generation.
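The post does not say exactly how higher-scoring cells are favored, so the Python sketch below uses rank-based weighting purely as a stand-in. The function name, the cell labels, and the profit figures are invented for illustration.

```python
import random
from typing import Dict, List


def pick_parents(net_profit: Dict[str, float], n_parents: int,
                 rng: random.Random) -> List[str]:
    """Pick parents with probability proportional to fitness rank.

    Ranking (worst = 1, best = N) sidesteps the problem that net profit
    can be negative, which rules out using it directly as a weight.
    Sampling is with replacement, so a strong cell can be picked twice.
    """
    ranked = sorted(net_profit, key=net_profit.get)          # worst to best
    weights = [rank + 1 for rank in range(len(ranked))]      # 1, 2, ..., N
    return rng.choices(ranked, weights=weights, k=n_parents)


# Hypothetical net profit per cell after commission and slippage.
profits = {"cell_a": -500.0, "cell_b": 120.0, "cell_c": 900.0, "cell_d": 40.0}
parents = pick_parents(profits, n_parents=6, rng=random.Random(59))
```

With this weighting the worst cell still has a small chance of being invited, which keeps some variety in the next generation.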

We know we want to avoid the "curve fitting" phenomenon in which the machine learns how to trade the historical market without being able to trade the real market. My approach to this problem serves two purposes – avoiding curve fitting and also keeping the computation time manageable. For each generation of cells, the evolution program chooses a set of trading days completely at random from the data repository. The selected days are not likely to be contiguous. Right now, I have the program set to choose 40 individual days to test each generation of cells. The next generation gets a different 40 days, and so on. Although it is true that the entire population of cells over all the generations is evolving to "fit" the market as it existed from 2010 to 2015, no individual cell ever gets to see more than 40 trading days from within that history.
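The per-generation sampling of trading days could look something like the following Python sketch. The list of weekdays is only a stand-in for the trading days actually stored in the data repository (it ignores holidays), and the function names are hypothetical.

```python
import random
from datetime import date, timedelta
from typing import List


def weekdays_between(start: date, end: date) -> List[date]:
    """All Monday-to-Friday dates from start to end inclusive."""
    days = []
    current = start
    while current <= end:
        if current.weekday() < 5:
            days.append(current)
        current += timedelta(days=1)
    return days


def sample_test_days(all_days: List[date], n_days: int,
                     rng: random.Random) -> List[date]:
    """Draw n_days distinct days at random, returned in calendar order."""
    return sorted(rng.sample(all_days, n_days))


all_days = weekdays_between(date(2010, 1, 4), date(2015, 10, 30))
rng = random.Random(2016)
generation_1_days = sample_test_days(all_days, 40, rng)
generation_2_days = sample_test_days(all_days, 40, rng)  # a fresh draw
```

Each generation draws from the full history again, so some days can reappear in later generations even though no two draws are likely to match.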

They grow up so fast (or not)

The speed of the evolution program seems okay, but I don't really have much to compare it to. When I run successive generations of a population of 400 cells, giving each generation 40 days of market data, it goes through about one cell every 7.5 seconds, or one generation in 50 minutes. The pace also depends a little bit (not a lot) on how many genes the cells are programmed to have. I'm working on the assumption that the program will need to go through several hundred generations (or more) to produce good results.
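To put those numbers in perspective, here is the arithmetic as a small Python sketch. The figure of 300 generations is only an assumed stand-in for "several hundred".

```python
def generation_minutes(cells: int, seconds_per_cell: float) -> float:
    """Minutes needed to test one generation of cells."""
    return cells * seconds_per_cell / 60


def run_days(generations: int, minutes_per_generation: float) -> float:
    """Days of continuous computing for the given number of generations."""
    return generations * minutes_per_generation / (60 * 24)


per_generation = generation_minutes(cells=400, seconds_per_cell=7.5)  # 50.0
# Assumed figure: 300 generations, running around the clock.
total_days = run_days(generations=300, minutes_per_generation=per_generation)
```

At that pace, 300 generations would take a little over ten days of uninterrupted running.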

This suggests another thing to try -- get a faster computer. Currently I run the program on a six-year-old Windows 64-bit machine with an Intel Q8300 quad-core CPU at 2.5 GHz. However, Microsoft Access only makes use of one of the four processors at a time. I also use the same machine for other things occasionally. It might help to get a newer, faster dual-core machine that could just run all the time. In the longer run, it might make sense to find out whether the whole project can be migrated to a database platform that makes use of more than one processor core. (Is that even possible?)

In the next post, I'll present some examples of setups the system has produced. I will also describe some of the limitations and challenges the system has shown so far, as well as questions that I'm hoping might draw input from others here.

Post by **earik** » Mon Jan 04, 2016 3:22 pm

Hi Kurt,

Very cool project you're working on! Thanks for sharing.

I'm sure you've thought of this already, but be very careful with aligning the different time frames together. I remember doing something similar at one point, and my results were so awesome that I thought I had the holy grail of machine learning systems. Turned out that I made a programming error and was reading data from the future due to not aligning the different time frames quite right. So if your results are amazing and unbelievable, that means something similar just happened in your case.
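A rough Python sketch of the alignment rule: a lower time frame bar may only see higher time frame bars that have already closed. It assumes each bar is stamped with its close time, which may not match how the data were exported; the function name and sample values are made up.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional


def last_completed_value(close_times: List[datetime], values: List[int],
                         as_of: datetime) -> Optional[int]:
    """Value of the most recent bar that closed at or before as_of.

    close_times must be sorted ascending. Returns None if no bar has
    closed yet, so the caller cannot accidentally read a future bar.
    """
    idx = bisect_right(close_times, as_of) - 1
    return values[idx] if idx >= 0 else None


# Hypothetical 15-minute indicator values, keyed by bar close time.
closes_15m = [datetime(2015, 10, 1, 9, 45),
              datetime(2015, 10, 1, 10, 0),
              datetime(2015, 10, 1, 10, 15)]
values_15m = [10, 20, 30]

# A 1-minute bar closing at 10:07 may only see the bar that closed at 10:00.
seen_at_1007 = last_completed_value(closes_15m, values_15m,
                                    datetime(2015, 10, 1, 10, 7))
seen_at_0940 = last_completed_value(closes_15m, values_15m,
                                    datetime(2015, 10, 1, 9, 40))
```

If the 15-minute bars were keyed by open time instead, the 10:07 lookup would pick up the bar that opens at 10:00 and closes at 10:15, which is exactly the kind of peek into the future described above.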

Excited to see what sort of results you end up with!

Regards,

Earik

Post by **supracharger** » Mon Jan 04, 2016 6:33 pm

Hey Kurt,

I am also developing my own separate AI program written in C#! I wanted something more customizable, and also something that one wouldn't have to pay an "arm and a leg" for. I have a working genetic algorithm and neural net, and, putting the two together, an evolved neural net. Additionally, I have a good framework to incorporate trading with the AI.

I am currently trying to find ways to improve my GA also (Incredible!). I did read a lot on the web to try to find those improvements. On another note, I am starting with the basics of AI and verifying its foundation, then moving to more advanced AI methods where I could get away from curve fitting.

Kurt: Do you understand multi-threading? If you don't, you don't need a better computer, since your code is running on only one thread.

- Andrew

Post by **earik** » Mon Jan 04, 2016 6:55 pm

supracharger wrote: "Kurt: Do you understand multi-threading? if you don't, you don't need a better computer, since your code is running on only one thread."

That's actually more of a programming thing than a computer thing. I'm not sure if VB allows multiple threads or not. One clumsy (but doable) workaround in that case is to break the work up among multiple programs. That way Windows will just handle spreading the work across all four cores itself. You can be as low-tech as having one program write the training data to a text file, as well as other text files with the various solutions to test, and then have it launch a number of worker apps, each of which reads the data, grabs one of the files, and gets to work. Then, when done, they all record the results somewhere. The only technical detail is that you'd need a way for the worker programs to signal to the master program that they've finished, which could be done via sockets, or if that's too complicated, then via some sort of status.txt file. Definitely clunky, but as long as you give each program enough work to offset the overhead of writing files, starting up, etc., you'd gain some performance.
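For comparison, here is roughly what the same split-the-work idea looks like in Python, where a process pool takes the place of the text files and the status signal. This is a sketch under assumptions, not a port of anything in this thread: the function names are hypothetical and `evaluate_chunk` is a placeholder for the real trading test.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Dict, List, Type


def split_into_chunks(items: List[str], n_chunks: int) -> List[List[str]]:
    """Deal items round-robin into n_chunks lists of near-equal size."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def evaluate_chunk(cell_ids: List[str]) -> Dict[str, float]:
    """Placeholder fitness test: scores each cell by the length of its id.

    A real worker would run the cell's trades against the market data.
    """
    return {cell_id: float(len(cell_id)) for cell_id in cell_ids}


def run_parallel(cell_ids: List[str], n_workers: int = 4,
                 executor_cls: Type[Executor] = ProcessPoolExecutor
                 ) -> Dict[str, float]:
    """Score all cells, one chunk per worker, and merge the results."""
    results: Dict[str, float] = {}
    with executor_cls(max_workers=n_workers) as pool:
        for partial in pool.map(evaluate_chunk,
                                split_into_chunks(cell_ids, n_workers)):
            results.update(partial)
    return results


# Quick check with threads, which avoids starting separate processes.
quick_check = run_parallel(["a", "bb", "ccc"], n_workers=2,
                           executor_cls=ThreadPoolExecutor)
```

To use separate processes (and therefore separate cores), call `run_parallel` with the default executor from under an `if __name__ == "__main__":` guard. As with the text file approach, each chunk needs enough work to outweigh the cost of starting the workers.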

Back when I was doing a very similar sort of thing, I got all excited about setting up a home-built cluster so I could run massive GAs at fast speeds. I bought a whole bunch of motherboards and cpus, and paid a guy to build me a huge custom case that could support multiple power supplies, etc. My plan was to do the same sort of thing that I outlined above, but distributed across multiple computers (in one housing) that all talked via sockets. I spent months fooling around with the whole thing, but ran into all sorts of technical difficulties, and never ended up getting the whole thing off the ground. In the end, I had to stop because the project was completely interfering with the rest of my life, and I ended up having to sell probably $10k worth of hardware via eBay, and had to throw all the rest out. Nowadays, you can just buy multi-core CPUs which do the same thing, so this is quite a bit easier to pull off than it was even a few years ago.

Regards,

Earik

Post by **supracharger** » Tue Mar 08, 2016 6:29 pm

Hi All,

Because of the sheer cost, my conversation with Earik, and the fact that it is not needed, I decided to steer away from HPC (High Performance Computing). A normal computer nowadays can process a ton of data, so with A.I. I think a person would want a smarter algorithm, rather than an algorithm that has to be run on a cluster of servers. I think I would try to utilize the GPU using something like CUDA before I venture into HPC. Earik, it’s hard to hear that you went through all of that work building a home-built cluster only to never finish it. A few years ago I built a Dynamic Optimizer in C++, using the techniques that you described in your post above. After finishing it, I played with it for a few months and never found a use for it again. Some could say it was a lost cause, but I think that it is one of the phases a person has to go through. I did attain a greater understanding of C++, so it wasn’t too much of a loss.

Anyway, I’m constantly trying to find ways to improve my G.A. so that it can lock onto a good solution quicker than another G.A. One of the things I think one needs in this field is to know “what it is doing.” One can do that through various graphs of different components. An example would be “how much the fitness number varies in the population of one generation.” If the fitness number varies very little or not at all, then those generations are a waste of computing power.

In those generations the calculated fitness is the same for every genome, so there is no good selection to pick from. The purpose of a G.A. is to find the best solution. If all of the genomes in the population are “Bobs,” then that whole population is a waste, because no matter how you slice it the best genome in that population will always be “Bob.” It is a waste of time, genome generations, and computing power. Hopefully, that makes sense.
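A minimal Python sketch of that kind of check, using the standard deviation of fitness within one generation as the measure of variation. The function names, the sample fitness values, and the cutoff of 1.0 are all assumptions chosen for illustration.

```python
from statistics import pstdev
from typing import List


def fitness_spread(fitness: List[float]) -> float:
    """Population standard deviation of fitness within one generation."""
    return pstdev(fitness)


def is_stagnant(fitness: List[float], min_spread: float) -> bool:
    """True when fitness varies so little that selection has nothing to pick."""
    return fitness_spread(fitness) < min_spread


# Hypothetical generations: one where every genome is "Bob", one with variety.
all_bobs = [100.0, 100.0, 100.0, 100.0, 100.0]
varied = [80.0, 120.0, 95.0, 140.0, 60.0]

bobs_stagnant = is_stagnant(all_bobs, min_spread=1.0)    # spread is 0.0
varied_stagnant = is_stagnant(varied, min_spread=1.0)    # spread is about 28.35
```

Plotting `fitness_spread` generation by generation gives one of the graphs mentioned above, and a long flat stretch near zero marks generations that added nothing.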

- Andrew
