(Message originally posted to the sourceforge mailing list, copied here for completeness)
I haven't had time to work on it in the last week, but I still intend to. I thought I'd update people on what I've been thinking. First, I would like to create a test suite I can use for regression testing. I've added a method for setting the seed, so in the future people can run regression tests without having to define FANN_NO_SEED. That way you can test the same library you've compiled for use, which makes sense to me. I'm still working on this part.
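A minimal sketch of why a settable seed helps regression testing. The names sketch_seed(), sketch_random_weight() and sketch_init_is_repeatable() are hypothetical stand-ins, not the actual FANN API, and the weight range is an arbitrary assumption. The only point is that a fixed seed makes weight initialization repeatable:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the new seed setter; not the real FANN name. */
static void sketch_seed(unsigned int seed)
{
    srand(seed);
}

/* Hypothetical weight initializer: a pseudo-random value in [-0.1, 0.1].
 * The range is an assumption made for this sketch. */
static float sketch_random_weight(void)
{
    return ((float)rand() / (float)RAND_MAX) * 0.2f - 0.1f;
}

/* Fill `weights` from `seed`, then reseed and regenerate the sequence.
 * Returns 1 if both passes produced identical values, 0 otherwise. */
static int sketch_init_is_repeatable(unsigned int seed, float *weights, size_t n)
{
    size_t i;
    int same = 1;

    sketch_seed(seed);
    for (i = 0; i < n; i++)
        weights[i] = sketch_random_weight();

    sketch_seed(seed);
    for (i = 0; i < n; i++) {
        float again = sketch_random_weight();
        if (weights[i] != again)
            same = 0;
    }
    return same;
}
```

A regression test could call sketch_init_is_repeatable(42u, buffer, 16) and then compare the network's outputs against stored reference values, all against the normally compiled library.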
Re: vectorizing while maintaining code readability (can I interest you in teraflops?):
The most difficult part of writing vector code is creating the proper memory structures to operate on. In the case of AltiVec, you can (and I have) automatically vectorize code. But if you look at the result, the processor will often spend much of its time translating to and from the correct memory structures.
A quick example: I vectorized some FORTRAN code and got a 1.5x speed improvement on AltiVec. Not terrible, but it should be better. So I looked at the code that was generated by the auto-vectorizer. The vector processor spent roughly *half* its time translating from memory that wasn't 16 byte aligned. The problem gets worse when you attempt to operate on a vector of values (like neuron sums, activations, or weights) that are stored in structures with unused data surrounding each value.
The conclusion here is that the most important part of creating vector code is creating the correct memory structure.
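To make the memory-structure point concrete, here is a rough sketch contrasting the two layouts. All struct and function names are hypothetical (they are not FANN's real data structures), and it assumes 4-byte floats and a C11 library that provides aligned_alloc():

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Array-of-structs: each weight is surrounded by unrelated fields, so
 * four consecutive weights are NOT contiguous in memory. */
struct sketch_neuron_aos {
    float weight;
    float sum;
    float value;
    int   activation_function;
};

/* Struct-of-arrays: each field is packed into its own contiguous,
 * 16-byte-aligned array, ready for 4-float vector loads. */
struct sketch_layer_soa {
    size_t n;        /* entry count, padded to a multiple of 4 */
    float *weights;
    float *sums;
};

/* Round n up to the next multiple of 4 floats (16 bytes with 4-byte floats). */
static size_t sketch_pad4(size_t n)
{
    return (n + 3u) & ~(size_t)3u;
}

/* Allocate both arrays on 16-byte boundaries. Returns 0 on success. */
static int sketch_layer_alloc(struct sketch_layer_soa *layer, size_t n)
{
    size_t padded = sketch_pad4(n);

    if (padded == 0)
        return -1;
    layer->n = padded;
    /* aligned_alloc wants a size that is a multiple of the alignment;
     * padded * sizeof(float) is one, assuming 4-byte floats. */
    layer->weights = aligned_alloc(16, padded * sizeof(float));
    layer->sums    = aligned_alloc(16, padded * sizeof(float));
    if (layer->weights == NULL || layer->sums == NULL) {
        free(layer->weights);
        free(layer->sums);
        layer->weights = NULL;
        layer->sums = NULL;
        return -1;
    }
    return 0;
}

static void sketch_layer_free(struct sketch_layer_soa *layer)
{
    free(layer->weights);
    free(layer->sums);
    layer->weights = NULL;
    layer->sums = NULL;
}

/* 1 if p sits on a 16-byte boundary, 0 otherwise. */
static int sketch_is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}
```

With the second layout a vector unit can load four weights in one aligned access instead of gathering them from four separate structs.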
I've decided the best way to do this is to create macros for loading/setting weights and other neuron values. The macros used can be chosen at compile time along with the processing method used. Something like these:
fann_neuron_weight_store()
fann_neuron_weight_load()
fann_neuron_sum_store()
fann_neuron_sum_load()
...
They will be customized to use data structures that are optimized for fann_run(), so the utility functions that operate on the nets will be a bit slower, but fann_run() will be more adjustable. You've probably heard the 90%/10% rule, where most programs spend 90% of their time in 10% of the code. These macros would be used in the code where the processor spends the least amount of time (90% of the code), and the code would be maintainable. This will allow people to create custom, fast solutions for fann_run().
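A rough sketch of how such accessor macros might look. The macro names follow the list above, but their argument lists, the SKETCH_LAYOUT_SOA switch, and the structs are assumptions made for illustration (one weight and one sum per entry is a simplification of a real net):

```c
#include <stddef.h>

/* Hypothetical layout switch; a real build would pick it on the
 * compiler command line, e.g. -DSKETCH_LAYOUT_SOA. */
#ifdef SKETCH_LAYOUT_SOA

/* Packed layout: one contiguous array per field. */
struct sketch_net {
    float *weights;
    float *sums;
};
#define fann_neuron_weight_load(net, i)     ((net)->weights[(i)])
#define fann_neuron_weight_store(net, i, v) ((net)->weights[(i)] = (v))
#define fann_neuron_sum_load(net, i)        ((net)->sums[(i)])
#define fann_neuron_sum_store(net, i, v)    ((net)->sums[(i)] = (v))

#else

/* Default layout: one struct per entry. */
struct sketch_entry {
    float weight;
    float sum;
};
struct sketch_net {
    struct sketch_entry *entries;
};
#define fann_neuron_weight_load(net, i)     ((net)->entries[(i)].weight)
#define fann_neuron_weight_store(net, i, v) ((net)->entries[(i)].weight = (v))
#define fann_neuron_sum_load(net, i)        ((net)->entries[(i)].sum)
#define fann_neuron_sum_store(net, i, v)    ((net)->entries[(i)].sum = (v))

#endif

/* Utility code written once against the macros compiles unchanged
 * with either layout. */
static void sketch_scale_weights(struct sketch_net *net, size_t n, float factor)
{
    size_t i;

    for (i = 0; i < n; i++)
        fann_neuron_weight_store(net, i,
                                 fann_neuron_weight_load(net, i) * factor);
}
```

The utility code pays a small cost for going through the macros, while a hand-tuned fann_run() would be free to read the underlying arrays directly for whichever layout was selected.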
Therefore, someone can create data structures that could be used with scalar, AltiVec, SSE, and even GPU processors. The possibility of using a GPU is particularly interesting, because it opens FANN up to operating on the scale of *teraflops* with consumer hardware. See also this recent slashdot story and associated links:
http://hardware.slashdot.org/article.pl ... 01/1519254
I've been doing a bunch of reading, and I believe that this is quite doable once these macros are in place (see also: http://www.gpgpu.org/ ). I would think that the ability to easily use massively parallel systems will open up new areas of ANN usage. (All this would probably also work with the Cell processor, but I'm not sure if that would be useful to users of libfann.)
To give people an idea of the speedups I'm talking about:
- Scalar
I expect this to run about the same speed as the original code. Packing neurons into a struct of arrays instead of an array of structs might speed the code up because of better cache usage. But using macros for neuron access in 90% of the code will slow things down a tad.
- AltiVec
At the least, it can operate on four floats at the same time, which yields up to a ~4x speedup. Luckily for PowerPC users, AltiVec also has hardware inverse, inverse exp(), inverse sqrt(), and some other useful functions. A patch for FANN v1.2 is optimized for AltiVec, and it was "between 5 and 20 (36 in one case!) times as fast". That's a 5x to 20x speedup in real-world tests.
- SSE
SSE is similar to AltiVec, but is missing some hardware features (like the inverse exp()). I would guess that SSE would end up closer to a ~4x speedup.
- GPU
This is where things really get interesting, and more so in the near future. Most modern GPUs have a texture processor that can do calculations using floating point numbers. They operate on vectors of 4 floating point values, and they can operate on many vectors simultaneously (8-32 pipelines?). As with all vector processors, their greatest speedup will occur with many interconnects between layers.
Imagine you want to use a 32x32 pixel display as an input to your net. That's 1024 input neurons. Two fully connected 1024-neuron layers will have about 1.05 million interconnects (1024 x 1024 = 1,048,576 weights). As currently programmed, it would take about 1.05 million passes to calculate the resulting sums. A GPU would treat the weights as a 1024x256 pixel texture and the neuron sums as a 256x1 pixel texture, with four floats per pixel. It could operate on 128 weights at a time in floating point, resulting in a 128x speedup (theoretically).
The same number of interconnects is used if the input is 64x64 pixels and your hidden layer is 256 neurons. Or if the input is ~74x74 RGB pixels and the hidden layer is 64 neurons. By my calculations that's enough processing power to simulate a fruit fly brain in realtime.
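The arithmetic behind these layer sizes can be checked with a few lines of C. The function names are hypothetical, bias weights are ignored, and reading "128 weights at a time" as 32 pipelines times 4 floats is an assumption, not something measured:

```c
/* Weights between two fully connected layers (bias terms ignored). */
static unsigned long sketch_weight_count(unsigned long inputs,
                                         unsigned long outputs)
{
    return inputs * outputs;
}

/* Passes needed when `lanes` weights are handled per pass, rounded up. */
static unsigned long sketch_passes(unsigned long weights, unsigned long lanes)
{
    return (weights + lanes - 1) / lanes;
}

/* Floats held by a width x height texture with 4 floats per pixel. */
static unsigned long sketch_texture_floats(unsigned long width,
                                           unsigned long height)
{
    return width * height * 4ul;
}
```

For example, sketch_weight_count(1024, 1024) and sketch_weight_count(64 * 64, 256) both give 1,048,576, which is exactly what a 1024x256 texture holds. The ~74x74 RGB case, sketch_weight_count(74 * 74 * 3, 64), gives 1,051,392, so it is close rather than identical. At 128 weights per pass the 1,048,576 weights take 8,192 passes.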
Thoughts? Misunderstandings? Miscalculations? If I worked on getting the scalar macros working, is there a graphics programmer out there who would get it running on GPUs (OpenGL)? Papers on working with massive ANNs?
Apologies if I'm confused on the GPU details. IANAGP (I Am Not A GPU Programmer)
~Seth
Appendix A:
Fruit fly brain calculations
100,000 neurons total
* 1024 connections per neuron
* 100 operations per connection calculation
* 100 neuron firing events per second
~= 1 teraflops
http://en.wikipedia.org/wiki/List_of_an ... of_neurons
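The same estimate as a small C function, so the factors are easy to vary. The function name is hypothetical, and all four inputs are the rough guesses listed above, not measured values:

```c
/* Estimated floating point operations per second for a simulated brain.
 * Every factor is a rough guess taken from the list above. */
static double sketch_brain_flops(double neurons,
                                 double connections_per_neuron,
                                 double ops_per_connection,
                                 double firings_per_second)
{
    return neurons * connections_per_neuron
         * ops_per_connection * firings_per_second;
}
```

Calling sketch_brain_flops(100000.0, 1024.0, 100.0, 100.0) gives 1.024e12, i.e. roughly 1 teraflops.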