Optimizing the optimizer
Post by zezo | 2009-06-09 | 13:17:53

I will post a few articles describing how the optimizer and the site work, including performance tuning tricks that I've used.

14 Comments
Post by zezo | 2009-06-09 | 13:21:10
The math:

The optimum sailing route is a classic optimization task. It's described briefly in this Wikipedia article.

As with other NP-complete problems, the trick is reducing the problem space without missing the optimum solution.

A brute-force approach becomes impractical very quickly, as the number of possible tracks is infinite for all practical purposes (as in computer chess programs).

A quick calculation:

VORG boat polars have 68 discrete points. We can discard some of them with the empirical knowledge that sailing at TWA < 36 or > 155 won't do any good, and that leaves 54 possible directions at each step.

If 10-minute steps are used, the total number of positions after 1 hour is 54^6, or about 2.5*10^10, and that's already too much.

So we need a way to reduce the number of starting positions at each step. I've used a simple polar approach where the number of 'best' points kept at each step is fixed. It may not be the best one, but it gets the complexity down from O(N^2) to O(N).

Then some assumptions are made about reasonable directions - like that you don't want to sail in the direction opposite to the destination. That does not work when you have a land mass in your route, but the polar approach is not very good in that case anyway.

That gets the number of possible destinations down to about 30, depending on the TWA, with more choices when reaching.

Now the number of possible positions for 24 hours is reduced to about 5 million, and that is manageable in real time on a modern CPU.

To be continued ...
Post by zezo | 2009-06-09 | 13:57:38
The program:

I coded the first proof of concept in Perl, just to see if the chosen approach would work at all. It did work, but was relatively slow, with run times on the order of minutes.

I had the idea of turning it into an online service from the beginning, and the run time had to be reduced to a few seconds to make it suitable for a multi-user environment.

So the next step was rewriting the core of the algorithm in C and checking the performance. The results were pretty interesting.

The original run time of the Perl version for the given parameters was about 300 seconds.

The time for the unoptimized C program was about 16 seconds, or 20 times faster. That showed that an interpreted language like Perl is not as much slower than C as we usually think - but it was still too slow.

Reducing the accuracy from double- to single-precision math brought that down to 10 seconds. Compiling with full optimizations for the target CPU brought it down to 6 s.

The compiler used so far was gcc 4. Then I decided to give the commercial ($800 license) Intel C compiler a try. Fortunately there is a time-limited trial available.

Compiling the code with it resulted in 2-second run time! Obviously gcc was not using the full vector math capabilities of modern CPUs.

I try to avoid using commercial software if possible, so I decided to help gcc a bit. There are two freely available SSE/SIMD math libraries for the x86 platform. Both have the very useful sincos4 function, which calculates both sin() and cos() of four angles in parallel, and that happened to fit the great circle formulas perfectly.

The vectorized great circle distance function looks like this (vf4 is presumably a union of an SSE vector and a float[4]; sincos_ps() comes from sse_mathfun.h):

float gc_distance(float lon1, float lat1, float lon2, float lat2)
{
    vf4 vx, sin4, cos4;

    /* Pack the four angles needed by the haversine formula */
    vx.f[0] = (lon2 - lon1) / 2;
    vx.f[1] = (lat2 - lat1) / 2;
    vx.f[2] = lat1;
    vx.f[3] = lat2;

    /* One SIMD call yields sin() and cos() of all four angles */
    sincos_ps(vx.v, &sin4.v, &cos4.v);

    /* a = sin^2(dlat/2) + cos(lat1) * cos(lat2) * sin^2(dlon/2) */
    float a = sin4.f[1] * sin4.f[1] + cos4.f[2] * cos4.f[3] * sin4.f[0] * sin4.f[0];

    /* Earth radius 6378 km, converted to nautical miles (1 nm = 1.852 km) */
    return 6378 / 1.852 * 2 * atan2f(sqrtf(a), sqrtf(1 - a));
}


Using sse_mathfun.h brought the gcc results to within 20% of icc, and that was good enough, so at this point I could go on with the rest of the project.

To be continued ...
Post by zezo | 2009-06-09 | 14:41:47
The site:

Now the overhead of loading GRIB files, polar diagrams, land collision chart and other data was becoming significant compared to the math core of the algorithm.

There are two approaches to reducing that effect:

1) Just reduce the overhead
2) Make the program run as a daemon, so the overhead applies only once for many optimizer runs.

I've chosen the simpler (first) one, with the following tricks:

The ground map and boat polars are turned into object files and compiled into the executable. The polar diagrams themselves are reduced to a single array holding the speed of the 'ideal' sail at every possible integer angle. This way determining the boat speed becomes a single table lookup.

Grib data is compiled into a raw binary file once every 12 hours, and that is read() as an array into memory. This works many orders of magnitude faster than parsing grib/txt/xml files.

Development time so far had been about two weeks of 16-hour days. It was the end of March, the start of Leg 6 was approaching, and I still had to build a web site to turn the optimizer and the accompanying perl/shell scripts into a useful service.

It was a mad rush. Some compromises were made - there was no time to implement a proper scrolling chart, data caching was broken, and so on - but the site went online on the morning of Apr 11th and had gathered about 1000 users by the end of the leg.

I was fixing bugs and doing support during the leg, so there was little time for development, but the long rest in Boston provided enough time to implement most of the current feature set. The full Changelog is still available.

To be continued ...
Post by zezo | 2009-06-09 | 15:23:51
Fine tuning:

The site was becoming quite popular and I was concerned about performance.

Again there are two possible approaches - throw more CPU power at it, or tune everything to the edge first.

Adding CPU power does not usually solve the problem. It's been stated that improvements in algorithms over the past 50 years have done more to speed up computations than the growth in raw power over the same period - and according to Moore's Law, raw power has increased millions of times during that period.

Here is a list of optimization tricks I've used, in no particular order:

Everything is cached, at a few levels.

Every computed track is stored in a database as numerical data and reused for subsequent display at different zoom levels. A few boats in the same position will also share the data - though this works only at the start.

The rendered png files are also stored in a disk cache and served as static content when possible.

Carefully adjusted HTTP expiration headers shift the caching from server to client, further reducing CPU and bandwidth use, and make the interactive response really smooth for the wind animation.
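For example, with lighttpd's mod_expire the rendered wind tiles could be given a client-side lifetime matching the forecast update cycle (the paths and the 2-hour value here are illustrative, not the actual site configuration):

```
server.modules += ( "mod_expire" )

# Let clients cache rendered wind tiles until the next GRIB update
expire.url = ( "/wind/" => "access plus 2 hours" )
```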

Some of the work is done with client-side javascript - turning friends on and off, and the wind display.

But not too much client-side javascript, as it tends to become sluggish at some point, as with the google maps api. There are also no standard js libraries like prototype or jquery - they make the life of the developer easier at the price of slower performance. The code is self-contained in one 1000-line file.

A fast and light HTTP server. Apache tends to eat a lot of memory and CPU and is not the best choice for loaded sites. I have used lighttpd, but there are other alternatives too.

The performance problem during Leg 7, described on the About page, was caused partly by a lack of locking of the chart data. When everyone hit the reload button at 11:02 the site did not respond for some time, and then everyone would hit reload again, adding new tasks to process. Adding some locking logic that prevents the same data from being processed twice helped bring the response time to an acceptable 20-30 seconds even during peak time.
Post by zezo | 2009-06-09 | 15:33:23
Conclusion:

I'd estimate that the above-mentioned tricks reduce the processing and bandwidth demands of the site at least 10 times, maybe more.

That makes a big difference, as the cost is brought down from something like 10000 EUR to below 1000 EUR, and makes it possible for me to run it as a free service without external funding.

Organizations don't usually think that way. They pay EUR 20000, because that is nothing compared to the total budget.

I hope that someone will find this article interesting or at least educational. I've left out many details, so feel free to ask additional questions.
Post by Vigilante | 2009-06-09 | 17:24:30
I have read all with awe.....!
Post by 63_Buri | 2009-06-10 | 00:25:26
Great write-up. Makes me long back to the days of limited memory on the Apple IIe and the 6502. Love your optimization efforts. If you ever need a reference in the English speaking world - let me know. :)
Post by Vigilante | 2009-06-10 | 04:01:01
Apple IIe? That was ours, too! Our 2nd was a Dec Rainbow which ran both CP/M and MS-Dos.
Post by Frankfurt | 2010-01-27 | 01:41:46
I only found this topic now. You really did a tremendous job, Cvetan. I love your tool.
Post by BigHorn | 2010-02-02 | 22:41:15
I loved this story. Thanks Cvetan
Post by fab | 2011-11-10 | 23:57:07
Great (Old) Post.

I've got a few questions Cvetan.

How can you reduce the problem to "only" 5*10^6 positions / 24H ?

What rule do you use for selecting the n "best points" at each step? (Closest to target?)
What rule do you use for selecting the direction at each step? (Max VMG up/down, max BS, max VMC?)

The performance of your work is really impressive.

PS : Thx for the tip of using sse_mathfun.
Post by Cags | 2011-12-17 | 17:39:24
Hey fab, I had similar questions. I'm not answering for Cvetan here, but here's how I solved the problem. Each iteration produces a new set of positions at each point on the compass. At each iteration, a set of best points is chosen for the next iteration. These are the farthest from the start position and essentially construct the isochrone. The many points that are not as far away are removed from the next step, which keeps the sample space small. Programmatically, I have an array which keeps tabs on the distance from the source at each point on the compass rose. It is only the final iteration which gives the optimum point (closest to the destination), and all other points are retrieved backwards from there. Like you I would like to know how zezo approached the problem, but this one worked for me. As it happens, the results are remarkably similar to zezo's in the short term, but it has most definitely pushed me north on leg 2.
Post by fab | 2011-12-31 | 01:02:35
Hi Cags, I solved the problem in a similar way. I've another question: which land/water mask do you use?
I use the NASA land mask (1' resolution) found here: http://mirrors.arsc.edu/nasa/landmask/ but zezo said that he converted it to an object statically linked with the executable. It seems that the bitmap is really huge - do you have a tip for reducing the memory footprint while keeping good performance?
Post by zezo | 2012-01-04 | 14:29:32
I use the same approach for reducing the points as described. It works fairly well most of the time, but needs some tuning. The main problem is running long forecasts with small steps - at some point the step becomes smaller than the linear size of a 1-degree sector and the track can't change sectors, which is crucial for the algorithm to work. It's also a problem with weak winds far away, because they reduce the distance traveled in one step. But it's not a bad tradeoff in the end.


I keep the chart as a 1-bit array, which is the most compact representation for a raster image, and the check is really fast (it's performed in the inner loop). This is for a 2' grid (30 cells per degree), a 7-megabyte image:

/* Row-major bitmap: 360*30 columns, rows counted down from lat 90 */
bit = round(30*lon) + 360*30*round(30*(90 - lat));
return ground_bin[bit/8] & (0x80 >> (bit%8));
Copyright 2009 by ZEZO.ORG. All Rights Reserved.