A realtime Mandelbrot zoomer in SSE assembly and CUDA

(Written around 2002, enhanced with OpenMP and CUDA in 2009)

Yep, looks boring when it doesn't move...

Update, Dec 2022: Implemented in-browser WASM version here.

Back in 2002, I wanted to try out SSE programming. I bought and installed an Athlon-XP based motherboard, and over the better part of a weekend, I created a simple implementation of a Mandelbrot zoomer in native SSE assembly. I was happy to see that it was almost 3 times faster than pure C - even when compared to pipelined floating point calculations.

Many years later, I revisited the code and enhanced it to use the GNU autotools; it therefore became portable to all Intel platforms under the sun (Linux/x86, Mac OS/X, Windows (via MinGW gcc), OpenBSD/amd64 and FreeBSD/amd64). Benchmarks showed - as expected - that OSes had very little to do with such "pure" calculations as this; speed was pretty much the same under all of them (I used libSDL for portable HW-assisted "blitting" of the pixel data).

Recently (2009), I learned how to use OpenMP and Intel TBB to make use of additional CPUs/cores - so I added OpenMP #pragmas to both the pure-C and SSE modes... and smiled as I saw my little Atom330 get a boost from 58 to around 150 frames per second. The manually written SSE code was also using GCC's AT/T syntax, so it could be automatically inlined by GCC even at -O3 optimizations.

I thought that this was the limit - what else could I do now but to sit back and wait for CPU cores to multiply?... Perhaps buy an i7?

Nope. Wrong.

Last weekend, I got to play with an NVIDIA GT240 (around 100$). Having read a lot of blogs about GPU programming, I downloaded the CUDA SDK and started reading some samples.

In less than one hour, I went from my rather complex SSE inline assembly, to a simple, clear Mandelbrot implementation... that run... 15 times faster!

Let me say this again: 1500% faster. Jaw dropping. Or put a different way: I went from 147fps at 320x240... to 210fps... at 1024x768!

I only have one comment for my fellow developers:

The algorithm in question was perfect for a CUDA implementation. You won't always get this kind of speedups (while at the same time doing it with clearer and significantly less code).

But regardless of that, you must start looking into GPGPU coding: CUDA (and soon, OpenCL) can become huge game changers.

If you don't, you are in danger of missing... significant opportunities...

Update, one week later: It turned out that my GT240 is so ...fresh, that the Linux drivers aren't yet up to speed - they are driving my card at Performance Level 1 instead of 2 (i.e. at 324MHz memory clock instead of 1700MHz!). Under Windows, the code runs at an astonishing 420 frames per second, so the speedup is not 15x, it is in fact more like 30x...

Update, March 2010: The newest NVIDIA drivers (195.36.15) have corrected the bug - the Linux and Windows versions now run at identical speeds: Blazing ones :‑)

Downloads

SSE version

The portable GPL source code for the SSE version is here.

UPDATE: Much better code, utilizing the XaoS algorithm and AVX instructions in my Github repo.

Compile and run with...

./configure && make && ./src/mandelSSE

You can zoom-in/zoom-out with the left and right mouse buttons. Hit ESC and it will report the frames per second achieved.

If you are a Windows user, here is a Win32 package (cross-compiled via MinGW). To use it, simply unzip the package and run either one of the two .bat files. The 'autopilot' one zooms in a specific location, while the other one allows you to zoom interactively using your mouse (left-click zooms in, right-click zooms out).

CUDA version

The CUDA package (portable GPL source code), is also available. The program also uses OpenGL to blit the CUDA-generated data on the screen - this way there's no moving of pixel data back-and-forth between the CPU and the GPU: they are calculated on the GPU global memory, and they are drawn from that (via a texture).

If you are working under Windows, a precompiled Windows binary is here (compressed with 7-zip). Make sure you disable Vertical Sync in order to get the maximum frame rate.

Code comparison

Just so you understand what I meant by "getting a speedup with clearer and simpler code", this is the CUDA CoreLoop, from inside src/mandel.cu: if you have ever coded a Mandelbrot zoomer, you'll probably agree that this is one of the simplest ways to implement it:

__global__ 
void CoreLoop(
    int *p, 
    float xld, float yld, /* Left-Down coordinates */
    float xru, float yru, /* Right-Up coordinates */
    int MAXX, int MAXY)   /* Window size */
{
    float re,im,rez,imz;
    float t1, t2, o1, o2;
    int k;
    unsigned result = 0;
    unsigned idx = blockIdx.x*blockDim.x + threadIdx.x;
    int y = idx / MAXX;
    int x = idx % MAXX;
    
    re = (float) xld + (xru-xld)*x/MAXX;
    im = (float) yld + (yru-yld)*y/MAXY;

    rez = 0.0f;
    imz = 0.0f;

    k = 0;
    while (k < ITERA) {
	o1 = rez * rez;
	o2 = imz * imz;
	t2 = 2 * rez * imz;
	t1 = o1 - o2;
	rez = t1 + re;
	imz = t2 + im;
	if (o1 + o2 > 4) {
	    result = k;
	    break;
	}
	k++;
    }

    p[y*MAXX + x] = lookup[result]; // Palettized lookup
}

Now compare that to the SSE main loop (brace yourselves):

        ;  x' = x^2 - y^2 + a
        ;  y' = 2xy + b
        ;
        mov     ecx, 0
        movaps  xmm5, [fours]     ; 4.     4.     4.     4.       ; xmm5
        movaps  xmm6, [re]        ; a0     a1     a2     a3       ; xmm6
        movaps  xmm7, [im]        ; b0     b1     b2     b3       ; xmm7
        xorps   xmm0, xmm0        ; 0.     0.     0.     0.
        xorps   xmm1, xmm1        ; 0.     0.     0.     0.
        xorps   xmm3, xmm3        ; 0.     0.     0.     0.       ; xmm3
loop1:
        movaps  xmm2, xmm0        ; x0     x1     x2     x3       ; xmm2
        mulps   xmm2, xmm1        ; x0*y0  x1*y1  x2*y2  x3*y3    ; xmm2
        mulps   xmm0, xmm0        ; x0^2   x1^2   x2^2   x3^2     ; xmm0
        mulps   xmm1, xmm1        ; y0^2   y1^2   y2^2   y3^2     ; xmm1
        movaps  xmm4, xmm0
        addps   xmm4, xmm1        ; x0^2+y0^2  x1...              ; xmm4
        subps   xmm0, xmm1        ; x0^2-y0^2  x1...              ; xmm0
        addps   xmm0, xmm6        ; x0'    x1'    x2'    x3'      ; xmm0
        movaps  xmm1, xmm2        ; x0*y0  x1*y1  x2*y2  x3*y3    ; xmm1
        addps   xmm1, xmm1        ; 2x0*y0 2x1*y1 2x2*y2 2x3*y3   ; xmm1
        addps   xmm1, xmm7        ; y0'    y1'    y2'    y3'      ; xmm1
        cmpltps xmm4, xmm5        ; <4     <4     <4     <4 ?     ; xmm2
        movaps  xmm2, xmm4

; at this point, xmm2 has all 1s in the non-overflowed pixels

        movmskps eax, xmm4        ; (lower 4 bits reflect comparisons)
        andps   xmm4, [ones]      ; so, prepare to increase the non-over
        addps   xmm3, xmm4        ; by updating the 4 bailout counters
        or      eax, eax          ; have all 4 pixels overflowed ?
        jz      short nomore      ; yes, we're done

        inc     ecx
        cmp     ecx, 119
        jnz     short loop1

Convinced? :‑)

Not only was this SSE code a pain to write, and impossible to maintain... It is now an order of magnitude slower than what you can get with cheap GPUs!

Times have changed (yet again)...

Update: The XaoS algorithm improves things by two orders of magnitude...

Check the Git repository for details.
Executive summary: 100s of frames per second with huge windows and interactive zooms...

Index

Updated: Sat Dec 3 19:52:56 2022