it framerate:fps
Drag mouse to rotate model. Hold shift key or use mouse wheel to zoom it (canvas is matched to the browser window). N = 4096. The script makes 2×it = 20 iterations per frame. Framerate is calculated after every 20 iterations. Be careful with overclocking, this script overheats my videocard quickly.

N-Body toy

N-Body dynamics
    d 2ri / dt 2 = Fi = G m ∑ k (rk - ri ) / |rk - ri|3
for N = 4096 are simulated. We put G m = 1 by the time scaling. Explicit scheme
    r t+2 = 2 r t+1 - r t + F t+1 Δt2
with constant time step Δt is used.
As bodies approach each other, the force between them grows without bound, which is an undesirable situation for numerical integration. To preclude collisions softening factor ε = 0.1 is added [1], this is reasonable if the bodies represent galaxies that may pass right through each other. The denominator is rewritten as follows:
    Fi = ∑ k (rk - ri ) / (|rk - ri|2 + ε2)3/2.
So this simulation is not absolutely accurate but funny. You can see clasters formation from a slowly rotating random ring.

See also N = 8K simulation. I can't make a spiral galaxy (I'm not an expert :)

Performance results. 400 Gflops in your home PC!

I've got 2×1010 interactions per second, i.e. 20-times speedup after I've replaces 1D textures by 2D ones. Thanks to Lars Nyland (NVIDIA) private message:
"From what I know of textures, the 1d textures are very slow compared to 2d. In 2004, when we did our Cg version of n-body, we converted the index i of a body into a pair (ix,iy) and used a 2d texture to look up the important attributes of the body. For something like 4096 bodies, we used a 64*64 texture. This was much faster than you would have expected."
It is similar to the CAL based simulations without loop-unrolling [2]. N-Body is an amazing but a little special application.
N =   4096     16K     32K
HD 4870   1200=20×60  
20
80=4×20
21.5
21=2.1×10
22.5
HD 4650   420=21×20
7
28=14×2  
7.5
5=2.5×2
5.4
You see above:
    integration time steps per second = fps × iterations per frame   and
    integration time steps per second × N 2 / 10 9   i.e.   interactions per second / 10 9
Multiply interactions per second by (about) 20 floating point operations to get Flops.

HD 4870   800 shaders, 790gpu/1100mem MHz, 256 bit bus width, GDDR5 memory
HD 4650   320 shaders, 712gpu/550mem MHz, 128 bit bus width, GDDR3 memory

Numbers: shaders power ratio is 800×790 / 320×712 = 2.8 and performance ratio is 20/7 = 2.9 so GDDR3 memory bandwidth is not a bottleneck in this simulation!

The algorithm outline

For N = 64×64 = 4096 bodies 3D coordinates are stored in two 64×64 RGBA float texture2D. The main computations are made in this shader. The for cycle sums up all forces.
<script id="shader-fs" type="x-shader/x-fragment"> 
precision highp float;
  uniform sampler2D samp;
  uniform sampler2D samp1;
  varying vec2 vTexCoord;
  const float d = 1./64., e2 = .01,  dt2 = .00001;
void main(void) {
   vec3 r = texture2D(samp, vTexCoord).xyz;
   vec3 r1  = texture2D(samp1, vTexCoord).xyz;
   vec3 f = vec3( 0. );
   for(float y = 0.; y < 1.; y += d ){
    for(float x = 0.; x < 1.; x += d ){
     vec3 v = texture2D(samp1, vTexCoord + vec2(x, y)).xyz - r1;
     float a = dot(v, v) + e2;
     f += v/(a*sqrt(a));
    }
   }
   r = 2.*r1 - r + f*dt2;
   gl_FragColor = vec4(r, 0. );
}
</script> 
This vertex script renders the 2×2 square into FBO to make computations for every pixel in textures.
<script id="shader-vs" type="x-shader/x-vertex"> 
  attribute vec2 aPos;
  attribute vec2 aTexCoord;
  varying   vec2 vTexCoord;
void main(void) {
   gl_Position = vec4(aPos, 0., 1.);
   vTexCoord = aTexCoord;
}
</script> 
N points square lattice is used to show the bodies in the shader below
<script id="shader-vs-show" type="x-shader/x-vertex"> 
  attribute vec2 aPoints;
  uniform mat4 mvMatrix;
  uniform mat4 prMatrix;
  uniform sampler2D uTexSamp;
  varying vec4 color;
void main(void) {
   gl_Position = prMatrix * mvMatrix * texture2D(uTexSamp, aPoints );
   color = vec4( 1.);
}
</script> 
Computations are organized by the function
function draw(){
   gl.viewport(0, 0, 64, 64);
   gl.useProgram(prog);
   for(var i = 0; i < it; i++){
     gl.uniform1i(sampLoc, 0);
     gl.uniform1i(samp1Loc, 1);
     gl.bindFramebuffer(gl.FRAMEBUFFER, FBO2);
     gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);

     gl.uniform1i(sampLoc, 1);
     gl.uniform1i(samp1Loc, 2);
     gl.bindFramebuffer(gl.FRAMEBUFFER, FBO);
     gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);

     gl.uniform1i(sampLoc, 2);
     gl.uniform1i(samp1Loc, 0);
     gl.bindFramebuffer(gl.FRAMEBUFFER, FBO1);
     gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);
   }
   drawScene();
   frames++;
}
at last drawScene() prepares transformations and executes
  gl.drawArrays(gl.POINTS, 0, 64*64);
to show scene.

[1] GPU Gems 3. Chapter 31. Fast N-Body Simulation with CUDA L.Nyland, M.Harris, J.Prins
[2] Kazuki Fujiwara, Naohito Nakasato Fast Simulations of Gravitational Many-body Problem on RV770 GPU arXiv:0904.3659v1
[3] S.F.P.Zwart, R.G.Belleman, P.M.Geldof   High-performance direct gravitational N-body simulations on graphics processing units. New Astronomy 12 (2007) 641-650
[4] Vanderbei's N-Body Page 1000 "Star Clusters" Java simulation
[5] WebGL Orrery by Ilmari Heikkinen


Simulations on GPU
updated 23 Jan 2011