A detailed study of the parallel performance of the interpolation-supplemented lattice Boltzmann (ISLB) method using SHMEM and MPI on the Cray T3E-900 and Cray X1 architectures is presented. A noteworthy feature of the present implementation of the ISLB method is that it achieves a sustained speed of 4.2 Tflop/s on 504 processors of a Cray X1. The code is shown to achieve super-linear speedups on the Cray T3E-900. Detailed profiling shows that both the computation and the communication scale well on the Cray X1, although the overall speedup is adversely affected by the cost of barrier synchronization. © 2006 Sage Publications.