The time a computer takes to access an individual data word in memory is becoming an acute problem, greatly restricting computing speed, especially for shared-memory supercomputers. As instruction cycle times shrink toward 10 picoseconds, corresponding to processor clock rates as high as 100 gigahertz (GHz), the latency of a memory reference will grow from roughly 200 cycles at today's 5 GHz to 4,000 cycles unless every processor has its own on-chip cache. However, for supercomputers with more than 1,024 processors, giving each processor a private cache is virtually impossible: the logic required to keep hundreds of copies of the same memory datum consistent while many processors change its value simultaneously is too complex to sustain fast instruction execution rates.
We propose to bridge this deep-latency gap with a combination of shared local caches and partial store ordering. Sharing each local cache among several processors ameliorates the cache coherence problem by reducing the number of copies that must be kept consistent. Partial store ordering provides partially centralized control that can consistently order conflicting changes to the same memory location. Together, they should lessen the problems caused by long ("deep") latency in accessing shared memory data words in fast multiprocessors.
We need the Seawulf cluster to extend our simulations, which determine application program execution rates under large memory access delays, as we search for the most generally effective combination of architectural features to solve the deep-latency problem for tomorrow's shared-memory supercomputers. We have requested enough time to keep eight processors busy for the nine months through June 2007.