We discuss the full single-threaded processing pipeline, then study the performance impact of speculation on one small program. Finally, we use mmap() to improve the throughput of our running example “closest” by caching…
We continue the “closest” running example, identifying several more performance bottlenecks. Finally, we make a first attempt at parallelizing the program, which, after fixing a race condition, brought the runtime down from 0.7 seconds single-threaded…
A first look at how the memory hierarchy (primarily caches, store buffers, and RAM) supports out-of-order execution alongside concurrent access to memory by threads running on different CPUs, while providing…
We discuss the OpenCL execution model and run several experiments to better understand its semantics. Watch out: the interpretation of an experiment with __local memory is incorrect; this is addressed in Lecture 24.
We review a misunderstanding from the previous lecture, then design a small OpenCL program to experiment with performance trade-offs in GPU programming.