NVIDIA’s Next Generation CUDA Compute Architecture, Code-Named “Fermi”
The Fermi architecture is the most significant leap forward in GPU architecture since the original G80. G80 was our initial vision of what a unified graphics and computing parallel processor should look like. GT200 extended the performance and functionality of G80. With Fermi, we have taken all we have learned from the two prior processors and all the applications that were written for them, and employed a completely new approach to design to create the world’s first computational GPU.

When we started laying the groundwork for Fermi, we gathered extensive user feedback on GPU computing since the introduction of G80 and GT200, and focused on the following key areas for improvement:
• Improve Double Precision Performance—while single precision floating point performance was on the order of ten times that of desktop CPUs, many GPU computing applications needed stronger double precision performance as well.
• ECC Support—ECC allows GPU computing users to safely deploy large numbers of GPUs in datacenter installations, and ensures that data-sensitive applications such as medical imaging and financial options pricing are protected from memory errors.
• True Cache Hierarchy—some parallel algorithms were unable to use the GPU’s shared memory, and users requested a true cache architecture to aid them.
• More Shared Memory—many CUDA programmers requested more than 16 KB of SM shared memory to speed up their applications.
• Faster Context Switching—users requested faster context switches between application programs and faster graphics and compute interoperation.
• Faster Atomic Operations—users requested faster read-modify-write atomic operations for their parallel algorithms.
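The atomic operations in the last request are the read-modify-write primitives CUDA exposes as functions like `atomicAdd`. As a hypothetical illustration (the kernel and problem sizes below are illustrative, not from this document), a histogram is a common case: many threads update the same bins concurrently, so every increment must be atomic.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative histogram kernel: threads from many blocks update the same
// bins concurrently, so each increment is an atomic read-modify-write.
__global__ void histogram(const int *data, int n, int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1);
}

int main()
{
    const int n = 1 << 20, nbins = 16;
    int *h_data = new int[n];
    for (int i = 0; i < n; ++i) h_data[i] = i % nbins;

    int *d_data, *d_bins;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_bins, nbins * sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, nbins * sizeof(int));

    histogram<<<(n + 255) / 256, 256>>>(d_data, n, d_bins);

    int h_bins[16];
    cudaMemcpy(h_bins, d_bins, sizeof(h_bins), cudaMemcpyDeviceToHost);
    printf("bin 0 = %d\n", h_bins[0]);   // every bin should hold n / nbins = 65536

    cudaFree(d_data);
    cudaFree(d_bins);
    delete[] h_data;
    return 0;
}
```

Because the increments from different warps can land in any order, atomic throughput on such contended addresses is exactly what Fermi's memory subsystem improvements target.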
With these requests in mind, the Fermi team designed a processor that greatly increases raw compute horsepower, and through architectural innovations, also offers dramatically increased programmability and compute efficiency. The key architectural highlights of Fermi are:
• Third Generation Streaming Multiprocessor (SM)
o 32 CUDA cores per SM, 4x over GT200
o 8x the peak double precision floating point performance over GT200
o Dual Warp Scheduler simultaneously schedules and dispatches instructions from two independent warps
o 64 KB of RAM with a configurable partitioning of shared memory and L1 cache
• Second Generation Parallel Thread Execution ISA
o Unified Address Space with Full C++ Support
o Optimized for OpenCL and DirectCompute
o Full IEEE 754-2008 32-bit and 64-bit precision
o Full 32-bit integer path with 64-bit extensions
o Memory access instructions to support transition to 64-bit addressing
o Improved Performance through Predication
• Improved Memory Subsystem
o NVIDIA Parallel DataCache™ hierarchy with Configurable L1 and Unified L2 Caches
o First GPU with ECC memory support
o Greatly improved atomic memory operation performance
• NVIDIA GigaThread Engine
o 10x faster application context switching
o Concurrent kernel execution
o Out-of-order thread block execution
o Dual overlapped memory transfer engines
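The 64 KB shared memory / L1 partitioning listed above is chosen per kernel through the CUDA runtime call `cudaFuncSetCacheConfig`. A minimal sketch, assuming a made-up `reverse_block` kernel that leans on shared memory and therefore prefers the larger shared-memory split:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel that stages data in shared memory, then writes it
// back reversed within each 256-thread block.
__global__ void reverse_block(const float *in, float *out)
{
    __shared__ float tile[256];                 // per-block scratch in shared memory
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    tile[t] = in[i];
    __syncthreads();
    out[blockIdx.x * blockDim.x + (255 - t)] = tile[t];
}

int main()
{
    const int n = 1 << 16;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // Ask for the shared-memory-heavy split for this kernel. A kernel that
    // uses little shared memory could request cudaFuncCachePreferL1 instead.
    cudaFuncSetCacheConfig(reverse_block, cudaFuncCachePreferShared);

    reverse_block<<<n / 256, 256>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The preference is a hint per kernel, so a shared-memory-bound kernel and a cache-bound kernel in the same application can each get the split that suits them.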
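Concurrent kernel execution, one of the GigaThread features above, is expressed in CUDA by launching independent kernels into separate non-default streams; when resources allow, they can execute at the same time rather than serially. A hedged sketch (the two kernels are invented for illustration):

```cuda
#include <cuda_runtime.h>

// Two independent, illustrative kernels small enough to share the GPU.
__global__ void scale(float *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

__global__ void offset(float *x, float o, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += o;
}

int main()
{
    const int n = 1 << 16;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Kernels launched into distinct non-default streams have no implied
    // ordering between them and may run concurrently on Fermi-class hardware.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(a, 2.0f, n);
    offset<<<(n + 255) / 256, 256, 0, s2>>>(b, 1.0f, n);
    cudaDeviceSynchronize();            // wait for both streams to finish

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

The same stream mechanism drives the dual overlapped memory transfer engines: `cudaMemcpyAsync` copies in one stream can overlap kernels running in another.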