And AMD/ATI has called them stream processors since the X1000 series. No bells ringing yet?
The difference is that AMD actually moved from VLIW5/4 to individual stream processors, while NVIDIA didn't have to.
http://cinwell.wordpress.com/2013/09/06/evolution-of-gpu-gt80-gt200-fermi-kepler/
G80: November 2006. The initial vision of what a unified graphics and computing parallel processor should look like. It was the first GPU to support C and the first to use a unified processor design with a scalar thread processor; it introduced the SIMT execution model (multiple independent threads executing concurrently from a single instruction) as well as shared memory and barrier synchronization for inter-thread communication.
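The shared-memory and barrier-synchronization model that G80 introduced is easiest to see in code. Here is a minimal sketch (my own illustrative example, not from the linked article): a block-level sum where threads cooperate through `__shared__` memory and `__syncthreads()` barriers.

```cuda
#include <cstdio>

// Each block of 256 threads sums 256 input elements cooperatively,
// using the shared memory and barriers introduced with G80.
__global__ void blockSum(const float *in, float *out) {
    __shared__ float buf[256];            // on-chip shared memory
    unsigned t = threadIdx.x;
    buf[t] = in[blockIdx.x * 256 + t];
    __syncthreads();                      // barrier: all loads visible

    // Tree reduction in shared memory, with a barrier between steps.
    for (unsigned s = 128; s > 0; s >>= 1) {
        if (t < s) buf[t] += buf[t + s];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = buf[0]; // one result per block
}

int main() {
    const int n = 256;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    blockSum<<<1, 256>>>(in, out);
    cudaDeviceSynchronize();
    printf("sum = %.1f\n", *out);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Every thread runs the same kernel code (that's the SIMT part), but they only coordinate through shared memory at the barriers.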
GT200: June 2008. As a major revision of the G80 architecture, GT200 mainly extended G80's performance and functionality. It increased the number of streaming processor cores from 128 to 240. Each processor's register file was doubled in size, allowing a greater number of threads to execute on-chip at any given time. Hardware memory access coalescing was added to improve memory access efficiency. Double precision floating point support was also added.
Fermi: 2010. The Fermi architecture is the most significant leap forward in GPU architectures since the original G80. The key areas for Fermi to improve were gathered from user feedback on GPU computing since the introduction of G80 and GT200: (1) Improve Double Precision Performance; (2) ECC Support; (3) True Cache Hierarchy: some parallel algorithms were unable to use the GPU’s shared memory, and users requested a true cache architecture to aid them; (4) More Shared Memory: many CUDA programmers requested more than 16 KB of SM shared memory to speed up their applications; (5) Faster Context Switching; (6) Faster Atomic Operations: users requested faster read-modify-write atomic operations for their parallel algorithms.
New features in Kepler GK110:
Dynamic Parallelism: adds the capability for the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU.
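In practice Dynamic Parallelism means a kernel can launch another kernel with the same `<<<>>>` syntax the host uses. A minimal sketch (my own example; requires compute capability 3.5+ and compiling with `-rdc=true`):

```cuda
#include <cstdio>

// Child kernel, launched from the device rather than the CPU.
__global__ void child(int parentBlock) {
    printf("child of block %d, thread %d\n", parentBlock, threadIdx.x);
}

// Parent kernel: one thread per block launches a child grid.
// A parent grid does not complete until its child grids have finished,
// so no explicit device-side synchronization is needed here.
__global__ void parent() {
    if (threadIdx.x == 0) {
        child<<<1, 4>>>(blockIdx.x);   // device-side launch
    }
}

int main() {
    parent<<<2, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

The CPU only launches `parent`; the GPU generates and schedules the child work entirely on its own, which is the whole point of the feature.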
Hyper-Q: enables multiple CPU cores to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and significantly reducing CPU idle times. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi).
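From the programmer's side, those work queues surface as CUDA streams. A quick sketch (my own example) of independent streams that Hyper-Q can map onto separate hardware connections instead of serializing through one, as Fermi did:

```cuda
#include <cstdio>

__global__ void work(int id) {
    if (threadIdx.x == 0) printf("stream %d running\n", id);
}

int main() {
    // Each stream is an independent work queue. On GK110, Hyper-Q
    // gives up to 32 such queues their own hardware connection.
    const int kStreams = 4;   // illustrative count, not a HW limit
    cudaStream_t s[kStreams];
    for (int i = 0; i < kStreams; ++i) cudaStreamCreate(&s[i]);

    for (int i = 0; i < kStreams; ++i)
        work<<<1, 32, 0, s[i]>>>(i);   // launches can overlap

    cudaDeviceSynchronize();
    for (int i = 0; i < kStreams; ++i) cudaStreamDestroy(s[i]);
    return 0;
}
```

On Fermi this code runs, but the streams funnel through a single connection and can falsely serialize; on GK110 they can genuinely execute concurrently.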
Grid Management Unit: enabling Dynamic Parallelism requires an advanced, flexible grid management and dispatch control system. The new GK110 Grid Management Unit (GMU) manages and prioritizes grids to be executed on the GPU. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. The GMU ensures both CPU- and GPU-generated workloads are properly managed and dispatched.
GPU Direct: it is a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to directly exchange data without needing to go to CPU/system memory.
SMX Architecture:
--------------------------------------------------------------------------------------------
All the improvements are around the CUDA cores: how they are organized, how they are fed, and the caches available to them.
I'm not trying to minimize the evolution of the NVIDIA architecture and its performance and feature gains.
I'm just emphasizing how versatile and flexible the CUDA stream processors are.
We all know VLIW5/4 could deliver awesome performance if properly coded for, but if not, much of its power was lost; that is why AMD changed to GCN.