Many folks hear the words "GPU" and "best" in the same sentence and automatically think NVidia. But are they right? Others swear by AMD (formerly ATi), but is there any merit to their claims? Let's take a look...
CUDA and Stream Processors: What are they and what is the difference?
CUDA: What it really is, means and does
Notice, as you read, that "CUDA" is really just an architecture: there's no special software and no "support" required; it's just another way of putting together one's electronic Legos. And for those morons who think otherwise? Laugh at their ignorance and ignore their every word; after reading this, you'll know better.
**Also take note that there's no special hardware code for "PhysX" either. Just one more marketing scheme, tossed your way to make you feel like you simply must purchase the company's product.
How it Works:
Previous versions of the CUDA architecture included 8 scalar* cores per Streaming Multiprocessor (SM), whereas "Fermi" supports up to 32. Previous versions of CUDA also scheduled 32 threads at a time - a group called a "warp" - and maxed out at somewhere between 24 and 32 warps active in one SM, depending on the generation. Fermi introduced a Dual Warp Scheduler to the picture, allowing two warps - 64 threads - to be issued at a time, and raised the ceiling to 48 warps active per SM.
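If the warp talk sounds abstract, here's a minimal sketch of my own (not NVidia's code - just an illustration, and it assumes you have the CUDA toolkit and a Fermi-or-newer card, since it uses device-side printf) showing how the threads you launch get carved into warps of 32:

```
#include <cstdio>

// Each thread works out which warp and lane it lives in. Threads are grouped
// into warps of warpSize (32) consecutive threads within a block; the SM's
// scheduler issues instructions one warp at a time (two at a time on Fermi's
// dual warp scheduler).
__global__ void whoAmI()
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    int warp = threadIdx.x / warpSize;                // warp within this block
    int lane = threadIdx.x % warpSize;                // position inside the warp
    if (lane == 0)                                    // one printout per warp
        printf("block %d, warp %d starts at global thread %d\n",
               (int)blockIdx.x, warp, tid);
}

int main()
{
    whoAmI<<<2, 128>>>();       // 2 blocks of 128 threads = 4 warps per block
    cudaDeviceSynchronize();    // wait for the kernel so the printfs appear
    return 0;
}
```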
Fermi also reworked the on-chip memory: each SM gets 64KB that can be configured as 48KB of shared memory with 16KB of L1 cache, or as 16KB of shared memory with 48KB of L1 cache.
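You can actually ask the runtime for whichever split suits your kernel. A rough sketch of my own - the cudaFuncSetCacheConfig call is the real CUDA runtime API; the kernel itself is just a made-up stand-in:

```
#include <cstdio>

// Sums 256-element tiles of the input, staging each tile in on-chip shared memory.
__global__ void sumTile(const float *in, float *out)
{
    __shared__ float tile[256];                 // allocated from the SM's shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();                            // make the whole tile visible to the block

    if (threadIdx.x == 0) {                     // thread 0 reduces the tile
        float s = 0.0f;
        for (int k = 0; k < blockDim.x; ++k) s += tile[k];
        out[blockIdx.x] = s;
    }
}

int main()
{
    const int N = 1024, B = 256;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, (N / B) * sizeof(float));
    cudaMemset(d_in, 0, N * sizeof(float));

    // On Fermi, request the 48KB shared / 16KB L1 split for this kernel;
    // cudaFuncCachePreferL1 asks for 16KB shared / 48KB L1 instead.
    cudaFuncSetCacheConfig(sumTile, cudaFuncCachePreferShared);

    sumTile<<<N / B, B>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```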
There were some additional enhancements as well, such as:
The addition of more SFUs (Special Function Units), which execute transcendental instructions like sine, cosine, reciprocal and square root. Each SFU executes one instruction per thread, per clock, so a warp executes over eight clocks. The SFUs aren't coupled to the dispatch unit, meaning the dispatch unit can issue instructions to other execution units while the SFUs are occupied.
Improved Double Precision Performance for Floating Point Calculations.
ECC (Error Correcting Code) support - this allows GPU-computing users to deploy large numbers of GPUs with less fear of memory errors due to electro-magnetic interference. In the GPUs that employ it, that means greater accuracy in the functions performed... though at the cost of some speed.
Faster Context Switching.
Faster Atomic Operations.
C++ support - albeit with some caveats, like specialized libraries, variables, extensions and restrictions. So C++ will work, but it isn't the exact same C++ every programmer is trained to code in - only NVidia's special variation of it (there's a small kernel sketch just after this list).
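Here's that sketch - my own toy example, not NVidia's, and it needs to be built for a Fermi-class card or newer (nvcc -arch=sm_20 or better), since it leans on float atomics and device-side printf. It ties a few of the bullet points together: a plain C++ template compiled by nvcc, the fast __sinf intrinsic that runs on the SFUs, and an atomicAdd of the kind Fermi sped up:

```
#include <cstdio>

// A templated kernel - plain C++ templates compile fine under nvcc, which is
// the "C++ support with caveats" mentioned above: standard C++ syntax, but
// decorated with CUDA qualifiers like __global__ and built-ins like threadIdx,
// and compiled by NVidia's toolchain rather than a stock compiler.
template <typename T>
__global__ void sumOfSines(const T *angles, float *total, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __sinf is the fast, approximate sine intrinsic; it is serviced by
        // the SM's Special Function Units rather than the regular CUDA cores.
        float s = __sinf(static_cast<float>(angles[i]));

        // atomicAdd on floats is one of the atomic operations Fermi sped up;
        // every thread safely accumulates into the same memory location.
        atomicAdd(total, s);
    }
}

int main()
{
    const int n = 1024;
    float *d_angles, *d_total;
    cudaMalloc(&d_angles, n * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemset(d_angles, 0, n * sizeof(float));
    cudaMemset(d_total, 0, sizeof(float));

    sumOfSines<float><<<(n + 255) / 256, 256>>>(d_angles, d_total, n);
    cudaDeviceSynchronize();

    float total = 0.0f;
    cudaMemcpy(&total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum of sines = %f\n", total);

    cudaFree(d_angles); cudaFree(d_total);
    return 0;
}
```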
Now, I know all of that is a bit technical, but what it boils down to is this: think of CUDA cores as identical little robot workers. Each of them is capable of the exact same task as his neighbor - this is what "scalar" means here. As long as the tasks being handed to a group of them require the same functions to be performed, the little guys are brutally efficient. Fermi just made them a little smarter and gave us a lot more of them to work with. The obvious drawback, of course, is that this type of design is a bit like the American muscle cars of yore - simple, powerful and gas-guzzling. NVidia's offerings are much the same: the design is relatively simple, brutally powerful and insanely power-hungry.
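To show what "as long as they're all handed the same functions" means in practice, here's a toy sketch of my own (illustrative only - both kernels are invented for this post): when neighboring threads in a warp take different branches, the hardware runs each path in turn with part of the warp sitting idle.

```
// Why the "identical robot workers" analogy matters. Inside a warp, all 32
// threads execute the same instruction; when a branch splits them, the
// hardware runs each path serially with part of the warp idle, so the
// divergent kernel wastes cycles the uniform one doesn't.

__global__ void uniformWork(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f + 1.0f;           // every lane does the same thing
}

__global__ void divergentWork(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)                            // even lanes take one path...
        data[i] = data[i] * 2.0f + 1.0f;
    else                                       // ...odd lanes take another,
        data[i] = data[i] * 0.5f - 1.0f;       // so each warp runs both paths
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    uniformWork<<<n / 256, 256>>>(d);
    divergentWork<<<n / 256, 256>>>(d);        // same launch shape, more wasted cycles
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```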
And I use the term "brutally" deliberately, because NVidia's approach is what everyone in the hardware-design industry would refer to as "brute-force computing" - a label that has garnered more negative connotations than it deserves, as you will soon find out by reading my blogs.
Yes, I know, these sound like some pretty awesome revisions. But think about it for a moment: aren't these just the logical improvements anyone could have seen coming after watching "Fermi" perform, given the current evolution of technology? Yes, yes they are. These are exactly the revisions one would and should expect. And that's the good part; I haven't gotten to the bad part yet.
The bad news is that NVidia decided to remove or disable the workstation-oriented parts of the core in Kepler's consumer products. While that means the consumer cards are more streamlined and - as a result - more efficient at consumer-level tasks (like video games), it also means that anyone who'd hoped to keep buying NVidia's consumer products and get workstation-like performance out of the more reasonably priced cards is in for a very rude awakening. Because in this regard, Kepler is complete trash. What does this mean? Significantly reduced double-precision FP performance, little to no viewport acceleration in 3D editors and drastically reduced scientific-calculation throughput. Don't like it? Better buck up, save your money and buy yourself that spankin'-new $6,000 workstation card; because without it, you sir, are royally f***ed if you insist on NVidia.
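Before you bet a workstation workload on any card, it's worth at least asking the CUDA runtime what it thinks it's sitting on. A quick sketch of my own - the runtime won't hand you the FP64-to-FP32 throughput ratio (that you have to dig out of the spec sheet), but it will confirm the compute capability and whether ECC is switched on:

```
#include <cstdio>

// Query what the CUDA runtime reports about the installed card before betting
// on it for double-precision or ECC-dependent work.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // device 0

    printf("Device:             %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("ECC enabled:        %s\n", prop.ECCEnabled ? "yes" : "no");
    printf("Global memory:      %zu MB\n", prop.totalGlobalMem >> 20);
    return 0;
}
```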
Kepler did, however, add support for more than two monitors on a single card, without being forced to run an SLI configuration. A first for NVidia, and one that I applauded profusely and loudly.
*Scalar - being ladder-like in arrangement or organization; representable by its position on a line.
ATi/AMD, much like NVidia, is a bit close-lipped about their architectures with the general public, but here's as much of the skinny as you'll ever find.
Each of ATi's stream processors is a block of ALUs (the VLIW5 design). Four of these ALUs are capable of 1 FP MAD/ADD/MUL or 1 INT ADD/AND/CMP, as well as integer shifts. The fifth unit adds INT MUL support and FMA (Fused Multiply-Add) - which NVidia didn't add until Fermi - while the sixth supports all of the above and simultaneously manages its five neighbors. As long as there is sufficient ILP (Instruction Level Parallelism) in the instruction stream and the compiler is optimized to support it, up to 5 instructions can be co-issued per ALU block - as opposed to the single instruction per block that NVidia's cards are capable of supporting. This particular architecture also has 32KB Local Data Shares and 64KB Global Data Shares - something else NVidia didn't add until "Fermi".
Here comes the Achilles heel: what this essentially means is that, for ATi Stream to be as productive as it's capable of being, the program using it must be optimized specifically for the architecture. The upside of this "Achilles heel" is that the GPU is arranged in SIMD engines (remember what that means from our discussion of Hyper-Threading?) similar in many regards to a CPU, which allows for heterogeneous computing between the CPU and GPU - that is, the GPU performing the exact same (or similar) calculations as the CPU - something NVidia still doesn't possess even after its "Kepler" revision of "Fermi".
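Since "sufficient ILP" is doing a lot of work in that explanation, here's a plain C-style sketch of my own (not AMD code, and not tied to any particular toolchain) showing the difference between a loop a VLIW5 compiler can't pack and one it can:

```
#include <cstdio>

// In noILP every multiply-add depends on the previous one, so only one ALU
// slot in a block can be kept busy per step. In withILP the four accumulators
// are independent, so a VLIW compiler is free to pack four multiply-adds into
// one instruction word and co-issue them.

float noILP(const float *x, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc = acc * x[i] + 1.0f;        // each step waits on the previous result
    return acc;
}

float withILP(const float *x, int n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    for (int i = 0; i + 3 < n; i += 4) {
        a0 = a0 * x[i + 0] + 1.0f;      // four independent chains:
        a1 = a1 * x[i + 1] + 1.0f;      // none of these waits on its neighbor,
        a2 = a2 * x[i + 2] + 1.0f;      // so they can be scheduled side by side
        a3 = a3 * x[i + 3] + 1.0f;
    }
    return a0 + a1 + a2 + a3;
}

int main()
{
    float x[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    printf("%f %f\n", noILP(x, 8), withILP(x, 8));
    return 0;
}
```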
AMD "GCN" - Graphics Core Next
- Support for x86 addressing, with a unified address space for the CPU and GPU
- 64-bit addressing
- The ability for the GPU to send interrupts to the CPU on various events (such as page faults)
- Usage of RISC SIMD instructions instead of VLIW MIMD
- Support for Partially Resident Textures, which enables virtual memory support through DirectX and OpenGL extensions
- "PowerTune" support, which dynamically adjusts performance to stay within a specific TDP
- Usage of liquid-chamber cooling technology over vapor chamber
- 3GHz HDMI support
- Multi-stream, hardware-based H.264 encoding with the Video Codec Engine (VCE)
In addition, AMD replaced the SIMD engines with CUs - Compute Units - which are scalar in nature and are paired with Texture Fetch, Texture Filter, Branch & Message and Vector units. Each CU has four texture units tied to a 16KB cache that is read/write. Historically, L1 was only used to read textures (and that includes all of NVidia's current offerings); now reads and writes can go back and forth through the same cache, enhancing efficiency. Each CU also features its own scheduler, 4KB of Scalar Registers and a 64KB Local Data Share, as well as 4x 64KB Vector Registers.
According to AMD, GCN supports full C++ with no special restrictions while offering optimized extensions, libraries and variables for enhanced performance. Apparently, the specialized extensions, libraries and variables aren't necessary for programming for GCN, but can certainly be used to squeeze out extra performance. Add to that the fact that GCN enhances the former ATi Stream technology, makes it scalar, leaves VLIW4 (Very Long Instruction Word, version 4) behind, and broadens and optimizes the architecture overall, and GCN becomes one scary customer indeed.
The truth about it is this: GCN wasn't designed to compete with "Kepler" but rather to outpace "Fermi" and carry AMD's graphics card line into a much more profitable and easier-to-implement future. The fact that it is so robust that it keeps up with NVidia's Kepler revision of Fermi (roughly 15-30% under/over, depending on the particular test or application) is just further testament to the design's superiority. If AMD had been able to revise GCN as many times as NVidia has revised Fermi... well, the implied results for us - the consumers - are positively mind-boggling. The fact that AMD managed to do all of this while keeping all of the components that allow workstation applications and calculations - all at a fair price (for the most part) - goes further to show their desire to be both the best and the most beloved.
The Bottom Line
For shame, AMD; because of your greed I was forced to wait for - and, for the very first time, purchase - a mid-range video card featuring the glorious new architecture (actually, I got 2 for CFX so I'd get similar, if not better, performance for ~$80 less). I've never had to go mid-range before; it was - and is - a bit humiliating. Which brings me to my biggest gripe with everything AMD/ATi has manufactured in the last few years: to get anything above dual-Crossfire GPUs, you have to buy the high-end cards, which entirely defeats the purpose of having multiple GPUs. Why on Earth wouldn't AMD have seen fit to enable at least their mid-range products to quad-Crossfire? Do they not see how that could financially benefit them in the long run? Silly, silly designers.
Perhaps a Rectal-Craniotomy is in order... for both NVidia and AMD. So that they may have their heads surgically removed from their bums.
What, then, should you buy? As always, I ask the following in response:
What are you planning on doing?
What is your budget?
How hot do you want your PC to get?
What are the rest of the components in your system going to be?
What programs/games will you be running?
What kind of performance is most important to you (and yes, performance can be broken down into distinct groups)?
How much of a Hammer Legion Member are you, AMD or NVidia?
These are questions you simply must answer before choosing any set of components for a PC build. If you're one of these morons who thinks it's okay to simply buy the most expensive thing there is and that you'll get the best performance that way, then I feel sorry for both your stupidity and the substantial and unnecessary loss your wallet will suffer. Optimize, people; Optimize. You'll be more satisfied with your computing experience and still have plenty of money left over to take your sweetheart out to that nice restaurant like she deserves. Or buy your Mom that necklace she always wanted. Or buy yourself that bevy of games you've been drooling over.
But, hey; what do I know? I'm just a hardware technician.