It's no secret that the recent leaps in AI, including ChatGPT and Stable Diffusion, are impressive. They can create text, images, video, and more, all based on a text prompt, with very little user input. The other thing they have in common is that they all run in the cloud, so they're on somebody else's computer, and can be expensive once subscription fees are taken into account. To save some money, you can run many AI tasks on your home computer, from LLMs to the datasets that train them.
But exactly what hardware do you need to power these advanced algorithms? While you don't necessarily need the best CPU to run deep learning tasks, you probably will need one of the best graphics cards. That's because newer GPU hardware, like Nvidia's Tensor cores, is custom designed for accelerating AI tasks. You'll also want a hefty amount of VRAM, so that more data can sit in active memory, which saves time while training. That's why we've put together this list of the best GPUs for deep learning tasks, to make your purchasing decisions easier.
Our picks of the best graphics cards for deep learning use
MSI GeForce RTX 4070 Ti Super Ventus 3X
Plenty of power with a relatively affordable price tag
- Memory: 16GB GDDR6X
- Boost Speed: 2640 MHz
- CUDA Cores: 8448
- Architecture: Ada Lovelace
- Memory bus width: 256-bit
The MSI GeForce RTX 4070 Ti Super Ventus 3X features fourth-generation Tensor cores, which are purpose built for accelerating AI tasks. It's a cost-effective way of getting into deep learning models, and it also has plenty of VRAM to keep up with your demanding tasks.
- 264 fourth-generation Tensor cores
- 16GB of GDDR6X VRAM
- Huge heatsink with three fans
- Requires a hefty PSU and cooling
One of the most important things to consider while shopping for a graphics card for deep learning tasks is how many AI accelerator cores it has onboard. These cores, in case you are wondering, can perform very efficient matrix multiplication, which is the most time-intensive type of calculation in any deep neural network. As an Nvidia 40-series GPU, the RTX 4070 Ti Super card comes with 4th-gen Tensor Cores, which are significantly better than the first-gen Tensor Cores that were introduced with the Tesla V100 server card.
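To get a feel for why matrix multiplication dominates, it helps to count the arithmetic involved. The sketch below uses the standard 2*m*k*n operation count for multiplying an m-by-k matrix by a k-by-n matrix. The layer size and batch size are hypothetical, picked only for illustration, and are not taken from any particular model.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """Floating-point operations for an (m x k) @ (k x n) product.

    Each of the m*n outputs needs k multiplies and k adds
    (the usual convention), which gives 2*m*k*n in total.
    """
    return 2 * m * k * n


# Hypothetical layer: a batch of 32 inputs through a 4096 x 4096 weight matrix.
flops = matmul_flops(32, 4096, 4096)
print(f"{flops:,} floating-point operations for one pass through this layer")
```

That works out to just over a billion operations for a single layer and a single small batch, and a deep network repeats this for every layer on every step. That is the kind of work AI accelerator cores are built to speed up.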
MSI GeForce RTX 4070 Ti Super Ventus 3x review: The awkward middle child
With its middling performance, the RTX 4070 Ti is having an awkward middle-child moment.
This particular RTX 4070 Ti Super model from MSI comes with 264 Tensor Cores, slightly more than you get on the regular RTX 4070 Ti. These 4th-gen Tensor cores are so fast that they sit idle roughly half of the time during GPT-3-sized training, with the bottleneck being how fast data can arrive from global memory. The card also has 16GB of GDDR6X VRAM with 672GB/s of bandwidth to quickly fill the Tensor Cores back up with data. It may not deliver the kind of performance you'd expect from a server card, but it should be enough for most homelab use cases. The RTX 4070 Ti Super is also a better graphics card overall than the RTX 4070 Ti, which we previously highlighted as the best pick.
Nvidia GeForce RTX 4070 Super FE
The best priced GPU with Tensor cores
- Brand: Nvidia
- Cooling Method: Air
- Interface: PCIe 4.0
- Memory: 12GB of GDDR6X
- Power: 220W
The Nvidia GeForce RTX 4070 Super is a great 1440p gaming card, but it's also perfect for deep learning tasks like image generation or running local text-based LLMs, as it has a large number of fourth-gen Tensor cores and 12GB of VRAM.
- 224 Tensor cores for accelerating AI workflows
- 12GB of GDDR6X supplying 504.2GB/s of bandwidth
- A great-looking heatsink
- Limited VRAM for AI tasks
The RTX 4070 Super shares a lot with the RTX 4070 Ti Super, including the fourth-generation Tensor Cores that are crucial for deep learning workflows. The RTX 4070 Super Founders Edition has 224 Tensor Cores, which makes it better than its predecessor, which we had previously picked as our best value GPU. The L2 cache also stands at a respectable 48MB, which is pretty good for a card like this. You are only looking at 12GB of GDDR6X VRAM with 504.2GB/s of bandwidth, though, which may not be enough for more demanding AI tasks.
Nvidia GeForce RTX 4070 Super review: The best mainstream GPU got better
The Nvidia GeForce RTX 4070 Super brings even more value to the best of the 40-series.
The upgraded RTX 4070 Super FE is also faster than the older model, and that'll make a difference in AI-accelerated workflows as well. Outside of AI-accelerated tasks, the RTX 4070 Super FE is a pretty good option, especially for those looking to pick something up for 1440p gaming. We had a great time putting our review unit through its paces, and it held up pretty well for the most part. It's also relatively easy to power and keep cool, and we love the minimal design Nvidia went with for this one.
MSI Suprim Liquid X GeForce RTX 4090
Lots of VRAM and Tensor cores
The MSI Suprim Liquid X GeForce RTX 4090 card is a slim, watercooled variant of Nvidia's flagship, with a 240mm radiator to keep the card cool under any workload. That'll come in handy during AI tasks, which can take significant time to complete.
- 512 fourth-generation Tensor cores for AI tasks
- 24GB of VRAM with 1,008GB/s of bandwidth
- Watercooled for thermal performance
- 450W power requirement
- Need a large case to fit the radiator
When we reviewed the Nvidia GeForce RTX 4090 in its Founders Edition form at launch, we called it "the untouchable king of performance." That was based on gaming workloads: it demolished 4K resolution and provided enough power for 8K gaming, if you have the monitor to display it. It also comes with a hefty 450W TDP, but we saw it didn't go much over 400W during gaming loads, and with AI workloads leaving the Tensor Cores idle for roughly half the time, it's a fair bet that it won't get anywhere near that TDP. The FE variant, with its heatsink and two fans, stayed under 65 degrees Celsius even under 420W workloads, and this particular MSI Suprim Liquid X comes with an AIO watercooler with a 240mm radiator to wick away heat from the core and memory. I expect it will stay well under that 65C mark during workloads, which means your expensive RTX 4090 should last longer than if it were running hotter.
Nvidia GeForce RTX 4090 review: The untouchable king of performance
There is no other graphics card like the Nvidia GeForce RTX 4090. Its power is unmatched, as is its size and its power consumption.
As for AI workloads, the RTX 4090 has enough power for the trickiest workloads, like transformers used to train LLMs, with 512 Tensor Cores providing over twice the power of the RTX 4070 Ti. And with 24GB of GDDR6X and a 384-bit memory bus, it brings 1,008GB/s of bandwidth to your deep learning needs. That's double that of the RTX 4070 Ti, and two-thirds of the bandwidth from the substantially more expensive server-class GPUs with Tensor Cores. Make no mistake about it, this is the GPU that you should aim for when doing deep learning, and the only reason it's not getting the top pick in this list is that it's often out of stock everywhere, as companies buy them in pallet loads to run their own AI tasks on.
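The bandwidth figures quoted for these cards follow from a simple relationship: bus width multiplied by the per-pin data rate, divided by eight bits per byte. Here is a minimal sketch of that arithmetic. The bus widths come from this article, but the per-pin data rates (21 Gbps for the GDDR6X on the 40-series cards, 19.5 Gbps on the RTX 3090) are assumptions based on the commonly published specs, so treat the output as a sanity check rather than an official figure.

```python
def memory_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s.

    bus_width_bits * data_rate_gbps gives Gbit/s across the whole bus;
    dividing by 8 converts bits to bytes.
    """
    return bus_width_bits * data_rate_gbps / 8


# (bus width in bits, assumed per-pin data rate in Gbps)
cards = {
    "RTX 4070 Ti Super": (256, 21.0),
    "RTX 4090": (384, 21.0),
    "RTX 3090": (384, 19.5),
}

for name, (bus_width, data_rate) in cards.items():
    print(f"{name}: {memory_bandwidth_gb_s(bus_width, data_rate):.1f} GB/s")
```

With those assumed data rates, the RTX 4070 Ti Super and RTX 4090 land on the 672GB/s and 1,008GB/s quoted in this list. The RTX 3090 comes out at 936.0GB/s, a hair under the 936.2GB/s usually listed, which is down to rounding in the published memory clock.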
Nvidia H100
For server-grade tasks
The Nvidia H100 is specifically built for AI-accelerated workflows in workstation or server installs, as it doesn't have any graphics output ports. With 80GB of VRAM, it can tackle advanced tasks like transformers or training LLMs for other uses.
- 80GB of VRAM
- PCIe 5.0
- 51 teraFLOPS of FP32 performance
- Costs as much as a midrange sedan
- No fans, so have to rely on server fans
The Nvidia H100 PCIe 80 GB is one of the latest AI-focused professional graphics cards from the company, built to chew through AI-accelerated tasks in a server setting with up to eight of these expensive GPUs running in parallel. According to Tim Dettmers, it brings twice the relative performance of the RTX 4090 in 16-bit training, 16-bit inference, and 8-bit inference tasks. It has 456 Tensor Cores and 2TB/s of memory bandwidth from 80GB of HBM3 memory, and it's also the first GPU to support PCIe 5.0 for faster connections to the motherboard. It also supports NVLink, which directly connects the GPUs together, so they bypass the motherboard and CPU when passing data between them.
With a 350W maximum TDP, it draws power from a 16-pin PCIe cable. The two-slot thermal solution is passive, which is expected for a server-class GPU like this. Using it in a desktop workstation will require some ingenuity to get enough airflow to keep it cool. It's not just a hardware solution, as it comes with a five-year subscription to Nvidia AI Enterprise, which is a fully featured AI software platform with over 100 frameworks, pretrained models to get started quicker, and more to help AI professionals do their job. This is the current pinnacle of AI-accelerated GPUs, and is more versatile than the Tensor Processing Units (TPUs) that Google uses in Google Cloud for AI training. The only real drawback to these graphics cards is the price, which is as much as a family car. Then again, for companies invested in AI training, the only thing that matters is how quickly that training can be completed, and that's where the H100 excels.
Nvidia Tesla V100
Server-class AI computing at an affordable price
The Tesla V100 was the first graphics card to feature Tensor cores, which are designed for accelerating AI workflows and deep learning models. It's a few years old at this point but is still capable, and is a great starting point for building a server for deep learning tasks.
- 16GB of VRAM
- 640 first-generation Tensor cores
- Relatively low 350W power requirement
- No active cooling
- No display outputs
The Nvidia Tesla V100 was the first graphics card to feature the Volta architecture, and the very first with Tensor Cores to accelerate AI workflows. The GV100 chip at its heart has 672 Tensor Cores on the full die, of which 640 are enabled on the V100. Now, it's worth mentioning that this first-generation Tensor Core isn't directly comparable to the second, third, or fourth generation cores, as they were improved and gained added functionality with each new release. Still, that's more Tensor Cores than an RTX 4090, and with 16GB of HBM2 memory on a 4,096-bit bus, the card pushes 897GB/s of bandwidth. That's a colossal amount, and will work wonders with image generation tasks.
Unfortunately, it will struggle with transformers, as those tasks work best with at least 24GB of memory to fit the huge datasets they need, but it will still get you going on your deep learning journey. It also has a relatively small L2 cache of 6MB, so it will be fetching data from global memory more often than newer graphics cards. With a two-slot, passively cooled design, the V100 is usable in workstations or servers, as long as you plan for enough airflow. It's powered by two 8-pin PCIe connectors for a total board power of 350W.
Zotac Gaming GeForce RTX 3090 Trinity OC
Very capable for machine learning
The Zotac Gaming GeForce RTX 3090 Trinity OC is the best value proposition from the Ampere architecture, with many third-generation Tensor cores to accelerate AI tasks and 24GB of VRAM for fairly large data sets.
- 328 third-generation Tensor cores
- 24GB of VRAM with 936.2 GB/s of bandwidth
- Large heatsink with three fans
- Three slot thickness
The Nvidia 3000-series is still a capable force in AI tasks and the GeForce RTX 3090 is one of the best, if you can find one for sale these days. This model from Zotac has a huge 24GB of GDDR6X memory, which, when coupled with the 384-bit bus, means 936.2GB/s of bandwidth for speedy memory access for deep learning datasets. That's higher than most of the other entries on this list, but comes with a price tag to match the performance. With only 6MB of L2 cache, it will be fetching data from global memory more often, but that's going to be helped by the huge bandwidth numbers, so the performance won't suffer that much.
With a 350W TDP fed by two 8-pin PCIe cables, it will be easy to power and won't require a new ATX 3.0 compliant PSU to run. The large 3-slot heatsink and three fans will keep it cool, especially with the on/off cycle for the Tensor Cores as they wait for more data to be fed from the memory. The older Tensor Cores are less powerful than those in Nvidia's 4000-series, but the more important factor with this card is the 24GB of VRAM, which enables the use of the latest LLMs and, likely, the datasets they'll need for some time to come. That's because until consumer GPUs go higher than 24GB, AI scientists will be aiming to fit their models into that amount of memory.
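To see why VRAM capacity decides which models you can load, here is a rough sketch that estimates the memory taken up by a model's weights alone: parameter count multiplied by bytes per parameter. The parameter counts are hypothetical round numbers, and the 80% usable-VRAM headroom is an assumption to leave space for activations and framework overhead. Training needs considerably more memory than this, as gradients and optimizer state come on top.

```python
def weights_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits_in_vram(num_params: float, bytes_per_param: int,
                 vram_gb: float, headroom: float = 0.8) -> bool:
    """True if the weights fit in the assumed usable share of VRAM.

    headroom=0.8 is an illustrative assumption, not a measured value.
    """
    return weights_gb(num_params, bytes_per_param) <= vram_gb * headroom


# Hypothetical model sizes; 2 bytes per parameter is FP16, 1 byte is INT8.
for label, params, nbytes in [("7B FP16", 7e9, 2), ("7B INT8", 7e9, 1), ("13B FP16", 13e9, 2)]:
    size = weights_gb(params, nbytes)
    verdict = {vram: fits_in_vram(params, nbytes, vram) for vram in (12, 16, 24)}
    print(f"{label}: ~{size:.0f} GB of weights, fits by VRAM size: {verdict}")
```

Under these assumptions, a 7-billion-parameter model in FP16 only fits comfortably on the 24GB cards, while the same model quantized to 8-bit squeezes onto a 12GB card.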
XFX Speedster MERC310 AMD Radeon RX 7900XTX Black 24G
24GB of VRAM and ROCm support
The XFX Speedster Merc310 AMD Radeon RX 7900XTX Black is a monster of a GPU with 24GB of VRAM, and it recently gained support for PyTorch 2.0.1 through the ROCm open software platform, which makes it viable for deep learning. The large memory capacity means it's especially suited for training large language models (LLMs).
- 192 AI accelerators
- 24GB of VRAM with up to 960GB/s of bandwidth
- Large cooler with three fans
- Limited community support as Nvidia is more widespread
The main reason that most of the graphics cards on this list are from Nvidia is that Tensor Cores make a vast difference in how fast GPUs can handle deep learning tasks. While earlier AMD architectures like RDNA and RDNA2 had great silicon with high FP16 performance and high memory bandwidth, the lack of AI accelerators made them a non-starter for professional use. With RDNA3, AMD introduced AI Accelerators, its version of Tensor Cores, with 192 on the flagship Radeon RX 7900 XTX. With 24GB of GDDR6 memory, a 384-bit bus, and 96MB of L3 cache, this graphics card could get up to 3,500 GB/s of memory bandwidth while using Infinity Cache. Those are the four most important requirements for deep learning tasks covered, and we already know that AMD is good for FLOPS performance. The only piece of the puzzle missing is software support, as all the hardware in the world can't help you without something to run on it.
AMD Radeon RX 7900 XTX review: A substantial step-up for RDNA 3
AMD hits back at NVIDIA's mighty RTX 40 series GPUs.
AMD GPUs use ROCm software to provide a way to use the widely used PyTorch framework for building deep learning models. Until RDNA3, AMD didn't offer any AI acceleration on consumer GPUs, so while ROCm was usable, it was slower than alternative graphics cards at the same price. Now, with the release of ROCm 5.7.1 for Ubuntu Linux, two consumer GPUs get support for PyTorch 2.0.1 with acceleration: the Radeon RX 7900 XTX and the Radeon Pro W7900. With the 24GB of VRAM on this card from XFX, you have ample space to train LLMs or run other deep learning tasks.
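Once PyTorch is installed, a quick way to confirm that it can see the card is to query it from Python. This is a minimal sketch, assuming a PyTorch build with either CUDA or ROCm support. ROCm builds expose the GPU through the same `torch.cuda` interface and set `torch.version.hip`, which is how the snippet tells the two apart. The function name `describe_accelerator` is our own, not part of PyTorch.

```python
def describe_accelerator() -> str:
    """Report which GPU backend, if any, the installed PyTorch can use."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    # ROCm builds of PyTorch reuse the torch.cuda namespace for AMD GPUs.
    if not torch.cuda.is_available():
        return "PyTorch found no supported GPU and will run on the CPU"

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    backend = "ROCm/HIP" if getattr(torch.version, "hip", None) else "CUDA"
    return f"{backend} device: {torch.cuda.get_device_name(0)}"


print(describe_accelerator())
```

If this reports the CPU on a machine with a supported Radeon card, the usual suspects are a PyTorch build without ROCm support or a ROCm version that doesn't match the driver.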
ASRock Phantom Gaming Intel Arc A770
Surprisingly capable, especially at image generation tasks
The ASRock Phantom Gaming Intel Arc A770 is Intel's flagship discrete GPU, with 16GB of VRAM for large data sets, and 512 of Intel's version of Tensor cores for accelerating AI workflows.
- 16GB of VRAM with 512GB/s of bandwidth
- 512 Intel tensor cores
- Relatively low 225W power requirement
- Need access to ReBar or Smart Access Memory for best performance
While AMD graphics cards only recently got support for AI acceleration, the discrete Arc cards from Intel came with the company's own version of Tensor Cores straight out of the gate. These cores are called Intel Xe Matrix Extensions Engines, or Intel XMX Engines for short. The Intel Arc A770 comes with 512 XMX Engines, which are used for XeSS upscaling in games that support it. They're also general purpose AI accelerators, and can be used for deep learning tasks. And with 16GB of VRAM and a decent 512GB/s of bandwidth, you can use relatively large models for LLMs or image generation.
Intel Arc A770 review: This is only the beginning
Intel's flagship discrete GPU is perfectly capable of delivering solid performance for deep learning tasks.
The new SYCL Joint Matrix Extension makes it so Intel XMX can be used in the same way as Nvidia's Tensor Cores, accelerating deep learning frameworks like TensorFlow and libraries like oneDNN. Intel has a robust developer team that has been cranking out AI tools, drivers, and a full ecosystem of AI software. They've got in-depth guides for getting deep learning software like TensorFlow running on Arc GPUs, along with anything else you might need to know. The one big drawback is that you need a relatively new motherboard and CPU that support Resizable BAR. As Intel has said, the performance of Arc GPUs won't be great without it.
What you need to know about picking a GPU for deep learning tasks
When picking a graphics card for deep learning tasks, it's important to know which specifications are relevant, and in which order they are important. One of the leading voices in making deep learning more accessible is Tim Dettmers, and we used his expert advice for picking our choices. The primary factor should be the number of Tensor cores, which are only found on Nvidia graphics cards from the Volta architecture onwards, and on consumer graphics cards starting from Turing, the Nvidia RTX 2000-series. With the Ada Lovelace architecture, Tensor cores are in their fourth generation, and as they have been improved each time, the latest graphics cards are the best to pick up. Then memory bandwidth comes into play, then cache configurations, and finally FLOPS. The other thing to remember is that the amount of VRAM dictates the tasks you can run, with 12GB being a minimum for image generation, and 24GB for work with transformers.
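As a rough illustration of that ordering, the sketch below filters cards by a minimum VRAM requirement and then sorts what's left by Tensor core count, using memory bandwidth as the tie-breaker. It only compares the three Ada Lovelace cards from this list, using the specs quoted above, because core counts from different Tensor core generations aren't directly comparable. The `Card` type and `shortlist` function are hypothetical helpers made up for this example, and cache and FLOPS are left out to keep it short.

```python
from typing import NamedTuple


class Card(NamedTuple):
    name: str
    tensor_cores: int
    bandwidth_gb_s: float
    vram_gb: int


# Specs as quoted in this article; all three use fourth-gen Tensor cores.
ADA_CARDS = [
    Card("RTX 4070 Super", 224, 504.2, 12),
    Card("RTX 4070 Ti Super", 264, 672.0, 16),
    Card("RTX 4090", 512, 1008.0, 24),
]


def shortlist(cards: list[Card], min_vram_gb: int) -> list[Card]:
    """Drop cards with too little VRAM, then rank the rest.

    Ranking key: Tensor core count first, memory bandwidth second.
    """
    eligible = [card for card in cards if card.vram_gb >= min_vram_gb]
    return sorted(
        eligible,
        key=lambda card: (card.tensor_cores, card.bandwidth_gb_s),
        reverse=True,
    )


# 12GB is the image-generation minimum and 24GB the transformer minimum cited above.
for min_vram in (12, 24):
    names = [card.name for card in shortlist(ADA_CARDS, min_vram)]
    print(f"At least {min_vram}GB of VRAM: {names}")
```

Price and availability don't feature in this sketch, and in practice they often end up deciding between cards that clear the VRAM bar.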
If you're only starting out getting to grips with deep learning tasks, you don't want to dive in at the deep end. That's why my recommendation for starting out is an Nvidia GeForce RTX 4070 Ti Super, like the MSI Ventus 3X model. With 16GB of VRAM, it's got enough for image generation workloads, and the fourth-gen Tensor cores will chew through tasks, saving you time. For moving on to transformers, or to generate images or other LLM outputs faster, I recommend any Nvidia GeForce RTX 4090 model that you can find in stock. With 24GB of VRAM you'll be able to use larger datasets, and the increase in Tensor cores will be noticeable. The reasons that make it such a good buy for home users are also why you can't find any in stock, as companies have been buying them in droves to power their own AI aspirations.
If money is no object, and you're making serious income from your deep learning tasks, the Nvidia H100 is the best server-class GPU you can buy as a consumer to accelerate AI tasks. With 80GB of VRAM, you can use significantly larger datasets loaded into memory, opening access to tasks that you can't achieve on desktop-class cards. And to round off your deep learning rig, you'll want to use one of the best motherboards to tie everything together. Here, you're probably going to want to look for stability and longevity, as you won't be risking overclocking, which would be disastrous if it failed part-way through training a model.
MSI GeForce RTX 4070 Ti Super Ventus 3X
- Memory: 16GB GDDR6X
- Boost Speed: 2640 MHz
- CUDA Cores: 8448
- Architecture: Ada Lovelace
- Memory bus width: 256-bit
- MSRP: $799
The MSI RTX 4070 Ti Super Ventus 3X is our pick for the best overall graphics card you can buy for deep learning tasks in 2024.