Google AI Infrastructure PM on New TPUs, Liquid Cooling and More
At its Cloud Next 25 conference earlier this year, Google launched Ironwood, its latest custom Tensor Processing Unit (TPU) AI accelerator, which easily outperforms all of its previous-generation chips. I sat down with Chelsie Czop, a senior product manager for Google's AI Infrastructure, to talk about Ironwood. We also discussed how Google thinks about GPUs versus TPUs, how it builds hardware for models that are changing at an ever-increasing pace, how it gets data centers ready for next-gen chips, and more.
A full Ironwood pod, with 9,216 chips, provides a total of 42.5 exaflops of compute, Google says. Ironwood also delivers twice the performance per watt of the previous TPU generation.
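To put the pod figure in per-chip terms, you can divide Google's stated total by the chip count. This is rough arithmetic on the two published numbers, not an independently measured result, and it assumes the 42.5 exaflops is split evenly across all 9,216 chips:

```latex
\frac{42.5 \times 10^{18}\ \text{FLOPS}}{9{,}216\ \text{chips}}
  \approx 4.61 \times 10^{15}\ \text{FLOPS per chip}
  \approx 4.6\ \text{petaflops per chip}
```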
As Czop noted, building these chips is always a tradeoff.
“To be able to design these systems, too, it’s interesting because you go back to the constraints that you have: it’s power, it’s thermal — being able to cool it [because the] more power you bring in, the hotter it gets — and then being able to interconnect all these chips together,” she explained. “So it comes incrementally, and then you look at it, and you look back through the generations, and you realize how far you’ve been able to come and how much that leap has been from the beginning.”
On the thermal side, Google started using liquid cooling years ago, largely to keep its early TPUs cool. Ironwood uses Google's fourth generation of liquid cooling, Czop said, though she also noted that not every TPU generation has been liquid-cooled.
“Just watching the evolution as to how Google’s been able to evolve the liquid cooling every single generation, it’s different when you and I talk about it, but then, you go into the data center and you see the little changes,” she said. “We run the liquid cooling pipes on the outside and the front of the systems when you’re walking down the row. And one of the reasons we do that is so that you can visibly see if there’s a leak, and from one generation to another, there’s, like, a spigot that’s pointed up and one that’s pointed down. I’m sure there were some lessons learned with that.”
With TPUs now this powerful, one question Czop often hears from customers is whether to run their workloads on TPUs or on GPUs (mostly NVIDIA's). The answer, she said, always depends on the customer's workload, use case and what their teams already use. For example, a team may need an NVIDIA framework that isn't available for TPUs to speed up its work. But for many businesses, it isn't an either/or decision.
“We’ve had customers that go from CPUs directly to TPUs. I was speaking with Moloco in a session earlier, and they had a 10x improvement just porting their training applications from CPUs over to TPUs. They have very embedding-heavy models, so they didn’t even optimize for how they could use the sparse cores that are in TPUs to be able to do that. But at the same time, they still use GPUs as well,” Czop said.
Yet while the hardware improves on an annual cadence, models and model architectures are changing much faster. Czop noted that her team's relationship with DeepMind helps it look ahead.
“It’s kind of funny to me when we’re writing our announcement blogs, because we’re like, we’re designing this hardware for the next generation, and we’re not even necessarily sure what those new model architectures are going to be,” she said. “And so especially now, we’re focused on think time compute, bringing training into the inference and thinking as you’re doing the inferencing. And that is right now on the bleeding edge. But that could change next week.”