Machine learning finds new ways for our data centers to save energy

The virtual world is built on physical infrastructure. Every search that gets submitted, email sent, page served, comment posted, and video loaded passes through data centers that can be larger than a football field. Those thousands of racks of humming servers use vast amounts of energy; together, all existing data centers use roughly 2% of the world’s electricity, and if left unchecked, this energy demand could grow as rapidly as Internet use. So making data centers run as efficiently as possible is a very big deal.

Thankfully, despite skyrocketing demand for computing, data center electricity use has flattened over the past few years, largely due to enormous opportunities to improve efficiency as these facilities scale up. [1] But capturing these opportunities can be a very complicated process. The standard measure of energy efficiency in data centers, power usage effectiveness (PUE), can be affected by dozens of variables. A typical facility has many types of equipment, including chillers, cooling towers, water pumps, heat exchangers, and control systems, each with its own settings and all interacting in intricate and often counterintuitive ways. Throw in factors like air temperature and fan speed, and the system complexity becomes astronomical. Consider one simplified scenario: just 10 pieces of equipment, each with 10 settings, would have 10 to the 10th power, or 10 billion, possible configurations. That is far more than anyone could test for real, yet far fewer than an actual data center’s possible configurations.
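
The arithmetic in that simplified scenario, and the PUE ratio itself, can be sketched in a few lines of Python. PUE is conventionally defined as total facility energy divided by the energy used by IT equipment; the function names and the sample meter readings below are illustrative assumptions, not figures from any Google facility.

```python
def configuration_count(pieces: int, settings_per_piece: int) -> int:
    """Joint configurations when every piece of equipment is set independently."""
    return settings_per_piece ** pieces


def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy over IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


# The scenario from the text: 10 pieces of equipment, 10 settings each.
print(configuration_count(10, 10))  # 10000000000

# Hypothetical readings: 1,200 kWh for the whole facility, 1,000 kWh for servers.
print(pue(1200.0, 1000.0))  # 1.2
```

A PUE of 1.0 would mean every kilowatt-hour goes to computing, so the closer the ratio gets to 1.0, the less energy is spent on cooling and other overhead.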

Cooling towers at a data center in Belgium

Google has been thinking about data center efficiency for as long as we’ve been thinking about data centers. Early on we decided to design and build our own facilities from the ground up, to enable us to continually pilot new cooling technologies and operations strategies. Our data centers employ advanced cooling techniques, using highly efficient evaporative cooling or outside air whenever possible instead of mechanical chillers. We reduced facility energy use by installing smart temperature and lighting controls and redesigning how power is distributed to minimize energy loss. Our high-performance servers are custom-designed to use as little energy as possible, stripped of unnecessary components like video cards, and are kept as busy as possible so we can do more with fewer servers. And so on.

The result of all these efforts: by spring 2014, Google data centers used 50% less energy than the industry average. Which of course meant the next question was whether they could run even leaner. An efficiency engineer named Jim Gao, his interest piqued by an online class on machine learning, decided to find out.

Machine learning gives computers the ability to learn things without being explicitly programmed, by teaching themselves through repetition how to interpret large amounts of data. Google already uses it to improve features like translation and image recognition. When you ask Google Photos for pictures of people hugging, it’s machine learning that finds the photos you’re after.

Water valve manifold and pressure sensors at a data center

Gao hoped it might help him better understand the blizzard of data center information by “finding the hidden story in the data.” He spent “six error-prone, head-banging months” building a proof-of-concept model of all the components in one data center. “It was super janky code,” he says, “very much a prototype, to prove that the idea was valid and worth pursuing.”

The initial results weren’t entirely promising. “The first predictions were totally off,” Gao admits. “The models didn’t do a very good job of predicting PUE or the consequences of our actions.” In fact, the model’s first recommendation for achieving maximum energy conservation was to shut down the entire facility, which, strictly speaking, wasn’t inaccurate but wasn’t particularly helpful either. “We had to force our AI to be a responsible adult, discipline itself a little bit,” Gao says. He changed variables and ran the simulations again, adjusting the model over time ever closer to the configuration that most accurately predicted — and thus was most likely to be able to improve — the facility’s actual performance. When he felt his prototype was sufficiently precise, he published a white paper and started working with the site operations team to implement the model’s recommendations in actual facilities.

At the same time, Google’s leading artificial intelligence research group, DeepMind, had caused a stir with a paper describing DQN, a computer agent that was really good at playing Atari games. All Atari games. It was one thing to train a program to play a particular game really well, but a program capable of teaching itself to excel at an entire range of games was something else altogether. In the machine learning community, this was mind-blowing stuff, and when Gao heard about it, he quickly sent DeepMind head Mustafa Suleyman an email with the following subject line: Machine learning + data centers = awesome?

Suleyman agreed that Gao was indeed onto something awesome, and DeepMind started working with Gao and his data center intelligence (DCIQ) team on more “robust and general” working models. Just as you don’t want one highly focused agent that can play one Atari game but a generalized intelligence that can learn all Atari games, general beats specific when it comes to data center machine learning as well. It would be relatively simple to create a custom program that models each data center, but “it would be much better,” Gao says, “if we created a general intelligence that everyone can take advantage of.”

Jim Gao on the Google campus

So that’s what they did. Eighteen months later, the models have been piloted at multiple facilities and have produced a 40% reduction in energy used for cooling and 15% reduction in overall energy overhead. Although one of these pilots has already succeeded in bringing the PUE at one of Google’s test data centers to a new low, the growing DCIQ team believes it has only scratched the surface of machine learning’s more general applications. Google’s environmental team wants our operations to emit less carbon. Hardware ops aspires to fewer component failures. The platforms people care about server energy consumption. Machine learning can help them all achieve their efficiency dreams.

Not to mention those of the rest of the world. “We’re trying to be really open source about this,” Gao says. “We strongly believe that the work we’re doing can benefit others as well.” A second white paper, due out soon with more details about DCIQ’s work, hopefully will help other data centers lower their energy usage, and numerous other types of facilities — power plants, factories, etc. — have infrastructures that might also benefit. We hope the work DCIQ has done and will do going forward will help other companies and industries get a lot greener, in both senses of the word.

  1. “United States Data Center Energy Usage Report,” U.S. Department of Energy, Lawrence Berkeley National Laboratory, 2016.