If data-driven discovery becomes the norm, more scientists will need to upgrade from their desktop computers to more powerful, scalable computing systems. Jeffrey Gardner, director of research–physical sciences at the eScience Institute at the University of Washington (UW), has the job of helping researchers with that migration.
In addition to being a facilitator of computational work, Gardner is a computational astrophysicist. He has run code that used all 100,000-plus central processing unit (CPU) cores and 10,000-plus hard drives of the supercomputer Kraken at the National Institute for Computational Sciences at the University of Tennessee, Knoxville. He works part-time at Google as a visiting scientist. Before joining UW, he was senior scientific specialist at the Pittsburgh Supercomputing Center. So he knows about resources for scientific computation.
It's not that computational resources are hard to come by; they are available from a variety of sources. In fact, in addition to his other duties, Gardner is UW's campus ambassador for the National Science Foundation's Extreme Science and Engineering Discovery Environment (XSEDE) program, which for 25 years has made computation and storage platforms available, free of charge, to academic researchers in the United States with high-performance computing (HPC) needs. "I have been shouting FREE COMPUTING TIME from the rooftops for about 5 years now," he writes by e-mail. "By funding a dozen or so sites across the country, NSF ensured that every researcher gets the same access to the resources no matter where they are located."
The Department of Energy and NASA also operate high-performance computing facilities, which are available to researchers whose projects are funded by those agencies. And at most top universities and research institutes, scientists can access high-performance computing clusters on campus, usually for a fee.
Today there's yet another new player on the scalable-computing scene: the cloud. At far-flung data centers, "elastic" clusters of computational capacity can be assembled on demand. This is possible because companies have made commercial cloud platforms (Amazon Web Services, Windows Azure, Google Compute Engine, and such) available to scientists. The primary appeal of this approach is its relatively low cost, made possible because the clusters are built from commodity hardware and software, that is, readily available computing components, says Joseph Hellerstein, manager of computational discovery for science at Google.
But there are other, practical advantages. Like the other computing resources available to scientists, scientific cloud computing has a niche.
To demonstrate the research possibilities of commercial clusters to scientists working in academia, in late 2011 Cycle Computing announced the BigScience Challenge, a competition that sought "the runts, the misfits, the crazy ideas that are normally too big or too expensive to ask, but might, just might, help humanity," according to the company's Web site. Jason Stowe, the company's CEO, says the goal of the competition was to allow scientists to think big in framing research questions, unconstrained by the availability of computational resources.
Victor Ruotti, who in 2011 was a computational biologist at the Morgridge Institute for Research at the University of Wisconsin-Madison, wanted to scrutinize gene expression profiles of tissue samples to find the genes involved in the differentiation of human embryonic stem cells. The results could help clinical researchers uncover treatments for certain diseases. But it would take 115 years to complete such a project on a single computer core.
Ruotti's run used a virtual cluster of 5000 cores on average, 8000 at peak. It accessed 78 terabytes of genomic data and took a week to complete. As the winner of the BigScience Challenge, Ruotti didn't pay anything, but if he had been paying his way, this dream run would have cost close to $20,000.
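The figures quoted for this run can be cross-checked with some back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes a 365-day year, treats "a week" as exactly 7 days, and takes the $20,000 estimate at face value. The function and variable names are invented for this example and do not come from Cycle Computing or Amazon.

```python
# Rough cross-check of the reported numbers (assumptions: 365-day year,
# "a week" = exactly 7 days, cost estimate of $20,000 taken at face value).
HOURS_PER_YEAR = 365 * 24
HOURS_PER_WEEK = 7 * 24


def core_hours_from_years(single_core_years):
    """Convert years of work on one core into core-hours."""
    return single_core_years * HOURS_PER_YEAR


def average_cores_needed(core_hours, wall_clock_hours):
    """Cores that must run in parallel to finish within the wall-clock time."""
    return core_hours / wall_clock_hours


def cost_per_core_hour(total_cost, core_hours):
    """Implied price of one core for one hour."""
    return total_cost / core_hours


total_core_hours = core_hours_from_years(115)
cores_for_one_week = average_cores_needed(total_core_hours, HOURS_PER_WEEK)
usd_per_core_hour = cost_per_core_hour(20_000, total_core_hours)

print(f"Total work: {total_core_hours:,} core-hours")
print(f"Cores needed to finish in one week: {cores_for_one_week:,.0f}")
print(f"Implied cost: ${usd_per_core_hour:.3f} per core-hour")
```

Under these assumptions the workload comes to about a million core-hours, which would keep roughly 6000 cores busy for a full week. That sits between the reported average (5000) and peak (8000), so the "week" and "115 years" figures are best read as rounded. The implied price is about two cents per core-hour.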
Ruotti could have made use of his university's computing grid, made up of nearly 10,000 cores at the Center for High Throughput Computing. "But to finish this same workload in a week, other users' work would have had to stop entirely, which wasn't practical," Stowe says. Cycle Computing deployed a secure, on-demand cluster located at Amazon Web Services, exclusively for Ruotti, orchestrating the scheduling, data encryption, and technical details. The researcher needed only to bring the genomic data analysis software—and the data.
Cycle Computing isn't the only player on the scene. Last year, as part of its Exacycle project, Google awarded more than 100 million free core-hours to six big science projects. The Exacycle scheduling infrastructure locates idle CPU cores in Google datacenters and uses them to run scientific code. Gardner is one of the lead scientists on one of those projects, the only non-life science project among the chosen six.
Another Exacycle user is Kai Kohlhoff, a research scientist at Google. Kohlhoff's Exacycle project runs dynamic simulations of a class of compounds vital to drug therapies. "Simulations of such larger molecular systems are typically done on a supercomputer like Anton, a distributed computing project or a volunteer 'cloud' like Folding@home," he says. But "even for leading labs, it is difficult to secure a thousand cores or more to use for several months for an individual project at their own institution." With Folding@home, they would have generated a much smaller data set, permitting far fewer insights.
Economics versus ease of use
Cloud computing usually isn't free—making it more expensive than NSF's free XSEDE program. But, Gardner says, the cloud approach gives researchers things they value. To grant time on its computers, NSF requires a lengthy proposal, about 10 pages, which is reviewed by a panel of experts. Writing a good proposal takes months. "For the cloud, on the other hand, all you need is a credit card and off you go," says Gardner, who helps researchers at UW write proposals.
There could be other reasons to choose the costlier option, Gardner points out. High-performance computing systems can be pretty bare bones; commercial clouds may offer better top-layer interfaces. There may be raw capabilities that the researcher needs that are easier to acquire in the cloud, such as access to a database system. At national facilities, researchers have to submit their jobs into a batch queue and wait for resources to become available; with cloud platforms, researchers can get results much faster.
University HPC centers, like Hyak at UW, are not free either, but they, too, are critical pieces of the scientific-computing puzzle. "Let's say you want to run a quick 15-minute test to see if you got rid of a bug. It sucks to have to put your job in the queue and wait, say, 24 hours for it to run," Gardner says. Or, a researcher may need only a few nodes on a cluster—not enough to warrant the use of a Kraken or an Anton. In that case, the local HPC center may be just the thing.
Though ever-larger clusters can be assembled in the cloud with relative ease, cloud platforms are a poor fit for some scientific problems. The problems that work well in the cloud can be run with a high degree of parallelism, without the need for rapid communication between components of the cluster, Google's Hellerstein says. Because they lack high-speed interconnections, cloud clusters are not ideal for problems such as simulating the human brain, where neurons constantly communicate with each other. But workarounds may be possible, he adds, so perhaps these too may someday run on cloud clusters.
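The kind of problem Hellerstein describes, many tasks that never need to talk to each other, can be sketched in a few lines. This is a minimal illustration, not code from any of the projects mentioned: `score_sample` is a hypothetical stand-in for a per-sample analysis, and a local thread pool stands in for the nodes of a cloud cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def score_sample(sample):
    """Hypothetical per-sample analysis: the result depends only on this sample."""
    return sum(sample) / len(sample)


def run_independent_tasks(samples, workers=4):
    # Each sample is processed in isolation. Workers never exchange
    # messages, so adding workers needs no fast interconnect.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(score_sample, samples))


samples = [[1, 2, 3], [10, 20, 30], [5, 5, 5, 5]]
results = run_independent_tasks(samples)
print(results)
```

Because no task waits on another, the work can be spread across as many machines as are available. A brain simulation breaks this pattern: each neuron's next state depends on its neighbors' current states, so workers would have to exchange data at every step, and the speed of the network between them becomes the limiting factor.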
Data in the cloud
As data acquisition becomes simpler, data-storage questions become more pressing. While computer memory is cheap, security and archival issues make the choice of a storage medium important. Cloud platforms, ephemeral though they may seem, offer long-term data storage capabilities. Still, why use the cloud when you can store the data—and back it up—in your own laboratory?
Cloud platforms offer the promise of universal, open access by other researchers. Shared code has similar advantages, allowing scientists to build more rapidly on peers' results, Hellerstein says. He points to the example of the nonprofit Sage Bionetworks, whose mission is to make biomedical research open by convincing researchers to pool genomic and biomedical data in a huge, well-curated database in the cloud. Another example is the National Institutes of Health's online repository of genomic data, GenBank. In some scientific disciplines, like particle physics and astrophysics, the practice of sharing experimental data is more widespread than in other fields.
"The important challenge is doing something more than simply storing the data," says John Quackenbush, who is a professor of computational biology and bioinformatics at Harvard University's Dana-Farber Cancer Institute. He builds integrated databases that bring together distinct but complementary types of data that are relevant to the treatment of cancer. "We need to develop tools and protocols to make the data accessible, useable, and useful in addressing relevant biological questions."
In the long run, this could be the biggest appeal of doing science in the cloud, Hellerstein speculates: the acceleration of scientific discovery via promotion of data sharing and collaboration.