In 2007, quantitative ecologist Karthik Ram sought to find out why certain insect parasites appeared in some sand dunes but not others. Ram, who was a graduate student at the time, thought that asking scientists for field data used in the papers they published was no big deal. But the scientists he e-mailed ignored his requests, so Ram, then at the University of California (UC), Davis, had to collect extra insect samples.
Later, as a postdoc at UC Santa Cruz studying how climate change was affecting vegetative growth, Ram found that colleagues weren’t willing to hand over the raw measurements behind published data, or the algorithms that supported the authors’ conclusions. So Ram spent a year reproducing the data sets so he could use them in his analyses. “I did it all myself, even though I knew that others had done this work before. I personally felt a little bit cheated,” Ram says. “Aren’t research papers meant to be recipes, allowing colleagues to reproduce their conclusions? But usually they’re not. And nobody thought about publishing the code they used at the end of their paper.”
Seven years on, the practice of science is becoming more open, and a culture of sharing preprints, data sets, and scientific code is spreading. Ram, one of the pioneers, is prodding and enabling that shift. In 2011, he and his colleagues created rOpenSci, a platform and repository that boasts dozens of open-source data-and-analysis packages serving fields ranging from climate science to human genetics to vertebrate biology. Today, the Alfred P. Sloan Foundation awarded the project, which operates out of UC Berkeley, a second round of funding, bringing its total funding to $481,000. rOpenSci is one of a growing community of tools (Dryad, Mendeley, figshare, GitHub, and arXiv are others) that help scientists more easily share data and other resources. “We’re trying to bring the culture across disciplines and lower the bar to sharing,” says Ram, today an assistant researcher at Berkeley. “More and more people are seeing the value in sharing their data.”
An evolving culture
For more than a century, the peer-reviewed paper has been the main way that scientists share their work. But in the 1980s and 1990s, respectively, open-science adopters started sharing work via preprint servers: Research Papers in Economics—RePEc—for economists and then arXiv for physicists. As part of the Human Genome Project, the government required researchers to make their genomic data and related code available freely. In other fields, though, sharing drafts of papers—or for that matter, data or code—hasn’t really caught on.
The World Wide Web’s power to host myriad collaborative tools (GitHub, a code-sharing repository platform used by millions of software engineers, is a leading example) has inspired scientific societies, journal publishers, and universities to apply steady pressure for science that’s more open. The National Science Foundation (NSF), the National Institutes of Health, and other agencies in the United States and abroad have data-sharing requirements. Enforcement, though, is spotty.
Scientific publisher the Public Library of Science (PLOS) became a leader in the movement when it announced a new policy in February requiring authors in all its journals to archive the raw data used in PLOS papers. “Data availability allows validation, replication, reanalysis, new analysis, reinterpretation, or inclusion into meta-analyses, and facilitates reproducibility of research,” PLOS editors wrote, adding that sharing would provide “better ‘bang for the buck’” from scientific research.
In frequent talks around the country, Ram hears a lot of skepticism from scientists, he admits. Scientists generally believe that sharing is a good idea in principle, he finds, but in practice many are reluctant. After the PLOS announcement, for example, a firestorm erupted on Twitter under the hashtag #plosfail. The policy would “radically change the way [researchers] do science, at great cost of personnel time,” wrote DrugMonkey, an anonymous blogger. Erin McKiernan, an independent neuroscientist working outside Mexico City, explained that in that country “data acquired are like gold, and it is absolutely crucial that researchers here get as many publications out of one data set as possible.”
Biologist Terry McGlynn at California State University, Dominguez Hills, in Carson fears that other scientists might use data he posts online without collaborating with him. Once sharing data sets “gets as much recognition and credit [as papers] in the academic sphere,” he says, “I’d be a lot more interested in sharing.”
Incentives for sharing
What, then, are the incentives for scientists to share? Hiring and tenure committees, in their deliberations over the fate of faculty, still focus by and large on traditional metrics: publications in high-impact journals, citations, and grants. They continue to generally undervalue shared data sets, methods, and analytical tools—despite the fact that such work is “wickedly important” to the scientific enterprise, as Tom Daniel, former chairman of the Department of Biology at the University of Washington in Seattle, puts it in an e-mail to Science Careers. His university, he says in an interview, is “discussing” rewarding such contributions, but he describes those discussions as “a work in progress.”
Yet, some scientists say that sharing data has paid big professional dividends. Genomicist Casey Bergman, of The University of Manchester in the United Kingdom, says that ever since his postdoc days, his career has “definitely benefited” from shared genomic data and software. So now that he is a faculty scientist, he has made a point of sharing data and resources. After his group utilized a new gene-sequencing tool, they posted some unpublished genome data online. Biotech firm Pacific Biosciences got in touch. The result was a new collaboration and what Bergman calls “an amazing genomics data set” on fruit flies. “Small groups can benefit from embracing open data just as big consortia have in the past,” he says.
In 2012, bioinformaticist C. Titus Brown, of Michigan State University (MSU) in East Lansing, posted a draft paper on a preprint server describing a new sequence-analysis technique. Since then, hundreds of scientists have used it. Even though it isn’t formally published, the technique has been cited in 15 peer-reviewed papers. Brown believes that this and other influential software he has developed, which have led to grants and job offers, should help him get tenure. “My career has developed in large part because I’ve been open about everything,” he writes in an e-mail to Science Careers.
Ecologist Ethan White of Utah State University in Logan believes that sharing can help you win grants because “funding agencies want to know that the money they spend will benefit science as a whole.” Adds Ram: “You collected data via publicly funded work; it’s not yours to hoard forever.”
One of the two top complaints about sharing data is that it takes a lot of time—and indeed it often does. So, how can scientists start sharing their data without too much extra effort? Ram and other sharing gurus have plenty of tips and exhortations.
►Think about sharing before you begin collecting data. Metadata, the background info on the data you’ve collected, is crucial for colleagues to be able to use your data for new research. But creating it is difficult if you wait to do it until long after your experiments or fieldwork, White says. Ram urges scientists to prepare to share from a project’s first day. It need not be drudgery, he says, noting that rOpenSci has a tool that can create automated workflows to continually update metadata for ecology.
►Don’t worry about getting scooped. That’s a very low risk, says evolutionary genomicist Ian Dworkin at MSU. “Young scientists might be worried about releasing data into the wild on a database, concerned that somebody might beat [them] to the punch,” he says. It might happen, but it would be “a fringe case,” he adds.
►Don’t post your data on your personal website. You may change institutions, or the site could be shuttered in the future, Ram says. If that happens, the data could be lost, or you might need to start over again. Your data will be safer, and more visible, if you file it with similar data in a place where it can easily be searched for and found. Many universities offer data repository services (check your institution’s library), and hundreds of research repositories have been established covering a wide range of disciplines. Dryad and figshare are popular repositories that span many fields.
►Don’t try to create your own license, White says. Standard licenses, like one from Creative Commons, can help you avoid the ambiguity created when researchers try to write their own sharing agreements.
►Use common, appropriate tools. rOpenSci was built for scientists who are motivated to share but don’t know how, Ram says. It’s built around data-and-analysis packages that use R, an open-source statistical analysis language that relies on brief lines of written code instead of a series of pull-down menus, allowing for easier sharing and iteration. Learning R, and how to follow rOpenSci’s myriad recipes, requires about a day of training, Ram says. Some packages include actual data; others enable searching of data sets posted on the site, or in full-text journals or other repositories.
►Sharing preprints can prevent embarrassing errors from getting published, Ram says. Posting code you used to generate your figures can allow colleagues to check your work and improve it. “Better to have colleagues catch your mistakes before they appear in a journal,” he says.
►Your work includes products, not just publications. As NSF put it in 2012: “products may include, but are not limited to, publications, data sets, software, patents, and copyrights.” When you post a tool online and someone uses it to generate new results, that’s a citation. Be sure to list it on your CV.
►Link your data, methods, and papers online. Someday, Ram hopes, it will be standard practice for a scientist’s full work on a project—the data, the metadata, the code, and the paper itself—to sit together online, freely and indefinitely.
►Just get started, says Brown. White has written a guide called “Nine simple ways to make it easier to (re)use your data.”
As for his own career, Ram says he’s glad he has become a sharing evangelist. He enjoys helping a wide range of scientists share and discover data online. The new funding from the Sloan Foundation will allow him to add a second colleague to rOpenSci’s paid team and to create new data-and-analysis packages to reach new fields, including several social science fields. “I thought I would follow the script [and] work towards a tenured position in academia,” he says. “I never planned my career to take this path.”