BUSINESS
DEALING WITH THE DATA STAMPEDE
Life sciences research steps up to the storage challenge
RICK MULLIN, C&EN NEW YORK CITY
INFORMATION TECHNOLOGY is a field that thrives on creating its own language. It’s a lingo of clichés and broad generalities in which purveyors of new IT for science and business routinely speak of their products—whether hardware, software, or services—as “solutions” for a particular “space.” Despite the pressure to come up with new iterations of computing products, little happens to refresh the terms, such as “paradigm shift,” that surround them.

Now and then, however, something new or unusual comes along. Witness “elephant flow,” which describes around-the-clock transmission of huge quantities—terabytes—of data. Elephant flow is endemic to the financial and social media sectors. It also has become a force in health care, particularly in drug discovery, where digital laboratory instruments automatically generate enormous volumes of data. Much, but not all, of these data are crucial to the discovery and development of new drugs.

The stampede of information did not show up overnight, certainly, and large research labs have been wrestling for years with the technical and work process challenges of storing, analyzing, and sharing data. Hardware and software vendors also have been hard at work designing products that can handle an indefinite volume of data while providing analysis and storage maintenance.

But as the volume of life sciences data reaches elephantine proportions, change is accelerating. Labs are investigating, and investing in, technology that may fundamentally alter both their protocols for handling data and their core research practices.

At trade shows and in interviews, data managers across the life sciences sector say they are itching to convert basic, if quite large, data storage to data networks that combine storage and analytical intelligence. Some seek to decentralize data monitoring by distributing analytical intelligence on a network capable of indefinite expansion. Others are contemplating a shift from file-based storage to a more Web-surfable technique called object storage, which categorizes data according to tagged attributes. Meanwhile, decisions are being made on achieving an optimum balance between data stored onsite and in the cloud.
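The object storage idea is easy to contrast with file hierarchies in a few lines of code. The Python sketch below is purely illustrative, not any vendor’s API: it models a flat store in which each piece of data carries tagged attributes that can be queried directly, with no directory tree in sight.

```python
# Illustrative sketch of object storage (not any vendor's API):
# each object is stored flat, keyed by an ID, with arbitrary tagged
# attributes that can be queried directly -- no directory tree.
import uuid

class ObjectStore:
    def __init__(self):
        self._objects = {}  # object ID -> (data, tags)

    def put(self, data, **tags):
        """Store a blob with descriptive tags; return its ID."""
        obj_id = str(uuid.uuid4())
        self._objects[obj_id] = (data, tags)
        return obj_id

    def query(self, **criteria):
        """Return IDs of objects whose tags match all criteria."""
        return [obj_id for obj_id, (_, tags) in self._objects.items()
                if all(tags.get(k) == v for k, v in criteria.items())]

store = ObjectStore()
store.put(b"...", instrument="microscope-3", project="retina", year=2015)
store.put(b"...", instrument="sequencer-1", project="trial-07", year=2015)
print(store.query(project="retina"))  # find data by attribute, not by path
```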
“The real driver for doing all this nasty slog work is that the cost of storage is such that we can no longer provide scientists with infinite pools of resources,” says Chris Dagdigian, cofounder of BioTeam, a Middleton, Mass.-based IT consultancy. “It used to be that rather than understanding what’s going on, it was far cheaper and operationally easier just to make the storage bigger and not worry about management.” That approach is no longer working, despite the steady drop in the price of storage, because of the sheer volume of data involved.

George Vacek, business director for life sciences at DataDirect Networks (DDN), a data storage technology supplier in Santa Clara, Calif., agrees, adding that his firm has had “dozens” of discussions over the summer with pharmaceutical companies looking for greater capacity and control in data storage. The technology to make these changes has been available for some time, he says, and resistance to deploying it is now breaking down. “The sector is hitting the transition point where companies feel like deer caught in the headlights as they see this data volume coming down on them,” Vacek says.

And those data are not coming only from genomics research, which is recognized as a key culprit. Vacek notes that high-resolution microscopy and imaging systems, some allowing three-dimensional monitoring over long periods of time, are generating data that take up huge amounts of storage space.
OTHER SYSTEM and service vendors also see the life sciences sector at a tipping point. For instance, the scientific publisher Elsevier has developed a business in consulting and data management. Timothy Hoctor, the firm’s vice president of professional services for life sciences research, says the focus in the sector is shifting from accommodation to analysis. Data storage itself is simple, he says: “You can take a database and dump things into it.” Accessing and analyzing data from huge stores, however, is becoming a serious sticking point.

“What is changing now is the vastness of the data and the differentiated data types that are coming in,” Hoctor says. Genomics data, for example, now reside alongside data on patients involved in clinical trials. “Not only are more and more data generated, but there are more and more potential uses for those data,” he says.

Vendors are beginning to place greater emphasis on data analysis in designing large-volume storage. Seattle-based Qumulo has introduced a system that distributes analysis software across a network of file storage modules. The founders of the three-year-old company are credited with pioneering a technique called scale-out network attached storage (NAS) at a previous company, Isilon, which is now owned by the data storage firm EMC. Scale-out NAS is composed of distributed data storage clusters that share an analysis core. It is currently the state of the art in file-based data storage, a practice that traditionally employed a stack of data storage units to which new capacity couldn’t easily be added.

“They decided to get the band back together,” says Brett Goodwin, vice president of marketing at Qumulo. The company developed what it calls the first “data aware” scale-out NAS product, the Qumulo Scalable File System. Goodwin says the product is available as software that can be installed at each data storage cluster, networking both storage capacity and local data analysis. The net result, he claims, is improved data management and lower-cost storage.

Often with standard scale-out NAS, “the researcher calls the storage manager and says that storage is running slow and asks why,” Goodwin says. “Traditionally, the storage administrator had no ability to answer that question. It could take days to resolve.” Qumulo’s system allows data managers to identify the process that is slowing the network and determine whether it needs to be running.

Traditional scale-out NAS has advanced as well, with products such as InsightIQ, a central data intelligence system from EMC that provides a central window into all of the data storage in a scaled-out cluster. Goodwin says Qumulo is working on its next move—a cloud-compatible version of its product—noting that most large research enterprises will have data networks that coordinate in-house and external cloud storage.

BioTeam’s Dagdigian says the data-aware approach to file storage developed by Qumulo will give researchers “a lot more metrics about where files are, how they are accessed, and who is accessing them, as well as which files have not been touched in a long time. There are a lot of people in the industry, myself included, who simply on the basis of the pedigree of the founders of Qumulo are paying very close attention.”
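The metrics Dagdigian lists are the kind an administrator can approximate today, slowly, by walking the file tree. Below is a minimal Python sketch of such a report, assuming a Unix-like storage mount at a hypothetical path /data and an arbitrary one-year staleness threshold; data-aware systems of the sort he describes collect the same information continuously inside the file system instead.

```python
# Minimal sketch of the file metrics a "data aware" system reports:
# owner, size, and files untouched for a long time. Real products
# gather this inside the file system; walking a large tree like this
# is exactly the slow approach they are meant to replace.
import os
import time

STALE_AFTER = 365 * 24 * 3600  # assumed threshold: one year

def report(root):
    now = time.time()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if now - st.st_atime > STALE_AFTER:
                print(f"stale: {path} (owner uid {st.st_uid}, "
                      f"{st.st_size / 1e9:.2f} GB)")

report("/data")  # point at a storage mount to list archiving candidates
```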
THE UNIVERSITY OF UTAH Scientific Computing & Imaging (SCI) Institute, a greenhouse for research imaging software development, is preparing to convert its ample data storage to the Qumulo system. “The system we are using now has been great,” says Nick Rathke, assistant director of IT at the SCI Institute. “We have had it for four or five years with virtually no downtime. The problem is that we don’t have real-time analytics. Also, you have to buy storage in such large chunks that it’s a real financial bottleneck for us.” Adding smaller increments of storage using the Qumulo system will reduce the cost of storage capacity from about $700 per terabyte to about $400.
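The budget logic is simple enough to put in a few lines. In the sketch below, only the two per-terabyte prices come from Rathke; the chunk sizes are assumptions, chosen to illustrate why buying capacity in small increments at a lower unit price eases the bottleneck he describes.

```python
# Back-of-envelope comparison. Only the per-terabyte prices come from
# the article; the increment sizes are hypothetical.
OLD_PRICE, NEW_PRICE = 700, 400   # dollars per terabyte
need_tb = 50                      # assumed near-term need

# Old model: storage must be bought in large chunks, e.g. 200 TB.
old_chunk = 200
old_cost = OLD_PRICE * old_chunk  # pay for unused capacity up front

# New model: add capacity in increments close to actual need.
new_cost = NEW_PRICE * need_tb

print(f"old: ${old_cost:,} for {old_chunk} TB")  # old: $140,000 for 200 TB
print(f"new: ${new_cost:,} for {need_tb} TB")    # new: $20,000 for 50 TB
```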
DISCIPLINE The University of Utah Scientific Computing & Imaging Institute’s Rathke seeks to keep better track of data.
And the SCI Institute will certainly be adding storage capacity. The center, a federally funded software development institute, focuses on medical imaging and currently provides 32 different software packages. It is staffed by a handful of professional developers and hundreds of graduate students from various departments at the university who, Rathke explains, earn their Ph.D.s and leave their code behind for the SCI Institute to support and maintain.

The institute’s data are mounting in tandem with advances in digital research imaging instrumentation. “Part of our storage problem now is that our faculty members are doing bigger and bigger projects,” Rathke says. “Data sets that a few years ago were a few hundred gigabytes are now a terabyte or tons of terabytes of data.”

One source of the blowup is higher-resolution scanning. “We had one researcher here who did her Ph.D. three or four years ago tracing neurons across a rabbit retina. The rabbit retina set was in the 14–20 terabyte range,” Rathke says. Under new researchers, “that same data set has grown to well over 40 terabytes because they keep adding in the slices, and the tissue samples keep getting thinner and thinner.”

As a federally funded institute, the SCI Institute faces budget as well as storage space limitations, Rathke notes. And cloud storage is not a cost-effective option, given how heavily the institute’s labs engage with their data. A more disciplined approach to managing data is thus required.

“Five or 10 years ago, storage was a big black box,” Rathke says. “We’d say project X, you get X amount of storage, and have at it. Now you have to manage that storage because the projects are getting bigger and bigger and bigger.” The traditional routine of removing five-year-old data no longer makes room for the data payload of new research. The SCI Institute is looking to better monitor its data usage with a strategy for adding new storage capacity. “Our faculty says this is where Qumulo is important to us and is likely to be more and more important to us in the future,” Rathke says.

At the Broad Institute, advances in genomics sequencing have likewise created strains on data storage and management. “Seven or eight years ago, DNA sequencing experienced two simultaneous changes that each increased the volume of data we were seeing by three orders of magnitude,” says Christopher Dwan, acting IT director at the institute. “It became 1,000-fold faster to generate DNA data. It also became 1,000-fold cheaper by the base pair.”
PROTOCOL The Broad Institute’s Dwan emphasizes organization in managing lab data.

The technology available to store and manage the data has also advanced. “The blocking and tackling of making a file system that can store a petabyte is kind of solved,” Dwan says. “But we have to innovate a lot in what we keep and how we keep it.”

WHAT KEEPS DWAN up at night is data organization. “I have a standing joke for when people come to me and ask how to spend less money storing data. I say store less data. We all laugh, but then I ask, ‘Do you know what you’re storing? Do you know what all is in there, and do you really need it?’ ”

Anastasia Christianson, the head of IT for translational R&D at Bristol-Myers Squibb, says her main priority is to ensure that scientists are able to do what they need in the easiest and fastest way possible. Mohammad Shaikh, director of scientific computing services at BMS, adds that he is especially focused on speed. Like their counterparts at other major drug companies, they both face a huge ramp-up in data as they coordinate efforts to deliver a storage and analysis infrastructure for BMS’s research labs.

And like other large pharma companies, BMS is at a “turning point,” Shaikh says, with data generated internally and externally at an increasing rate. “The velocity of those data is such that it is not possible to host it internally,” he says. “We are looking at many options, and cloud storage is the most promising.” The company, which was one of the first users of the Isilon scale-out NAS, has investigated the Qumulo system, Shaikh says.

But much of the work at BMS’s data center focuses on how to use the technology rather than which technology to use. Determining what can be stored in the cloud, for example, is a key consideration. Christianson says a primary determinant is how actively researchers will be accessing files. Data generated from lab instrumentation that are likely to be investigated right away are best stored internally, she says. External data and files shared in collaborations are good candidates for cloud storage.
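That rule of thumb reduces to a short decision function. The Python sketch below is hypothetical, not BMS policy; the attribute names and the 30-day window are assumptions, but the placement logic follows what Christianson describes.

```python
# Hypothetical sketch of the placement rule described above: hot,
# instrument-fresh data stay in-house; shared or external data go to
# the cloud. Attribute names and the 30-day window are assumed.
from dataclasses import dataclass

@dataclass
class Dataset:
    source: str             # "instrument", "external", ...
    shared_externally: bool
    days_since_access: int

def placement(ds: Dataset) -> str:
    if ds.source == "instrument" and ds.days_since_access < 30:
        return "internal"   # likely to be analyzed right away
    if ds.shared_externally or ds.source == "external":
        return "cloud"      # collaboration and external data
    return "internal"       # default: keep close to the labs

print(placement(Dataset("instrument", False, 2)))  # internal
print(placement(Dataset("external", True, 90)))    # cloud
```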
Christianson emphasizes that pharmaceutical IT departments, though taxed, are not in crisis mode over data storage. “I think our systems are continually evolving and have been,” she says. The relationship between data system management and IT oversight at the lab level is also evolving. “Our scientists are a lot more technologically savvy these days than they were five or 10 years ago,” she says, “and our IT professionals, the folks well versed in the technical aspects of data storage and high-performance computing, are much more scientifically aware and business aware.”

Others agree that the relationship between data and research IT management is key. “You definitely need to understand the domain to implement and manage a data management policy,” BioTeam’s Dagdigian says.

“It doesn’t matter if the data manager comes from an IT or science background because the data management can be learned,” Dagdigian says. “What is essential is that the authority for data management stays with the scientists. In my world it is completely inappropriate for an IT person to determine where to store a piece of scientific information, how to store it, or when to delete it.”

As the life sciences research world takes on the data elephant in the room, Vacek of DDN says the relationship between lab and data managers is critical to success. “I think it’s true that the best research is done by organizations that have people who are strong on both the science and the IT,” he says. “But if you can extend your capabilities in IT infrastructure, you can solve problems you weren’t able to solve before, and that puts you in the lead on the research side.” ◾