[SPONSORED CONTENT] In biomedical research it’s accelerate or perish. Drug discovery is a trial-and-error process driven by simulation: faster simulations, enabled by compute- and data-intensive technologies, mean more runs in less time, so errors are found and solutions reached sooner.
Established in 1946, the Oklahoma Medical Research Foundation (OMRF) is a nonprofit research institute with more than 450 staff and over 50 labs studying cancer, heart disease, autoimmune disorders and aging-related diseases. OMRF discoveries led to the first U.S.-approved therapy targeting sickle cell disease and the first approved treatment for neuromyelitis optica spectrum disorder, an autoimmune disease. The foundation’s research is enabled in part by advanced technology: accelerated clusters and high-performance data storage that support workloads fueled by massive data sets.
Keeping pace with advances in biomedical research requires regular updates to the foundation’s technology infrastructure, and for that, the organization since 2015 has partnered with Silicon Mechanics to harness HPC-class compute combined with bigger, faster, I/O-intensive data storage. It’s a partnership that exemplifies Silicon Mechanics’ “Expert Included” customer relationship ethic.
The latest refresh began in 2020, when rising data volumes were overtaxing the existing system, creating bottlenecks that dragged down performance. By early last year, Silicon Mechanics and OMRF had begun working in partnership on a new research computing infrastructure delivering a 10x improvement in job run times.
Among the enhancements: one job’s run time dropped from 70 days to seven, while another common analysis workflow fell from 12 hours to two. The infrastructure supports multiple simultaneous research initiatives without one job or workload degrading another, the foundation reports.
The result: faster turnaround times for OMRF scientists, accelerating the next stage of their research.
“OMRF is in a really compute- and data-intensive environment,” said a Silicon Mechanics representative. “They have to store and retrieve huge amounts of data efficiently from high-throughput lab instruments, such as NGS (Next-Gen Sequencing) and microscopy tools. And they support mixed workloads that vary greatly, from jobs with a few very large files (hundreds of gigabytes), to jobs with many very large files, to jobs with thousands of small (<1 MB) files. They needed a high-performance storage solution that could keep up.”
When the Silicon Mechanics-OMRF team came together and reviewed options for a new infrastructure, they initially looked at all-flash strategies using direct-attached or network-attached storage (DAS, NAS), but determined that other approaches offered lower cost and better performance. In the end, they chose Silicon Mechanics’ Triton Big Data Cluster, an HPC/AI software-defined reference architecture. Triton combines HPC-class Silicon Mechanics servers with the massively parallel Weka file system (WekaFS) from WekaIO. Its storage is a shared pool of NVMe flash accessed via NVMe over Fabrics (NVMe-oF), which extends the low-latency NVMe protocol (normally carried between flash devices and CPUs over the PCIe high-speed interface) across the cluster network. Triton’s scale-up and scale-out capabilities use pre-defined storage building blocks with capacity for 425TB of data per node.
The new OMRF infrastructure uses servers powered by AMD EPYC 7002 Series CPUs, while cluster networking is handled by NVIDIA 100Gb/s InfiniBand as the system interconnect.
For advanced storage, the front end of the system is all NVMe flash, with a second tier of S3-compatible object storage on spinning disk from Scality. Data can be migrated between the two tiers for cost-effective storage flexibility.
“You can have anywhere from a few hundred terabytes up to multiple petabytes of all-NVMe flash for immediate access and processing, and then you can store your colder data on an S3-compatible object store,” said the Silicon Mechanics representative. “They can move data between tiers if they need to re-analyze data or run new jobs on that data.”
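WekaFS manages this tiering transparently, but as a simple illustration of what recalling cold data for re-analysis can look like, here is a minimal sketch that pulls an archived object back from an S3-compatible store such as Scality onto the fast shared filesystem. The endpoint, credentials, bucket and paths are hypothetical, not OMRF’s actual configuration.

```python
# Illustrative only: recall an archived dataset from an S3-compatible
# object store so it can be re-analyzed on the hot NVMe tier.
# Endpoint, credentials, bucket and paths are hypothetical.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.org",  # S3-compatible endpoint (assumption)
    aws_access_key_id="EXAMPLE_KEY",
    aws_secret_access_key="EXAMPLE_SECRET",
)

# Copy the cold object down to the fast shared filesystem for processing.
s3.download_file(
    Bucket="archive",                        # hypothetical bucket
    Key="ngs/run-2021-03/sample_042.bam",    # hypothetical object key
    Filename="/mnt/weka/scratch/sample_042.bam",
)
```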
OMRF pulls data from a number of lab instruments, devices that can stream images and video at very high rates. “Because of that, they needed to capture all of that data without losing anything, and they needed to support a multitude of devices simultaneously,” the representative said. “But as OMRF added more equipment, the old storage system wasn’t able to keep up with the increasing data.”
Which meant that the hot (flash) tier of OMRF’s new system required accelerated processing.
This is where Weka storage came into play. The WekaFS hot tier interfaces with GPUs and NVMe storage at very high throughput, streaming data from the NVMe devices to the compute nodes for analysis at a much higher rate than many other storage arrays.
Meanwhile, data not in immediate demand can be stored on the S3-compatible spinning-disk tier, a less expensive storage medium than flash.
Silicon Mechanics has worked with OMRF for years on various HPC systems, so it knew the massively parallel architecture of WekaFS and its tiering to object storage would be a strong fit for the foundation. By designing and building an HPC compute infrastructure and combining it with an integrated storage component like WekaIO’s, Silicon Mechanics met OMRF’s needs with storage technology offering tremendous I/O throughput, well matched to the foundation’s workload requirements.
An advantage of the Triton reference architecture is that it has been pre-engineered, tested and optimized before installation, enabling more time and focus to be spent on customizing the design for specific workload and organizational needs.
In addition, having WekaFS within the Triton architecture simplifies research workflows by eliminating the staging of data into and out of a compute node’s local SSD, a complex and slow process. In effect, research outcomes are no longer hampered by a node’s limited local SSD capacity. WekaFS acts as a front end, enabling quick access to the object archive tier so that applications have access to all the data.
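As a simplified sketch of the difference (the paths and the analyze() step are hypothetical, not OMRF’s actual pipeline), the old stage-in/stage-out pattern versus direct access on a shared WekaFS mount might look like this:

```python
# Illustrative sketch only; paths and analyze() are hypothetical.
import shutil

def analyze(input_path, output_path):
    """Placeholder for the real analysis step."""
    ...

# Old pattern: stage data onto a compute node's limited local SSD,
# run the job, then stage results back out to shared storage.
shutil.copy("/shared/archive/sample_042.bam", "/local/scratch/sample_042.bam")
analyze("/local/scratch/sample_042.bam", "/local/scratch/results.csv")
shutil.copy("/local/scratch/results.csv", "/shared/results/sample_042.csv")

# New pattern: read and write directly on the shared WekaFS mount,
# which removes the staging steps and the local capacity ceiling.
analyze("/mnt/weka/archive/sample_042.bam", "/mnt/weka/results/sample_042.csv")
```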
“OMRF scientists no longer have to deal with storage management and the data life cycle,” said the Silicon Mechanics representative. “Their old workflow required staging data to a compute node’s local SSD for better application performance. Now that they have enough performance and expandable capacity, they don’t have to think about it.”
At the heart of this powerful infrastructure, of course, are the microprocessors powering the Silicon Mechanics hardware that runs OMRF research workloads. Silicon Mechanics selected AMD EPYC 7002 Series CPUs because they offer several key advances in how HPC workloads are processed.
“These AMD processors use fourth-generation PCIe technology, which supports a 16 GT/s bit rate,” said the Silicon Mechanics representative. “AMD has more PCIe slots and more PCIe lanes, which can really help in building a balanced system. This gives you more flexibility for architectural options; you can add lots of high-bandwidth devices that solution architects like me used to only dream about.”
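For context, the quoted 16 GT/s per-lane rate translates to roughly the following theoretical peak bandwidth, before protocol overhead:

\[
16\ \text{GT/s} \times \tfrac{128}{130} \div 8 \approx 1.97\ \text{GB/s per lane (each direction)}, \qquad 16 \times 1.97 \approx 31.5\ \text{GB/s for an x16 slot}
\]

That is roughly double the ~15.75 GB/s of a PCIe 3.0 x16 slot, which is part of what makes it practical to attach more 100Gb/s network adapters and NVMe devices to a single server.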
A key to the success of the new research infrastructure is the joint experience of OMRF and Silicon Mechanics working together; the vendor is well versed in the foundation’s unique workloads and organizational requirements.
“Having a trusted advisor and a partner that will take the time to look at what they need, that’s what it’s all about,” said the representative. “They only get this money once every year-and-a-half or so. It’s critical they spend it correctly. Together, we can look at a lot of options, and they put a lot of trust in us to give them good advice.”
For customer support on day-to-day matters, “it really is the fact that they can pick up the phone and talk to people with decades of experience in HPC without going through a phone tree,” the representative said. “OMRF can talk with technical people who have worked with them for more than five years. This isn’t a place where reps come and go. In this industry, you’d be surprised how many people will trade emails back and forth to death versus picking up the phone and talking through the problem with a customer. A lot of times, customers just want to talk to somebody instead of trading emails. It’s really that simple.”