2013 has been an exciting year for the field of Statistics and Big Data, with the release of the new R version 3.0.0. We discuss a few topics in this area, providing toy examples and supporting code for configuring and using Amazon’s EC2 Computing Cloud. There are other ways to get the job done, of course. But we found it helpful to build the infrastructure on Amazon from scratch, and hope others might find it useful, too.
The term “recent advances” should be placed in context. Most of the fundamental computer science research beneath the technologies discussed here took place long ago. Still, innovation and software development of specific interest to statisticians and data scientists are among the most important and impactful areas of R&D today. Let’s say it together: “Yes, we are sexy!”
This note offers a high-level introduction to some of the recent changes to the R software environment (R Core Team, 2013b) as of version 3.0.0 that relate to high-performance computing. Specifically, updated indexing of vectors addresses a substantial size limitation on native R objects in versions up to and including 2.15.3. Native R objects are still limited to available memory (RAM), however, and many Big Data problems demand more memory than the RAM available on even the best-equipped modern hardware. To help address this problem, we very briefly discuss package bigmemory (Kane and Emerson, 2013). Finally, we present the elegant framework for parallel computing provided by package foreach (Weston and Revolution Analytics, 2012). Toy code examples are provided and were run on Amazon’s Elastic Compute Cloud (EC2) running Ubuntu Linux. This isn’t necessary, of course, so why do it? Because EC2 is relatively easy to use and scalable. Within a matter of minutes, anyone can request and create a cluster of instances that communicate with each other with low latency. A basic “how-to” is provided as supplementary material available online.
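As a rough illustration of the foreach framework mentioned above, the following minimal sketch runs a toy loop in parallel. It is not one of the note's own examples. It assumes the doParallel package as the parallel backend (a choice made here for illustration; the text above does not name a backend) and two local worker processes rather than an EC2 cluster.

```r
# Minimal sketch: a toy parallel loop with foreach.
# Assumption: doParallel is installed and used as the backend;
# the worker count (2) is arbitrary and local, not an EC2 cluster.
library(foreach)
library(doParallel)

cl <- makeCluster(2)      # start two local worker processes
registerDoParallel(cl)    # tell foreach to send %dopar% work to them

# Each iteration runs independently on a worker; .combine = c
# collects the per-iteration results into one numeric vector,
# in the same order as the inputs.
res <- foreach(i = 1:4, .combine = c) %dopar% {
  sqrt(i)
}

stopCluster(cl)           # always release the workers when done
print(res)
```

Replacing %dopar% with %do% runs the same loop sequentially, which is a convenient way to check the loop body before registering a backend.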