Tutorial pbdR: Programming with Big Data in R
- Date
- 17 & 18 February 2014
- Place
- Seminar Room 5 (D313) @ Institute of Statistical Mathematics
- Organizer
- URA station (Institute of Statistical Mathematics)
- Speaker
- Professor Dr. George Ostrouchov
(Senior Research Staff Member in the Scientific Data Group of the Computer Science and Mathematics Division at the Oak Ridge National Laboratory and Joint Faculty Professor of Statistics at the University of Tennessee and the Joint Institute for Computational Sciences)
- Program
-
【17 (Monday)】
10:30-12:00  Talk: Elevating R to Supercomputers with Scalable Libraries
13:30-14:25  Tutorial 1: Introduction
14:30-15:25  Tutorial 2: Introduction to MPI and its simplified use via pbdMPI
15:45-16:45  Tutorial 3: Parallel data input

【18 (Tuesday)】
10:00-10:55  Tutorial 4: Introduction to distributed matrices
11:00-11:55  Tutorial 5: Examples using distributed matrix methods
- Abstract
-
Talk: Elevating R to Supercomputers with Scalable Libraries
The pbdR collection of packages elevates a large portion of the R programming language to very large distributed platforms using scalable HPC libraries and an ease-of-programming infrastructure. This infrastructure can be used in combination with other parallel concepts in R to handle truly large data. We have benchmarked our tools on up to 50,000 cores of the University of Tennessee Kraken supercomputer, performing statistical operations on terabytes of data in seconds. This talk will give an overview of the infrastructure and of the concepts used in its development.

Tutorial: pbdR: Programming with Big Data in R
Brief Description:
The tutorial will introduce attendees to high-performance computing concepts for dealing with big data using R, particularly on large distributed platforms. We will describe the use of the "programming with big data in R" (pbdR) package ecosystem (see r-pbd.org) by presenting several examples of varying complexity. Our packages provide infrastructure to use and develop advanced parallel R scripts that scale to tens of thousands of cores on supercomputers, while also offering simple parallel solutions for multicore laptops.
The packages are described in a textbook-style vignette associated with our package pbdDEMO. This tutorial will follow many of the examples presented in that document, which we continue to update.
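As a flavor of the batch SPMD style used throughout the tutorial, here is a minimal sketch of a pbdMPI "hello world" script; the file name and the process count in the launch command are illustrative choices, not part of the tutorial materials.

## minimal_spmd.r -- a minimal SPMD "hello world" with pbdMPI (illustrative sketch)
library(pbdMPI)
init()                            # start communication among the MPI processes

## every rank runs this same script; each asks for its own rank and the total size
msg <- paste("Hello from rank", comm.rank(), "of", comm.size())
comm.print(msg, all.rank = TRUE)  # ordered printing from every rank

finalize()                        # shut MPI down cleanly

## Run in batch from a shell, for example:
##   mpiexec -np 4 Rscript minimal_spmd.r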
Detailed Tutorial Outline:
* Introduction
A quick overview of R's parallel capabilities, briefly discussing some of the merits and downsides of these approaches. Additionally, we will provide a cursory summary of parallel hardware and R's place in this confusing spectrum. We then conclude this portion of the tutorial with a discussion of the major pbdR paradigms, namely batch programming and the single program/multiple data (SPMD) model.
* Introduction to MPI and its simplified use via pbdMPI
A brief introduction to our high-level approach to MPI programming. Most of the time will be spent studying examples where we apply pbdMPI to solve common statistical problems, including Monte Carlo simulation, linear regression, and cluster analysis. This is taught in the SPMD style of parallel programming, which is by far the most common programming style in supercomputing (a minimal Monte Carlo sketch in this style is given after the outline).
* Parallel data input
Truly large data sets must be treated in parallel beginning with data input. We introduce some basic concepts for reading data in parallel with our pbdMPI and pbdNCDF4 packages (see the chunked-input sketch after the outline).
* Introduction to distributed matrices
The pros and cons of using this higher level of abstraction. Data management issues, such as redistributing in-memory data, will be discussed at length.
* Examples using distributed matrix methods
We revisit some of the earlier examples, such as linear regression, and also offer new ones, such as principal components analysis and data redistribution for plotting (see the distributed-matrix sketch after the outline).
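To make the SPMD style of the pbdMPI portion concrete, here is a minimal sketch of a Monte Carlo estimate of pi; the per-rank sample size and the seed are arbitrary illustrative choices rather than values from the tutorial materials.

## Monte Carlo estimate of pi in SPMD style with pbdMPI (illustrative sketch)
library(pbdMPI)
init()

n.local <- 1e6                           # samples per rank; arbitrary choice
comm.set.seed(seed = 1234, diff = TRUE)  # independent random streams per rank

## each rank counts how many of its points fall inside the unit quarter circle
x <- runif(n.local)
y <- runif(n.local)
hits.local <- sum(x * x + y * y <= 1)

## combine the local counts across all ranks, then estimate pi
hits <- allreduce(hits.local, op = "sum")
comm.print(4 * hits / (n.local * comm.size()))  # printed once, from rank 0

finalize()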
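For the parallel data input portion, the pbdNCDF4 interface is not reproduced here; instead the sketch below illustrates the general idea with plain pbdMPI and base R, where each rank reads its own contiguous block of rows from a text file. The file name and the total row count are hypothetical.

## Chunked parallel input: each rank reads only its own block of rows
## (illustrative sketch; "big_data.csv" and its row count are hypothetical)
library(pbdMPI)
init()

total.rows <- 1000000                      # assumed known in advance
chunk <- ceiling(total.rows / comm.size())
first <- comm.rank() * chunk               # 0-based offset of this rank's block
n.read <- min(chunk, total.rows - first)

## skip the header plus the rows owned by lower ranks, then read this rank's block
X.local <- read.csv("big_data.csv", skip = 1 + first, nrows = n.read,
                    header = FALSE)

comm.cat("rank", comm.rank(), "read", nrow(X.local), "rows\n", all.rank = TRUE)
finalize()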
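For the distributed matrix portions, the sketch below hints at how pbdDMAT lets familiar R syntax operate on distributed objects; the matrix sizes are arbitrary, and the methods shown (crossprod, lm.fit, prcomp) are assumed to behave like their serial counterparts, as described in the pbdDEMO vignette.

## Distributed matrix methods with pbdDMAT (illustrative sketch; sizes are arbitrary)
library(pbdDMAT)
init.grid()                              # set up the two-dimensional process grid

## generate random distributed matrices; each rank owns only its local blocks
X <- ddmatrix("rnorm", nrow = 5000, ncol = 50)
y <- ddmatrix("rnorm", nrow = 5000, ncol = 1)

xtx <- crossprod(X)                      # 50 x 50 distributed cross-product
fit <- lm.fit(X, y)                      # least squares on distributed matrices
pca <- prcomp(X)                         # principal components analysis

comm.print(dim(X))                       # global dimensions, printed from rank 0
finalize()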
Background knowledge required and potential attendees:
Basic knowledge of R, a need to handle very large data, and an interest in working on large computing platforms.