
Handling Large Data with R

The following experiments are inspired by this excellent presentation by Ryan Rosario: http://statistics.org.il/wp-content/uploads/2010/04/Big_Memory%20V0.pdf. R provides many I/O functions for reading and writing data, such as 'read.table' and 'write.table' -> http://cran.r-project.org/doc/manuals/R-intro.html#Reading-data-from-files. With data growing larger by the day, many new methodologies have become available to achieve faster I/O operations.

The presentation above proposes several solutions (R libraries). Here are some benchmarking results with respect to I/O.

Testing bigmemory package
Test Background & Motivation

R works on objects held in RAM, which can cause performance issues with large data. The bigmemory package creates a variable X <- big.matrix(...), where X is a pointer to a dataset stored in RAM or on the hard drive. Just as in the C world, we create a reference to the object rather than a copy. Because the data are reached through this pointer reference, multiple R processes can access the same memory object, which allows memory-efficient parallel analysis.
The bigmemory package mainly uses a binary file format, versus the ASCII/classic approach of the R utils package.
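A minimal sketch of the pointer idea, assuming a file-backed matrix (the file names 'BigMem.bin' and 'BigMem.desc' are illustrative choices, not from the test):

```r
library(bigmemory)

# Create a 1000 x 3 file-backed big.matrix of doubles. The data live in a
# binary backing file on disk; X is only a pointer (an S4 object) to them.
X <- filebacked.big.matrix(nrow = 1000, ncol = 3, type = "double",
                           backingfile = "BigMem.bin",
                           descriptorfile = "BigMem.desc")

# Standard matrix indexing works through the pointer.
X[, 1] <- rnorm(1000)
dim(X)
```

Since X is a reference, passing it to another function or process does not copy the underlying data.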

Testing tools

Reading and writing a large matrix using (write.table, read.table) vs (big.matrix, read.big.matrix):
i. Create a large matrix of random double values.

ii. Write and read a large matrix using read.table and write.table.

iii. Write and read a large matrix using bigmemory package

iv. Testing using my.read.lines
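Steps i–iii can be sketched as below; the matrix is kept deliberately small so the code runs in seconds (the benchmarked run used a far larger matrix), and the file names 'm.csv', 'm.bin', 'm.desc' are illustrative. Step iv's my.read.lines comes from the presentation and is not reproduced here.

```r
library(bigmemory)

# i. Create a matrix of random double values (scale nrow up for a real benchmark).
m <- matrix(rnorm(1e4), nrow = 1e3, ncol = 10)

# ii. Classic utils round trip.
t.write <- system.time(write.table(m, "m.csv", sep = ",",
                                   row.names = FALSE, col.names = FALSE))
t.read  <- system.time(foo <- read.csv("m.csv", header = FALSE))

# iii. bigmemory round trip: read.big.matrix parses the CSV once and caches
# it in a binary backing file that later sessions can attach directly.
t.bigread <- system.time(
  X <- read.big.matrix("m.csv", type = "double", header = FALSE,
                       backingfile = "m.bin", descriptorfile = "m.desc"))

c(utils.read = t.read["elapsed"], bigmemory.read = t.bigread["elapsed"])
```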

Test Results

Platform: Dell Precision desktop with Intel Core 2 Quad CPU @ 2.66 GHz, 7.93 GB RAM.

| utils | Total Elapsed Time (sec) | bigmemory | Total Elapsed Time (sec) | File size on disk (.csv) | Computation Time Saved by bigmemory |
|-------|--------------------------|-----------|--------------------------|--------------------------|-------------------------------------|
| write.table | 369.79 | big.matrix | 1.51 | 1.7 GB | 99% |
| read.csv | 313.03 | read.big.matrix | 141.50 | 1.7 GB | 55% |

my.read.lines(filepath) took 23.73 secs
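The my.read.lines function comes from the presentation; its exact code is not reproduced here. The gist is to slurp the whole file in one call and split on newlines, avoiding readLines()'s per-line overhead. A sketch under that assumption ('read.lines.fast' is our illustrative name):

```r
# Read an entire file as one string, then split on newlines.
# (A sketch in the spirit of the presentation's my.read.lines, not its exact code.)
read.lines.fast <- function(fname) {
  n <- file.info(fname)$size
  buf <- readChar(fname, n, useBytes = TRUE)
  strsplit(buf, "\n", fixed = TRUE, useBytes = TRUE)[[1]]
}

tmp <- tempfile()
writeLines(c("alpha", "beta", "gamma"), tmp)
read.lines.fast(tmp)
```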

Test Discussion

The computation-time results show that bigmemory provides big gains in I/O speed, and the values of the foo data frame round-trip accurately.
The read.big.matrix function creates a .bin backing file of size 789 MB. This permits storing large objects (matrices, etc.) in memory (RAM) using pointer objects as references; see the 'backingfile' and 'descriptorfile' parameters. When a new R session starts, the user reconnects to the data via the descriptor file: attach.big.matrix('BigMem.desc'). This way several R processes can share memory objects via 'call by reference'.
The .desc file holds an S4-type object -> https://github.com/hadley/devtools/wiki/S4
Advantages:
i. Faster computation.
ii. Takes less space on the file system.
iii. Subsequent loading of the data can be achieved using 'call by reference'.
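Sharing by reference can be sketched as follows (the file names 'Shared.bin' and 'Shared.desc' are illustrative; in practice the attach call would typically run in a separate R process):

```r
library(bigmemory)

# First session: create a file-backed matrix, write to it, drop the pointer.
X <- filebacked.big.matrix(nrow = 100, ncol = 2, type = "double",
                           backingfile = "Shared.bin",
                           descriptorfile = "Shared.desc")
X[1, 1] <- 42
rm(X)

# A new R process (or the same one later) attaches the existing backing
# file through the descriptor -- no data are copied or re-parsed.
Y <- attach.big.matrix("Shared.desc")
Y[1, 1]   # 42
```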



Testing Higher Order Functions
Test Background & Motivation

Does using apply-family functions instead of looping constructs save computation time? In this test, the for loop is tested against the 'apply' functions.

Testing tools
Test Scenario

i. Set up a large list with some fake data, Sims, containing 100000 5x5 matrices.

ii. Define four functions that extract the fifth column of each matrix in Sims and create a 100000 x 5 matrix of results.
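The source does not list the four implementations, so the versions below are plausible stand-ins (a pre-allocated for loop plus three apply-family variants), with Sims shrunk to 1,000 matrices so the example runs quickly:

```r
# Fake data: a list of 5 x 5 matrices (100000 in the real test).
Sims <- lapply(seq_len(1000), function(i) matrix(rnorm(25), nrow = 5))

# Four ways to collect the fifth column of every matrix into an n x 5 matrix.
f.loop <- function(sims) {            # explicit for loop, pre-allocated result
  out <- matrix(NA_real_, nrow = length(sims), ncol = 5)
  for (i in seq_along(sims)) out[i, ] <- sims[[i]][, 5]
  out
}
f.sapply <- function(sims) t(sapply(sims, function(m) m[, 5]))
f.vapply <- function(sims) t(vapply(sims, function(m) m[, 5], numeric(5)))
f.rbind  <- function(sims) do.call(rbind, lapply(sims, function(m) m[, 5]))

# All four agree on the result.
res <- list(f.loop(Sims), f.sapply(Sims), f.vapply(Sims), f.rbind(Sims))
sapply(res, function(r) isTRUE(all.equal(r, res[[1]])))   # TRUE TRUE TRUE TRUE
```

Wrapping each call in system.time() reproduces the comparison in the table below.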

Test Results

Platform: Dell Precision desktop with Intel Core 2 Quad CPU @ 2.66 GHz, 7.93 GB RAM.

| Function | Total Elapsed Time (sec) |
|----------|--------------------------|
|          | 24.22 |
|          | 37.55 |
|          | 52.92 |
|          | 63.34 |


Myth Busting

This article presents some clever workarounds to speed up R computations -> http://www.r-bloggers.com/speeding-up-r-computations/. The tests here verify whether its statements are myth or fact.
Platform: Dell Precision desktop with Intel Core 2 Quad CPU @ 2.66 GHz, 7.93 GB RAM.

Myth 1:

| Method | Elapsed total time to compute (sec) |
|--------|-------------------------------------|
|        | 129.75 |
|        | 60.77 |
|        | 60.72 |

This was verified to be correct. The mean method is slow.
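The method names in the table did not survive. Assuming the usual comparison from the linked article, mean() against the arithmetic identity sum(x)/length(x), the shape of the test is:

```r
set.seed(42)
x <- rnorm(1e6)

# mean() pays for S3 dispatch plus an internal correction pass; the plain
# arithmetic identity does the same job with less overhead.
t.mean <- system.time(for (i in 1:100) a <- mean(x))["elapsed"]
t.sum  <- system.time(for (i in 1:100) b <- sum(x) / length(x))["elapsed"]

all.equal(a, b)   # TRUE -- same result, different cost
```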

Myth 2:

x <- rnorm(50000)

| Method | Elapsed total time to compute (sec) |
|--------|-------------------------------------|
|        | 36 |
|        | 33.67 |

As N increases, the 'var' method performs equally well; applying the equation by hand does not speed things up.
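Assuming the 'equation' here is the textbook variance estimator, the comparison looks like this (the iteration count is an illustrative choice):

```r
set.seed(7)
x <- rnorm(50000)

# Built-in var() versus the hand-written estimator sum((x - xbar)^2) / (n - 1).
t.var <- system.time(for (i in 1:200) v1 <- var(x))["elapsed"]
t.eq  <- system.time(
  for (i in 1:200) v2 <- sum((x - mean(x))^2) / (length(x) - 1))["elapsed"]

all.equal(v1, v2)   # TRUE -- numerically the same estimate
```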

Myth 3:
Pre-allocating a vector is faster than assigning 'NA'.

| Method | Elapsed total time to compute (sec) |
|--------|-------------------------------------|
|        | 31.08 |
|        | 23.61 (24% faster) |

This was verified to be correct.
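The exact variants compared are not listed in the table; a plausible sketch contrasts growing a vector inside the loop with pre-allocating it to its final length:

```r
n <- 1e5

grow <- function(n) {
  v <- numeric(0)
  for (i in 1:n) v[i] <- i      # vector is extended as the loop runs
  v
}
prealloc <- function(n) {
  v <- numeric(n)               # full length reserved once, up front
  for (i in 1:n) v[i] <- i
  v
}

identical(grow(n), prealloc(n))   # TRUE -- same result; prealloc avoids
                                  # repeated reallocation inside the loop
```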
