
Handling Large Data with R

The following experiments are inspired by this excellent presentation by Ryan Rosario. R provides many I/O functions for reading and writing data, such as 'read.table' and 'write.table'. With data growing larger by the day, many new methodologies are available for achieving faster I/O operations.

Many solutions (R libraries) are proposed in the presentation above. Here are some benchmarking results with respect to I/O.

Testing bigmemory package
Test Background & Motivation

R works on data held in RAM, which can cause performance issues with large datasets. The bigmemory package creates a variable X <- big.matrix(...), where X is a pointer to a dataset stored in RAM or on the hard drive. Just as in the C world, we create a reference to the object rather than copying it. Because R objects (such as matrices) are accessed through this pointer reference, multiple R processes can share the same memory objects, which allows for memory-efficient parallel analysis.
The bigmemory package mainly uses a binary file format, as opposed to the ASCII/classic format used by the R utils package.

Testing tools

Reading and writing a large matrix using (write.table, read.table) vs. (big.matrix, read.big.matrix):

i. Create a large matrix of random double values.

ii. Write and read the matrix using write.table and read.table.

iii. Write and read the matrix using the bigmemory package.

iv. Testing using
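The steps above can be sketched as follows. This is a minimal illustration, not the exact test harness: the matrix dimensions are reduced (the original .csv was ~1.7 GB), and the file names and the use of tempdir() are assumptions.

```r
library(bigmemory)

# i. A matrix of random doubles (dimensions reduced here; the original
#    test used a matrix whose .csv file was ~1.7 GB on disk)
m <- matrix(rnorm(5000), nrow = 500, ncol = 10)

# ii. Classic ASCII I/O with the utils package
csv <- file.path(tempdir(), "mat.csv")
write.table(m, csv, sep = ",", row.names = FALSE, col.names = FALSE)
t_classic <- system.time(read.table(csv, sep = ","))["elapsed"]

# iii. bigmemory I/O: read.big.matrix parses the .csv once and stores
#      the data in a binary backing file described by a .desc file
t_big <- system.time(
  X <- read.big.matrix(csv, sep = ",", type = "double",
                       backingfile = "BigMem.bin",
                       descriptorfile = "BigMem.desc",
                       backingpath = tempdir())
)["elapsed"]

dim(X)  # 500 x 10 -- X is a pointer to the backing store, not a copy
```

At realistic sizes the gap between t_classic and t_big is what the table below measures; at this toy size the times are too small to be meaningful.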

Test Results

Platform: Dell Precision desktop with Intel Core 2 Quad CPU @ 2.66 GHz, 7.93 GB RAM.


(Table: total elapsed time (sec) for write.table/read.table vs. big.matrix/read.big.matrix, file size on disk (.csv), and computation time saved by bigmemory. Most cell values did not survive extraction; the recoverable figures are a 1.7 GB .csv file on disk and a 55% computation-time saving by bigmemory, which took 23.73 secs.)

Test Discussion

The computation-time results show that bigmemory provides big gains in speed with respect to I/O operations, and the values of the foo data frame are accurate.
The read.big.matrix function creates a .bin file of size 789 MB. This permits storing large objects (matrices etc.) in memory (in RAM) using pointer objects as references; see the 'backingfile' and 'descriptorfile' parameters. When a new R session is loaded, the user reconnects to the data through the descriptor file with attach.big.matrix('BigMem.desc'). This way several R processes can share memory objects via 'call by reference'.
The .desc file is an S4-type object. Compared with the classic approach, this workflow:
i. is faster in computation;
ii. takes less space on the file system;
iii. allows subsequent loading of the data via 'call by reference'.
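The sharing workflow described above can be sketched as follows. The file names match those mentioned in the text, but the matrix dimensions and contents are made up for illustration:

```r
library(bigmemory)

# Create a file-backed big.matrix; the .desc descriptor file is the
# handle other R processes use to attach the same object by reference.
X <- big.matrix(1000, 5, type = "double", init = 0,
                backingfile = "BigMem.bin",
                descriptorfile = "BigMem.desc")
X[1, 1] <- 42

# In a new R session (or another process), re-attach by reference --
# no data is copied; Y is a second pointer to the same backing store:
Y <- attach.big.matrix("BigMem.desc")
Y[1, 1]  # 42
```

Because both X and Y reference the same backing store, a write through one pointer is immediately visible through the other, which is what enables the multi-process sharing described above.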

Testing Higher Order Functions
Test Background & Motivation

Does applying 'apply'-family functions instead of looping constructs save computation time? In this test the for loop is tested against the 'apply' functions.

Testing tools
Test Scenario

i. Set up a large list with some fake data, Sims, containing 100000 5x5 matrices.

ii. Define four functions that each extract the fifth column of every matrix in Sims and create a 100000 x 5 matrix of results.
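The original four function definitions did not survive extraction, so the sketch below is an assumed reconstruction of the scenario (with n reduced from 100000 so it runs quickly): one for-loop version and three 'apply'-family variants that all produce the same n x 5 result.

```r
# i. Fake data: a list of n 5x5 matrices (n reduced from 100000 here)
n <- 1000
Sims <- lapply(seq_len(n), function(i) matrix(rnorm(25), 5, 5))

# ii. Four ways to build the n x 5 matrix of fifth columns:

f_loop <- function(sims) {          # for loop over a pre-allocated result
  out <- matrix(NA_real_, length(sims), 5)
  for (i in seq_along(sims)) out[i, ] <- sims[[i]][, 5]
  out
}

f_sapply <- function(sims)          # sapply, transposed to n x 5
  t(sapply(sims, function(m) m[, 5]))

f_vapply <- function(sims)          # vapply also checks the result shape
  t(vapply(sims, function(m) m[, 5], numeric(5)))

f_rbind <- function(sims)           # row-bind a list of extracted columns
  do.call(rbind, lapply(sims, function(m) m[, 5]))

identical(f_loop(Sims), f_sapply(Sims))  # TRUE -- all four agree
```

Wrapping each call in system.time() reproduces the timing comparison reported below.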

Test Results

Platform: Dell Precision desktop with Intel Core 2 Quad CPU @ 2.66 GHz, 7.93 GB RAM.


(Table: total elapsed time (sec) for each of the four functions; the values did not survive extraction.)


Myth Busting

This article presents some clever workarounds to speed up R computations. The tests here verify whether those statements are myth or fact.
Platform: Dell Precision desktop with Intel Core 2 Quad CPU @ 2.66 GHz, 7.93 GB RAM.

Myth 1:


(Table: elapsed total time to compute (secs); the values did not survive extraction.)

This was verified to be correct. The mean method is slow.
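The statement of Myth 1 itself did not survive extraction. Judging from the conclusion that "the mean method is slow", a plausible reconstruction (an assumption, not the original code) compares mean() against computing the mean arithmetically as sum(x)/length(x):

```r
# Assumed reconstruction of Myth 1: mean() vs. sum(x)/length(x).
# mean() pays for method dispatch and extra checks, so the direct
# arithmetic version tends to be faster when called many times.
x <- rnorm(50000)
m1 <- mean(x)
m2 <- sum(x) / length(x)
all.equal(m1, m2)  # TRUE -- same value, only the timing differs

# Repeat each computation to make the difference measurable:
t_mean  <- system.time(for (i in 1:2000) mean(x))["elapsed"]
t_arith <- system.time(for (i in 1:2000) sum(x) / length(x))["elapsed"]
```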

Myth 2:

x <- rnorm(50000)


(Table: elapsed total time to compute (secs); the values did not survive extraction.)

As N increases, the 'var' method performs equally well; applying the equation directly does not speed things up.
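The exact equation used in the original test was not preserved; assuming it was the textbook unbiased sample variance, the comparison can be sketched as:

```r
# Assumed formula: var(x) vs. the sample-variance equation applied directly.
x <- rnorm(50000)
n <- length(x)
v1 <- var(x)
v2 <- sum((x - mean(x))^2) / (n - 1)
all.equal(v1, v2)  # TRUE -- identical result

# Repeat each computation to make the timings measurable:
t_var <- system.time(for (i in 1:500) var(x))["elapsed"]
t_eq  <- system.time(for (i in 1:500) sum((x - mean(x))^2) / (n - 1))["elapsed"]
```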

Myth 3:
Pre-allocating a vector is faster than assigning 'NA'.


(Table: elapsed total time to compute (secs); the only surviving value is 23.61 secs, 24% faster.)

This was verified to be correct.
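The original loop bodies were not preserved, so the sketch below is an assumed reconstruction of the comparison: growing a vector element by element from a single NA versus pre-allocating it to its full length before the loop.

```r
n <- 10000

grow <- function(n) {      # grow the vector one element at a time
  x <- NA
  for (i in seq_len(n)) x[i] <- sqrt(i)
  x
}

prealloc <- function(n) {  # allocate the full length up front
  x <- numeric(n)
  for (i in seq_len(n)) x[i] <- sqrt(i)
  x
}

identical(grow(n), prealloc(n))  # TRUE -- same result, prealloc is faster
```

Growing forces repeated re-allocation and copying as the vector lengthens, which is where the reported 24% saving comes from.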
