Integrating R and Perl to speed up random sampling, without replacement, of a huge numeric range
I want to randomly select a specific number of rows from a fastq file pair, within a Perl script. After some Googling I was surprised not to find a simple (and scalable) method using Perl alone, at least for potentially huge files. The best suggestion looked to be something I gleaned on PerlMonks, posted in 2009 I think, which involved creating an array of numbers (an index), shuffling it, and then taking what you need. This breaks down if you want to sample from a huge range, even if you only need a small sample (e.g., 1000 random integers from the range (1, 10000000000)). Proposed workarounds to the “without replacement” requirement with a huge range involved generating random numbers and storing them in a hash to check for repeats, but this gets slower as you sample a larger and larger proportion of the range.
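To make that workaround concrete, here is a minimal sketch of the hash-based approach (the function name and the 1-based range are just illustrative choices): keep drawing random integers, use a hash to throw away repeats, and stop once you have enough unique values. It works well for small samples, but as the sample size approaches the size of the range, more and more draws collide with numbers already seen and the loop slows down badly. Whether Perl's built-in rand has enough resolution to cover a range this large is a separate question.

#!/usr/bin/perl
# Hash-based sampling without replacement: draw, reject repeats, repeat.
use strict;
use warnings;

sub sample_without_replacement {
    my ($range_max, $n) = @_;
    my %seen;
    while (keys %seen < $n) {
        my $candidate = int(rand($range_max)) + 1;   # random integer in 1..$range_max
        $seen{$candidate} = 1;                       # a repeat just overwrites the same key
    }
    return keys %seen;
}

my @draws = sample_without_replacement(10_000_000_000, 1000);
print scalar(@draws), " unique integers drawn\n";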
The ‘sample’ function in R does this very efficiently, and it’s even faster using ‘sample.int,’ if integers are what you’re after. Calling these functions from Perl is pretty simple using the Statistics::R module.
First, let’s see how these work just in R:
sample(1:1000000000, 10)
 [1] 708203128 315330985 962188958  49319866 132041223 226927166 810373363 790653185 876269370 494039577

sample.int(1000000000, 10)
 [1] 730459751 994357917  69492291 565544309 450504715 334939544 190179439 218113785 808402368 289432141
To use these functions in Perl you need to install the Statistics::R module, and to use ‘shuffle’ you’ll need List::Util. So let’s compare the suggested Perl-only approach to the two Perl-R approaches:
#!/usr/bin/perl
use strict;
use warnings;
use Statistics::R;
use List::Util 'shuffle';

my $max_number = 100000000;

# The Perl way
my $start = time;
my @my_array = 1 .. $max_number;
my @shuffled_array = shuffle(@my_array);
print "\nShuffle approach:\t", (time - $start), "sec\n";
print "\n@shuffled_array[1..10]\n";

# R way 1
$start = time;
my $R = Statistics::R->new();
$R->set('x', $max_number);
$R->run(q'sample_for_perl = sample(1:x, 10)');
my $Rsample = $R->get('sample_for_perl');
print "\nR sample:\t\t", (time - $start), "sec\n";
print "@$Rsample\n";
$R->stop();

# R way 2
$start = time;
my $R2 = Statistics::R->new();
$R2->set('x', $max_number);
$R2->run(q'sample_for_perl = sample.int(x, 10)');
my $Rsample2 = $R2->get('sample_for_perl');
print "\nR sample.int:\t\t", (time - $start), "sec\n";
print "@$Rsample2\n";
$R2->stop();

# Shuffle approach:  70sec
# 41313044 28177140 86142374 84745209 32922084 40075115 41773959 71032809 53190908 70537181
#
# R sample:          3sec
# 3569784 94379881 22320132 19199157 54432698 53404741 70516998 28210788 96616635 22291399
#
# R sample.int:      0sec
# 66635186 41389081 94760016 89810397 22677469 90768414 89380410 69136310 65013711 60463558
And here are some benchmarks over various array sizes:
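A loop along the following lines will produce that kind of comparison across a few range sizes; the particular sizes, the Time::HiRes timing, and the output format here are my own choices, not the code behind the original figure.

#!/usr/bin/perl
# Time the shuffle approach against sample.int for several range sizes.
use strict;
use warnings;
use Statistics::R;
use List::Util 'shuffle';
use Time::HiRes qw(gettimeofday tv_interval);

my $R = Statistics::R->new();

for my $max_number (1_000_000, 10_000_000, 100_000_000) {

    # Perl only: build the full index and shuffle it
    my $t0 = [gettimeofday];
    my @shuffled = shuffle(1 .. $max_number);
    my $shuffle_time = tv_interval($t0);

    # Perl + R: let sample.int do the work
    $t0 = [gettimeofday];
    $R->set('x', $max_number);
    $R->run(q'sample_for_perl = sample.int(x, 10)');
    my $Rsample = $R->get('sample_for_perl');
    my $r_time = tv_interval($t0);

    printf "n = %-12d shuffle: %.2fs   sample.int: %.2fs\n",
           $max_number, $shuffle_time, $r_time;
}

$R->stop();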
That’s a pretty substantial speed-up! To be fair, if you needed a large sample from a huge range you would likely need to load these into an array after generating them in R, but it would still be a massive saving in time and memory not having to shuffle. I’m quite interested in hearing drawbacks to my approach, or better/more modern alternatives in Perl alone.
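To round out the original fastq motivation, here is a rough sketch of how the R-generated indices could drive the read selection: draw the record numbers with sample.int, put them in a hash, then stream through the file and keep every 4-line record whose number was drawn. The file name and counts below are placeholders, and it handles a single file rather than a pair, just to keep the sketch short.

#!/usr/bin/perl
# Pick a random subset of reads from a fastq file using indices drawn in R.
use strict;
use warnings;
use Statistics::R;

my $fastq       = 'reads_1.fastq';   # placeholder input file
my $total_reads = 5_000_000;         # number of records in the file (known in advance)
my $sample_size = 1000;

# Draw the record numbers in R, without replacement
my $R = Statistics::R->new();
$R->set('n', $total_reads);
$R->set('k', $sample_size);
$R->run(q'wanted = sample.int(n, k)');
my $wanted = $R->get('wanted');
$R->stop();

my %keep = map { $_ => 1 } @$wanted;

# Stream through the file, printing each 4-line record whose number was drawn
open my $fh, '<', $fastq or die "Cannot open $fastq: $!";
my $record = 0;
while (my $header = <$fh>) {
    $record++;
    my @rest = map { scalar <$fh> } 1 .. 3;   # sequence, '+', and quality lines
    print $header, @rest if $keep{$record};
}
close $fh;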