Integrating R and Perl to speed up random sampling, without replacement, of a huge numeric range

I want to randomly select a specific number of rows from a fastq file pair, within a Perl script. After some Googling I was surprised not to find a simple (and scalable) method using Perl alone, at least for potentially huge files. The best suggestion looked to be something I gleaned from PerlMonks, posted around 2009, which involved creating an array of numbers (an index), shuffling it, and then taking what you need. This breaks down if you want to sample from a huge range, even if you only need a small sample (e.g., 1000 random integers from the range (1, 10000000000)). Proposed workarounds to the “without replacement” requirement over a huge range involved generating random numbers and storing them in a hash to check for repeats (see the sketch below), but this gets slower and slower as you sample a larger proportion of the range.
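For illustration, here is a minimal sketch of that hash-based workaround as I understand it (the subroutine name and numbers are my own, not from the PerlMonks thread): keep drawing random integers and let the hash silently reject duplicates until you have enough unique values.

#!/usr/bin/perl
use strict;
use warnings;

# Hash-based rejection sampling: draw random integers in 1..$max and
# skip any we've already seen, until we have $n unique values.
sub sample_without_replacement {
    my ($max, $n) = @_;
    my %seen;
    while (keys %seen < $n) {
        my $candidate = int(rand($max)) + 1;   # random integer in 1..$max
        $seen{$candidate} = 1;                 # duplicates simply overwrite
    }
    return keys %seen;
}

my @sample = sample_without_replacement(10_000_000_000, 1000);
print "@sample[0..9]\n";

This is fine for a small sample, but as the sample size approaches the size of the range the loop spends more and more time re-drawing values it has already seen.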
The ‘sample’ function in R does this very efficiently, and it’s even faster using ‘sample.int,’ if integers are what you’re after. Calling these functions from Perl is pretty simple using the Statistics::R module.
First, let’s see how these work just in R:
sample(1:1000000000, 10)
 [1] 708203128 315330985 962188958  49319866 132041223 226927166 810373363 790653185 876269370 494039577
 
sample.int(1000000000, 10)
 [1] 730459751 994357917  69492291 565544309 450504715 334939544 190179439 218113785 808402368 289432141
To use these functions in Perl you need to install the Statistics::R module, and to use ‘shuffle’ you’ll need List::Util. So let’s compare the suggested Perl-only approach to the two Perl-R approaches:
#!/usr/bin/perl
use strict;
use warnings;
use Statistics::R;
use List::Util 'shuffle';
 
my $max_number = 100000000;
 
# The Perl-only way: build a full index, shuffle it, take the first 10
my $start = time;
my @my_array = 1..$max_number;
my @shuffled_array = shuffle(@my_array);
print "\nShuffle approach:\t", (time - $start),"sec\n";
print "\n@shuffled_array[1..10]\n";
 
# R way 1
$start = time;
my $R = Statistics::R->new();
$R->set('x', $max_number);
$R->run( q'sample_for_perl = sample(1:x, 10)' );
my $Rsample = $R->get('sample_for_perl');
print "\nR sample:\t\t", (time - $start),"sec\n";
print "@$Rsample\n";
$R->stop();
 
# R way 2
$start = time;
my $R2 = Statistics::R->new();
$R2->set('x', $max_number);
$R2->run( q'sample_for_perl = sample.int(x, 10)' );
my $Rsample2 = $R2->get('sample_for_perl');
print "\nR sample.int:\t\t", (time - $start),"sec\n";
print "@$Rsample2\n";
$R2->stop();
 
# Shuffle approach: 70sec
  41313044 28177140 86142374 84745209 32922084 40075115 41773959 71032809 53190908 70537181
 
# R sample:         3sec
  3569784 94379881 22320132 19199157 54432698 53404741 70516998 28210788 96616635 22291399
 
# R sample.int:     0sec
  66635186 41389081 94760016 89810397 22677469 90768414 89380410 69136310 65013711 60463558
And here are some benchmarks over various array sizes:
[Figure: benchmark comparison of the shuffle approach vs. R sample/sample.int across a range of array sizes]
That’s a pretty substantial speed-up! To be fair, if you needed a large sample from a huge range you would still have to load the values into a Perl array after generating them in R, but not having to build and shuffle a huge index is a massive saving in both time and memory (a sketch of how this wires back into the original fastq problem follows below). I’m quite interested in hearing drawbacks to my approach, or better/more modern alternatives in Perl alone.
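Here is a rough sketch of how I would feed the R-generated sample back into the original fastq problem. The file name, read count, and single-file handling are assumptions for illustration; a real fastq pair would need the same set of record numbers applied to both files.

#!/usr/bin/perl
use strict;
use warnings;
use Statistics::R;

my $fastq     = 'reads_R1.fastq';   # hypothetical input file
my $n_records = 50_000_000;         # assumed total number of reads in the file
my $n_wanted  = 1000;

# Draw the record numbers in R, without replacement
my $R = Statistics::R->new();
$R->set('total_reads', $n_records);
$R->set('n_sample', $n_wanted);
$R->run( q'sample_for_perl = sample.int(total_reads, n_sample)' );
my $sampled = $R->get('sample_for_perl');
$R->stop();

# Store the wanted record numbers in a hash for constant-time lookup
my %wanted = map { $_ => 1 } @$sampled;

# Stream the fastq once, printing only the sampled records (4 lines each)
open my $fh, '<', $fastq or die "Cannot open $fastq: $!";
my $record = 0;
while (my $header = <$fh>) {
    $record++;
    my $seq  = <$fh>;
    my $plus = <$fh>;
    my $qual = <$fh>;
    print $header, $seq, $plus, $qual if $wanted{$record};
}
close $fh;

Streaming the file once and checking a hash of wanted record numbers keeps memory use flat no matter how big the fastq is; only the sample itself ever has to live in memory.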