Integrating R and Perl to speed up random sampling, without replacement, of a huge numeric range
I want to randomly select a specific number of rows from a fastq file pair, within a Perl script. After some Googling I was surprised not to find a simple (and scalable) method using Perl alone, at least for potentially huge files. The best suggestion looked to be something I gleaned from PerlMonks, posted around 2009, which involved creating an array of numbers (an index), shuffling it, and then taking what you need. This breaks down if you want to sample from a huge range, even when you only need a small sample (e.g., 1000 random integers from the range 1 to 10,000,000,000). Proposed workarounds to the "without replacement" requirement over a huge range involved generating random numbers and storing them in a hash to check for repeats, but this gets slower as you sample a larger and larger proportion of the range (see the sketch below).
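For reference, the hash-based workaround looks roughly like this. This is a minimal sketch of my own, not the PerlMonks code; sample_without_replacement is just an illustrative name:

#!/usr/bin/perl
use strict;
use warnings;

# Draw random integers from 1..$range, rejecting any we've already seen.
# Fine for small samples, but the rejection rate (and therefore the runtime)
# climbs as the requested sample size approaches the size of the range.
sub sample_without_replacement {
    my ($range, $n) = @_;
    my %seen;
    while (keys %seen < $n) {
        my $candidate = int(rand($range)) + 1;
        $seen{$candidate} = 1;    # duplicates simply overwrite themselves
    }
    return keys %seen;
}

my @sample = sample_without_replacement(1_000_000_000, 1000);
print scalar(@sample), " numbers drawn\n";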
The 'sample' function in R does this very efficiently, and it's even faster with 'sample.int' if integers are what you're after (presumably because it skips building the full 1:x index vector in memory). Calling these functions from Perl is pretty simple using the Statistics::R module.
First, let’s see how these work just in R:
sample(1:1000000000, 10)
 [1] 708203128 315330985 962188958  49319866 132041223 226927166 810373363 790653185 876269370 494039577

sample.int(1000000000, 10)
 [1] 730459751 994357917  69492291 565544309 450504715 334939544 190179439 218113785 808402368 289432141
To use these functions in Perl you need to install the Statistics::R module, and to use ‘shuffle’ you’ll need List::Util. So let’s compare the suggested Perl-only approach to the two Perl-R approaches:
#!/usr/bin/perl
use strict;
use warnings;
use Statistics::R;
use List::Util 'shuffle';

my $max_number = 100000000;

# The perl way
my $start = time;
my @my_array = 1..$max_number;
my @shuffled_array = shuffle(@my_array);
print "\nShuffle approach:\t", (time - $start), "sec\n";
print "\n@shuffled_array[1..10]\n";

# R way 1
$start = time;
my $R = Statistics::R->new();
$R->set('x', $max_number);
$R->run( q'sample_for_perl = sample(1:x, 10)' );
my $Rsample = $R->get('sample_for_perl');
print "\nR sample:\t\t", (time - $start), "sec\n";
print "@$Rsample\n";
$R->stop();

# R way 2
$start = time;
my $R2 = Statistics::R->new();
$R2->set('x', $max_number);
$R2->run( q'sample_for_perl = sample.int(x, 10)' );
my $Rsample2 = $R2->get('sample_for_perl');
print "\nR sample.int:\t\t", (time - $start), "sec\n";
print "@$Rsample2\n";
$R2->stop();

# Shuffle approach: 70sec
# 41313044 28177140 86142374 84745209 32922084 40075115 41773959 71032809 53190908 70537181

# R sample: 3sec
# 3569784 94379881 22320132 19199157 54432698 53404741 70516998 28210788 96616635 22291399

# R sample.int: 0sec
# 66635186 41389081 94760016 89810397 22677469 90768414 89380410 69136310 65013711 60463558
And here are some benchmarks over various array sizes:
That's a pretty substantial speed-up! To be fair, if you needed a large sample from a huge range you would probably want to load the numbers into a Perl array after generating them in R (see the sketch below), but not having to shuffle would still be a massive saving in time and memory. I'm quite interested in hearing about drawbacks to my approach, or better/more modern alternatives in Perl alone.
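For instance, something along these lines would pull a larger sample straight back into Perl. This is a minimal sketch assuming the same Statistics::R setup as above; the variable names ($range, $sample_size, idx) are just illustrative:

#!/usr/bin/perl
use strict;
use warnings;
use Statistics::R;

my $range       = 1_000_000_000;   # size of the range to sample from
my $sample_size = 1_000_000;       # how many indices we want

# Generate the whole sample in R, then fetch it back as a Perl array ref.
my $R = Statistics::R->new();
$R->set('n', $range);
$R->set('k', $sample_size);
$R->run( q'idx = sample.int(n, k)' );
my $indices = $R->get('idx');       # array ref of the sampled integers
$R->stop();

# e.g. keep the indices in a hash for O(1) lookups while streaming through
# a fastq file pair and deciding which records to keep
my %wanted = map { $_ => 1 } @$indices;
print scalar(keys %wanted), " indices loaded\n";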
