Integrating R and Perl to speed up random sampling, without replacement, of a huge numeric range
I want to randomly select a specific number of rows from a fastq file pair, within a Perl script. After some Googling I was surprised not to find a simple (and scalable) method using Perl alone, at least for potentially huge files. The best suggestion looked to be something I gleaned on PerlMonks, posted in 2009 I think, which involved creating an array of numbers (an index), shuffling it, and then taking what you need. This breaks down if you want to sample from a huge range, even if you only need a small sample (e.g., 1000 random integers from the range (1, 10000000000)). Proposed workarounds to the “without replacement” requirement with a huge range involved generating random numbers and storing them in a hash to check for repeats, but this gets slower as you sample a larger and larger proportion of the range.
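To make that workaround concrete, here is a minimal sketch of the hash-based approach (the function name and the 1-based range are just illustrative choices): keep drawing random integers, use a hash to throw away repeats, and stop once you have enough unique values. It works well for small samples, but as the sample size approaches the size of the range, more and more draws collide with numbers already seen and the loop slows down badly. Whether Perl's built-in rand has enough resolution to cover a range this large is a separate question.

#!/usr/bin/perl
# Hash-based sampling without replacement: draw, reject repeats, repeat.
use strict;
use warnings;

sub sample_without_replacement {
    my ($range_max, $n) = @_;
    my %seen;
    while (keys %seen < $n) {
        my $candidate = int(rand($range_max)) + 1;   # random integer in 1..$range_max
        $seen{$candidate} = 1;                       # a repeat just overwrites the same key
    }
    return keys %seen;
}

my @draws = sample_without_replacement(10_000_000_000, 1000);
print scalar(@draws), " unique integers drawn\n";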
The ‘sample’ function in R does this very efficiently, and it’s even faster using ‘sample.int,’ if integers are what you’re after. Calling these functions from Perl is pretty simple using the Statistics::R module.
First, let’s see how these work just in R:
sample(1:1000000000, 10)
 [1] 708203128 315330985 962188958  49319866 132041223 226927166 810373363 790653185 876269370 494039577

sample.int(1000000000, 10)
 [1] 730459751 994357917  69492291 565544309 450504715 334939544 190179439 218113785 808402368 289432141
To use these functions in Perl you need to install the Statistics::R module, and to use ‘shuffle’ you’ll need List::Util. So let’s compare the suggested Perl-only approach to the two Perl-R approaches:
#!/usr/bin/perl
use strict;
use warnings;
use Statistics::R;
use List::Util 'shuffle';

my $max_number = 100000000;

# The Perl way
my $start = time;
my @my_array = 1 .. $max_number;
my @shuffled_array = shuffle(@my_array);
print "\nShuffle approach:\t", (time - $start), "sec\n";
print "\n@shuffled_array[1..10]\n";

# R way 1
$start = time;
my $R = Statistics::R->new();
$R->set('x', $max_number);
$R->run(q'sample_for_perl = sample(1:x, 10)');
my $Rsample = $R->get('sample_for_perl');
print "\nR sample:\t\t", (time - $start), "sec\n";
print "@$Rsample\n";
$R->stop();

# R way 2
$start = time;
my $R2 = Statistics::R->new();
$R2->set('x', $max_number);
$R2->run(q'sample_for_perl = sample.int(x, 10)');
my $Rsample2 = $R2->get('sample_for_perl');
print "\nR sample.int:\t\t", (time - $start), "sec\n";
print "@$Rsample2\n";
$R2->stop();

# Shuffle approach:  70sec
# 41313044 28177140 86142374 84745209 32922084 40075115 41773959 71032809 53190908 70537181
#
# R sample:          3sec
# 3569784 94379881 22320132 19199157 54432698 53404741 70516998 28210788 96616635 22291399
#
# R sample.int:      0sec
# 66635186 41389081 94760016 89810397 22677469 90768414 89380410 69136310 65013711 60463558
And here are some benchmarks over various array sizes:
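A loop along the following lines will produce that kind of comparison across a few range sizes; the particular sizes, the Time::HiRes timing, and the output format here are my own choices, not the code behind the original figure.

#!/usr/bin/perl
# Time the shuffle approach against sample.int for several range sizes.
use strict;
use warnings;
use Statistics::R;
use List::Util 'shuffle';
use Time::HiRes qw(gettimeofday tv_interval);

my $R = Statistics::R->new();

for my $max_number (1_000_000, 10_000_000, 100_000_000) {

    # Perl only: build the full index and shuffle it
    my $t0 = [gettimeofday];
    my @shuffled = shuffle(1 .. $max_number);
    my $shuffle_time = tv_interval($t0);

    # Perl + R: let sample.int do the work
    $t0 = [gettimeofday];
    $R->set('x', $max_number);
    $R->run(q'sample_for_perl = sample.int(x, 10)');
    my $Rsample = $R->get('sample_for_perl');
    my $r_time = tv_interval($t0);

    printf "n = %-12d shuffle: %.2fs   sample.int: %.2fs\n",
           $max_number, $shuffle_time, $r_time;
}

$R->stop();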
That’s a pretty substantial speed-up! To be fair, if you needed a large sample from a huge range you would likely need to load these into an array after generating them in R, but it would still be a massive saving in time and memory not having to shuffle. I’m quite interested in hearing drawbacks to my approach, or better/more modern alternatives in Perl alone.
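To round out the original fastq motivation, here is a rough sketch of how the R-generated indices could drive the read selection: draw the record numbers with sample.int, put them in a hash, then stream through the file and keep every 4-line record whose number was drawn. The file name and counts below are placeholders, and it handles a single file rather than a pair, just to keep the sketch short.

#!/usr/bin/perl
# Pick a random subset of reads from a fastq file using indices drawn in R.
use strict;
use warnings;
use Statistics::R;

my $fastq       = 'reads_1.fastq';   # placeholder input file
my $total_reads = 5_000_000;         # number of records in the file (known in advance)
my $sample_size = 1000;

# Draw the record numbers in R, without replacement
my $R = Statistics::R->new();
$R->set('n', $total_reads);
$R->set('k', $sample_size);
$R->run(q'wanted = sample.int(n, k)');
my $wanted = $R->get('wanted');
$R->stop();

my %keep = map { $_ => 1 } @$wanted;

# Stream through the file, printing each 4-line record whose number was drawn
open my $fh, '<', $fastq or die "Cannot open $fastq: $!";
my $record = 0;
while (my $header = <$fh>) {
    $record++;
    my @rest = map { scalar <$fh> } 1 .. 3;   # sequence, '+', and quality lines
    print $header, @rest if $keep{$record};
}
close $fh;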