# Count how many times each value appears in @array
my @array = qw(foo bar foo bar baz foo baz bar foo);
my %counts = ();
for (@array) {
    $counts{$_}++;
}
foreach my $key (keys %counts) {
    print "$key = $counts{$key}\n";
}
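For the @array above this prints (hash keys come back in no particular order):
foo = 4
bar = 3
baz = 2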
Wednesday, October 22, 2014
Unique values in an array in Perl
In this part of the Perl tutorial we are going to see how to
make sure we only have distinct values in an array.
Perl 5 does not have a built-in function to filter out duplicate values from an array, but there are several solutions to the problem.
List::MoreUtils
Depending on your situation, probably the simplest way is to use the uniq function of the List::MoreUtils module from CPAN.
- use List::MoreUtils qw(uniq);
- my @words = qw(foo bar baz foo zorg baz);
- my @unique_words = uniq @words;
A full example is this:
- use strict;
- use warnings;
- use 5.010;
- use List::MoreUtils qw(uniq);
- use Data::Dumper qw(Dumper);
- my @words = qw(foo bar baz foo zorg baz);
- my @unique_words = uniq @words;
- say Dumper \@unique_words;
The result is:
$VAR1 = [ 'foo', 'bar', 'baz', 'zorg' ];
For added fun the same module also provides a function called distinct, which is just an alias of the uniq function.
In order to use this module you'll have to install it from CPAN.
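If you have never installed a CPAN module before, one common way (assuming the stock cpan client, or App::cpanminus if you have it installed) is:
cpan List::MoreUtils
# or, with cpanminus
cpanm List::MoreUtils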
Home made uniq
If you cannot install the above module for whatever reason, or if you think the overhead of loading it is too big, there is a very short expression that will do the same:
- my @unique = do { my %seen; grep { !$seen{$_}++ } @data };
This, of course, can look cryptic to someone who does not know it already, so it is recommended to define your own uniq subroutine and use that in the rest of the code:
- use strict;
- use warnings;
- use 5.010;
- use Data::Dumper qw(Dumper);
- my @words = qw(foo bar baz foo zorg baz);
- my @unique = uniq( @words );
- say Dumper \@unique;
- sub uniq {
- my %seen;
- return grep { !$seen{$_}++ } @_;
- }
Home made uniq explained
I can't just throw this example here and leave it like that. I'd better explain it. Let's start with an easier version:
- my @unique;
- my %seen;
- foreach my $value (@words) {
- if (! $seen{$value}) {
- push @unique, $value;
- $seen{$value} = 1;
- }
- }
Here we are using a regular foreach loop to go over the values in the original array, one by one. We use a helper hash called %seen. The nice thing about hashes is that their keys are unique.
We start with an empty hash, so when we encounter the first "foo", $seen{"foo"} does not exist yet and thus its value is undef, which is considered false in Perl. That means we have not seen this value before, so we push it to the end of the new @unique array where we are going to collect the distinct values.
We also set the value of $seen{"foo"} to 1. Actually any value would do as long as it is considered "true" by Perl.
The next time we encounter the same string we already have that key in the %seen hash and its value is true, so the if condition will fail, and we won't push the duplicate value in the resulting array.
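To make the bookkeeping concrete, here is a hand trace (not program output) of how %seen and @unique evolve for @words = qw(foo bar baz foo zorg baz):
- # value  already in %seen?  action
- # foo    no                 push 'foo', set $seen{foo} = 1
- # bar    no                 push 'bar', set $seen{bar} = 1
- # baz    no                 push 'baz', set $seen{baz} = 1
- # foo    yes                skip
- # zorg   no                 push 'zorg', set $seen{zorg} = 1
- # baz    yes                skip
At the end @unique holds ('foo', 'bar', 'baz', 'zorg'), just like the List::MoreUtils version.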
Shortening the home made unique function
First of all, we replace the assignment $seen{$value} = 1; by the post-increment operator $seen{$value}++. This does not change the behavior of the previous solution - any positive number is evaluated as TRUE - but it allows us to include the setting of the "seen" flag within the if condition. It is important that this is a postfix increment (and not a prefix increment), as the increment only takes place after the boolean expression has been evaluated. The first time we encounter a value the expression will be TRUE; on every later encounter it will be FALSE.
- my @unique;
- my %seen;
- foreach my $value (@data) {
- if (! $seen{$value}++ ) {
- push @unique, $value;
- }
- }
This is shorter, but we can do even better.
Filtering duplicate values using grep
The grep function in Perl is a generalized form of the well-known grep command of Unix. It is basically a filter: you provide an array on the right-hand side and an expression in the block. The grep function takes each value of the array one by one, puts it in $_, the default scalar variable of Perl, and then executes the block. If the block evaluates to TRUE, the value can pass; if the block evaluates to FALSE, the current value is filtered out.
That's how we got to this expression:
- my %seen;
- my @unique = grep { !$seen{$_}++ } @words;
Wrapping it in 'do' or in 'sub'
The last little thing we have to do is wrap the above two statements in either a do block:
- my @unique = do { my %seen; grep { !$seen{$_}++ } @words };
or, better yet, in a function with an expressive name:
- sub uniq {
- my %seen;
- return grep { !$seen{$_}++ } @_;
- }
Home made uniq - round 2
Prakash Kailasa suggested an even shorter version of implementing uniq, for perl version 5.14 and above, if there is no requirement to preserve the order of elements.
Inline:
- my @unique = keys { map { $_ => 1 } @data };
or within a subroutine:
- my @unique = uniq(@data);
- sub uniq { keys { map { $_ => 1 } @_ } };
Let's take this expression apart:
- map { $_ => 1 } @data
map has a similar syntax to grep: a block and an array (or a list of values). It goes over all the elements of the array, executes the block and passes the result to the left.
In our case, for every value in the array it will pass the value itself followed by the number 1. Remember, => (a.k.a. the fat comma) is just a comma. Assuming @data has ('a', 'b', 'a') in it, this expression will return ('a', 1, 'b', 1, 'a', 1).
If we assigned that expression to a hash, we would get the original data as keys, each with 1 as its value. Try this:
- use strict;
- use warnings;
- use Data::Dumper;
- my @data = qw(a b a);
- my %h = map { $_ => 1 } @data;
- print Dumper \%h;
and you will get:
$VAR1 = { 'a' => 1, 'b' => 1 };
If, instead of assigning it to a hash, we wrap the above expression in curly braces, we get a reference to an anonymous hash:
- { map { $_ => 1 } @data }
Let's see it in action:
- use strict;
- use warnings;
- use Data::Dumper;
- my @data = qw(a b a);
- my $hr = { map { $_ => 1 } @data };
- print Dumper $hr;
This will print the same output as the previous one, barring any change in order in the dumping of the hash.
Finally, starting from perl version 5.14, we can call the keys function on hash references as well. Thus we can write:
- my @unique = keys { map { $_ => 1 } @data };
and we'll get back the unique values from @data.
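If you would rather not rely on that (calling keys on a hash reference was an experimental feature), the same idea works by dereferencing the anonymous hash explicitly with %{ ... }:
- my @unique = keys %{ { map { $_ => 1 } @data } };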
Exercise
Given the following file, print out the unique values:
input.txt:
foo Bar bar first second Foo foo another foo
expected output:
foo Bar bar first second Foo another
Exercise 2
This time filter out duplicates regardless of case.
expected output:
foo Bar first second another
Written by Gabor Szabo
Published on 2012-09-20
sort hash by value and return the associated key
If instead you want to sort by numeric hash values, you'd write:
foreach my $key (sort { $hash{$a} <=> $hash{$b} } keys %hash) { ... }  # do something with each $key
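A minimal, self-contained sketch, using the counts produced by the snippet at the top of the page as sample data:
my %hash = (foo => 4, bar => 3, baz => 2);
foreach my $key (sort { $hash{$a} <=> $hash{$b} } keys %hash) {
    print "$key = $hash{$key}\n";   # prints baz, then bar, then foo (smallest value first)
}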
Friday, April 11, 2014
One-line commands
Basic awk & sed
From: https://github.com/stephenturner/oneliners
Extract fields 2, 4, and 5 from file.txt:
awk '{print $2,$4,$5}' input.txt
Print each line where the 5th field is equal to ‘abc123’:
awk '$5 == "abc123"' file.txt
Print each line where the 5th field is not equal to ‘abc123’:
awk '$5 != "abc123"' file.txt
Print each line whose 7th field matches the regular expression:
awk '$7 ~ /^[a-f]/' file.txt
Print each line whose 7th field does not match the regular expression:
awk '$7 !~ /^[a-f]/' file.txt
Get unique entries in file.txt based on column 2 (takes only the first instance):
awk '!arr[$2]++' file.txt
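For example, with a hypothetical file.txt like this, only the first line for each distinct value in column 2 survives:
$ cat file.txt
s1 geneA 10
s2 geneB 12
s3 geneA 7
$ awk '!arr[$2]++' file.txt
s1 geneA 10
s2 geneB 12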
Print rows where column 3 is larger than column 5 in file.txt:
awk '$3>$5' file.txt
Sum column 1 of file.txt:
awk '{sum+=$1} END {print sum}' file.txt
Compute the mean of column 2:
awk '{x+=$2}END{print x/NR}' file.txt
Number each line in file.txt:
sed = file.txt | sed 'N;s/\n/ /'
Replace all occurrences of foo with bar in file.txt:
sed 's/foo/bar/g' file.txt
Trim leading whitespace in file.txt:
sed 's/^[ \t]*//' file.txt
Trim trailing whitespace in file.txt:
sed 's/[ \t]*$//' file.txt
Trim leading and trailing whitespace in file.txt:
sed 's/^[ \t]*//;s/[ \t]*$//' file.txt
Delete blank lines in file.txt:
sed '/^$/d' file.txt
Return all lines on Chr 1 between 1MB and 2MB in file.txt (assumes chromosome in column 1 and position in column 3; the same concept can be used to return only variants that are above specific allele frequencies):
cat file.txt | awk '$1=="1"' | awk '$3>=1000000' | awk '$3<=2000000'
Basic sequence statistics. Print total number of reads, total number of unique reads, percentage of unique reads, most abundant sequence, its frequency, and percentage of total in file.fq:
cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'
Convert .bam back to .fastq:
samtools view file.bam | awk 'BEGIN {FS="\t"} {print "@" $1 "\n" $10 "\n+\n" $11}' > file.fq
Keep only top bit scores in blast hits (best bit score only):
awk '{ if(!x[$1]++) {print $0; bitscore=($14-1)} else { if($14>bitscore) print $0} }' blastout.txt
Keep only top bit scores in blast hits (5 less than the top):
awk '{ if(!x[$1]++) {print $0; bitscore=($14-6)} else { if($14>bitscore) print $0} }' blastout.txt
Split a multi-FASTA file into individual FASTA files:
awk '/^>/{s=++d".fa"} {print > s}' multi.fa
Convert a FASTQ file to FASTA:
sed -n '1~4s/^@/>/p;2~4p' file.fq > file.fa
Extract every 4th line starting at the second line (extract the sequence from a FASTQ file):
sed -n '2~4p' file.fq
Count the number of unique lines in file.txt:
cat file.txt | sort | uniq | wc -l
Find the number of lines shared by 2 files:
sort file1 file2 | uniq -d
Sort numerically (with logs) (g) by column (k) 9:
sort -gk9 file.txt
Find the most common strings in column 2:
cut -f2 file.txt | sort | uniq -c | sort -k1nr | head
Pick 10 random lines from a file:
shuf file.txt | head -n 10
Print all possible 3mer DNA sequence combinations:
echo {A,C,T,G}{A,C,T,G}{A,C,T,G}
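For reference, the shell's brace expansion produces all 64 (4^3) combinations on a single line, starting with:
$ echo {A,C,T,G}{A,C,T,G}{A,C,T,G}
AAA AAC AAT AAG ACA ACC ACT ACG ...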
Untangle an interleaved paired-end FASTQ file. If a FASTQ file has paired-end reads intermingled, and you want to separate them into separate /1 and /2 files, and assuming the /1 reads precede the /2 reads:
cat interleaved.fq | paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > deinterleaved_1.fq) | cut -f 5-8 | tr "\t" "\n" > deinterleaved_2.fq
Download GNU parallel at https://www.gnu.org/software/parallel/.
Search for .bam files anywhere in the current directory recursively:
find . -name "*.bam"
Delete all .bam files:
find . -name "*.bam" | xargs rm
Rename all .txt files to .bak (backup *.txt before doing something else to them, for example):
find . -name "*.txt" | sed "s/\.txt$//" | xargs -i echo mv {}.txt {}.bak | sh
Chastity filter raw Illumina data (grep reads containing :N:, append (-A) the three lines after the match containing the sequence and quality info, and write a new filtered fastq file):
find *fq | parallel "cat {} | grep -A 3 '^@.*[^:]*:N:[^:]*:' | grep -v '^\-\-$' > {}.filt.fq"
Run FASTQC in parallel 12 jobs at a time:
find *.fq | parallel -j 12 "fastqc {} --outdir ."
Index your bam files in parallel, but only echo the commands (--dry-run) rather than actually running them:
find *.bam | parallel --dry-run 'samtools index {}'
Download seqtk at https://github.com/lh3/seqtk. Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files, which can also be optionally compressed by gzip.
Convert FASTQ to FASTA:
seqtk seq -a in.fq.gz > out.fa
Convert ILLUMINA 1.3+ FASTQ to FASTA and mask bases with quality lower than 20 to lowercase (the 1st command line) or to N (the 2nd):
seqtk seq -aQ64 -q20 in.fq > out.fa
seqtk seq -aQ64 -q20 -n N in.fq > out.fa
Fold long FASTA/Q lines and remove FASTA/Q comments:
seqtk seq -Cl60 in.fa > out.fa
Convert multi-line FASTQ to 4-line FASTQ:
seqtk seq -l0 in.fq > out.fq
Reverse complement FASTA/Q:
seqtk seq -r in.fq > out.fq
Extract sequences with names in file name.lst, one sequence name per line:
seqtk subseq in.fq name.lst > out.fq
Extract sequences in regions contained in file reg.bed:
seqtk subseq in.fa reg.bed > out.fa
Mask regions in reg.bed to lowercase:
seqtk seq -M reg.bed in.fa > out.fa
Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):
seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq
Trim low-quality bases from both ends using the Phred algorithm:
seqtk trimfq in.fq > out.fq
Trim 5bp from the left end of each read and 10bp from the right end:
seqtk trimfq -b 5 -e 10 in.fa > out.fa
Untangle an interleaved paired-end FASTQ file. If a FASTQ file has paired-end reads intermingled, and you want to separate them into separate /1 and /2 files, and assuming the /1 reads precede the /2 reads:
seqtk seq -l0 -1 interleaved.fq > deinterleaved_1.fq
seqtk seq -l0 -2 interleaved.fq > deinterleaved_2.fq
Print all sequences annotated in a GFF3 file.
cut -s -f 1,9 yourannots.gff3 | grep $'\t' | cut -f 1 | sort | uniq
Determine all feature types annotated in a GFF3 file.
grep -v '^#' yourannots.gff3 | cut -s -f 3 | sort | uniq
Determine the number of genes annotated in a GFF3 file.
grep -c $'\tgene\t' yourannots.gff3
Extract all gene IDs from a GFF3 file.
grep $'\tgene\t' yourannots.gff3 | perl -ne '/ID=([^;]+)/ and printf("%s\n", $1)'
Print the length of each gene in a GFF3 file.
grep $'\tgene\t' yourannots.gff3 | cut -s -f 4,5 | perl -ne '@v = split(/\t/); printf("%d\n", $v[1] - $v[0] + 1)'
Convert FASTA header lines to GFF format (assuming the length is in the header as an appended "_length", as in Velvet-assembled transcripts):
grep '>' file.fasta | awk -F "_" 'BEGIN{i=1; print "##gff-version 3"}{ print $0"\tBLAT\tEXON\t1\t"$10"\t95\t+\t.\tgene_id="$0";transcript_id=Transcript_"i;i++ }' > file.gff
Get a prompt that looks like user@hostname:/full/path/cwd/:$
export PS1="\u@\h:\w\\$ "
Never type cd ../../.. again (or use autojump, which enables you to navigate the filesystem faster):
alias ..='cd ..'
alias ...='cd ../../'
alias ....='cd ../../../'
alias .....='cd ../../../../'
alias ......='cd ../../../../../'
Ask before removing or overwriting files:
alias mv="mv -i"
alias cp="cp -i"
alias rm="rm -i"
My favorite ls aliases:
alias ls="ls -1p --color=auto"
alias l="ls -lhGgo"
alias ll="ls -lh"
alias la="ls -lhGgoA"
alias lt="ls -lhGgotr"
alias lS="ls -lhGgoSr"
alias l.="ls -lhGgod .*"
alias lhead="ls -lhGgo | head"
alias ltail="ls -lhGgo | tail"
alias lmore='ls -lhGgo | more'
Use cut on space- or comma-delimited files:
alias cuts="cut -d \" \""
alias cutc="cut -d \",\""
Pack and unpack tar.gz files:
alias tarup="tar -zcf"
alias tardown="tar -zxf"
Or use a generalized extract function:
# as suggested by Mendel Cooper in "Advanced Bash Scripting Guide"
extract () {
  if [ -f $1 ] ; then
    case $1 in
      *.tar.bz2) tar xvjf $1 ;;
      *.tar.gz)  tar xvzf $1 ;;
      *.tar.xz)  tar Jxvf $1 ;;
      *.bz2)     bunzip2 $1 ;;
      *.rar)     unrar x $1 ;;
      *.gz)      gunzip $1 ;;
      *.tar)     tar xvf $1 ;;
      *.tbz2)    tar xvjf $1 ;;
      *.tgz)     tar xvzf $1 ;;
      *.zip)     unzip $1 ;;
      *.Z)       uncompress $1 ;;
      *.7z)      7z x $1 ;;
      *)         echo "don't know how to extract '$1'..." ;;
    esac
  else
    echo "'$1' is not a valid file!"
  fi
}
Use mcd to create a directory and cd to it simultaneously:
function mcd { mkdir -p "$1" && cd "$1"; }
Go up to the parent directory and list its contents:
alias u="cd ..;ls"
Make grep pretty:
alias grep="grep --color=auto"
Refresh your .bashrc:
alias refresh="source ~/.bashrc"
Common typos:
alias mf="mv -i"
alias mroe="more"
alias c='clear'