Friday, December 21, 2012

heatmap


Heatmaps, or false color images, have a reasonably long history, as has the
notion of rearranging the columns and rows to show structure in the data.
They were applied to microarray data by Eisen et al. (1998) and have
become a standard visualization method for this type of data.
A heatmap is a two-dimensional, rectangular, colored grid. It displays
data that themselves come in the form of a rectangular matrix. The color
of each rectangle is determined by the value of the corresponding entry
in the matrix. The rows and columns of the matrix can be rearranged
independently. Usually they are reordered so that similar rows are placed
next to each other, and the same for columns. Among the orderings that
are widely used are those derived from a hierarchical clustering, but many
other orderings are possible. If hierarchical clustering is used, then it is
customary that the dendrograms are provided as well. In many cases the
resulting image has rectangular regions that are relatively homogeneous
and hence the graphic can aid in determining which rows (generally the
genes) have similar expression values within which subgroups of samples
(generally the columns).
The function heatmap is an implementation with many options. In particular,
users can control the ordering of rows and columns independently
from each other. They can use row and column labels of their own choosing
and select their own color scheme.
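As an aside, the core idea (bin each matrix entry's value into a fixed palette and paint the corresponding cell) can be sketched outside R in a few lines of Python; the palette and matrix below are invented for illustration:

```python
def value_to_color(value, vmin, vmax, palette):
    """Map a numeric value onto one entry of a discrete palette.

    Values are linearly binned between vmin and vmax; values outside
    the range are clamped to the first/last palette entry.
    """
    if vmax == vmin:
        return palette[0]
    # fraction of the way from vmin to vmax, clamped to [0, 1]
    frac = max(0.0, min(1.0, (value - vmin) / (vmax - vmin)))
    # scale to a palette index (the last bin absorbs frac == 1.0)
    index = min(int(frac * len(palette)), len(palette) - 1)
    return palette[index]

# a toy 2x3 "expression matrix" and a 4-colour palette (hypothetical values)
matrix = [[1.0, 2.5, 4.0],
          [0.0, 3.0, 2.0]]
palette = ["#0000FF", "#8888FF", "#FF8888", "#FF0000"]
colored = [[value_to_color(v, 0.0, 4.0, palette) for v in row]
           for row in matrix]
```

A real heatmap function additionally reorders rows and columns (e.g. by hierarchical clustering) before painting the grid.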
 
> library("ALL")
> data("ALL")
> selSamples <- ALL$mol.biol %in% c("ALL1/AF4",
+ "E2A/PBX1")
> ALLs <- ALL[, selSamples]
> ALLs$mol.biol <- factor(ALLs$mol.biol)
> colnames(exprs(ALLs)) <- paste(ALLs$mol.biol,
+ colnames(exprs(ALLs)))
> library("genefilter")
> meanThr <- log2(100)
> g <- ALLs$mol.biol
> s1 <- rowMeans(exprs(ALLs)[, g == levels(g)[1]]) >
+ meanThr
> s2 <- rowMeans(exprs(ALLs)[, g == levels(g)[2]]) >
+ meanThr
> s3 <- rowttests(ALLs, g)$p.value < 2e-04
> selProbes <- (s1 | s2) & s3
> ALLhm <- ALLs[selProbes, ]
> library(RColorBrewer)
> hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
> spcol <- ifelse(ALLhm$mol.biol == "ALL1/AF4",
+ "goldenrod", "skyblue")
> heatmap(exprs(ALLhm), col = hmcol, ColSideColors = spcol)
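The mean-filter part of the selection above (keep a probe if its group mean exceeds log2(100) in at least one group) can be sketched in plain Python; the labels and values below are invented for illustration:

```python
def group_means(values, labels):
    """Mean of `values` within each distinct label."""
    totals, counts = {}, {}
    for v, lab in zip(values, labels):
        totals[lab] = totals.get(lab, 0.0) + v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: totals[lab] / counts[lab] for lab in totals}

def passes_mean_filter(values, labels, threshold):
    """True if the row mean exceeds `threshold` in at least one group."""
    return any(m > threshold for m in group_means(values, labels).values())

# hypothetical log2 expression values for one probe across 4 samples
labels = ["ALL1/AF4", "ALL1/AF4", "E2A/PBX1", "E2A/PBX1"]
row = [7.1, 7.3, 5.0, 5.2]
keep = passes_mean_filter(row, labels, threshold=6.64)  # log2(100) is about 6.64
```

The R code combines this criterion (s1 | s2) with a t-test p-value cut-off (s3) to get the final probe selection.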

Finally, you can run
> help(heatmap)
to see the help page and the examples it provides. You can also work through the following example:


Using R to draw a Heatmap from Microarray Data 


The first section of this page uses R to analyse an Acute lymphocytic leukemia (ALL) microarray dataset, producing a heatmap (with dendrograms) of genes differentially expressed between two types of leukemia.
There is a follow on page dealing with how to do this from Python using RPy.
The original citation for the raw data is "Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival" by Chiaretti et al. Blood 2004. (PMID: 14684422)
The analysis is a "step by step" recipe based on this paper, Bioconductor: open software development for computational biology and bioinformatics, Gentleman et al. 2004. Their Figure 2 Heatmap, which we recreate, is reproduced here:
[Published Heatmap, Gentleman et al. 2004]

Heatmaps from R
Assuming you have a recent version of R (from The R Project) and BioConductor (see Windows XP installation instructions), the example dataset can be downloaded as the BioConductor ALL package.
You should be able to install this from within R as follows:
> source("http://www.bioconductor.org/biocLite.R")
> biocLite("ALL")

Running bioCLite version 0.1  with R version  2.1.1 
... 
Alternatively, you can download the package by hand from here or here.
If you are using Windows, download ALL_1.0.2.zip (or similar) and save it. Then from within the R program, use the menu option "Packages", "Install package(s) from local zip files..." and select the ZIP file.
On Linux, download ALL_1.0.2.tar.gz (or similar) and use sudo R CMD INSTALL ALL_1.0.2.tar.gz at the command prompt.
With that out of the way, you should be able to start R and load this package with the library and data commands:
> library("ALL")
Loading required package: Biobase
Loading required package: tools
Welcome to Bioconductor 
         Vignettes contain introductory material.  To view, 
         simply type: openVignette() 
         For details on reading vignettes, see
         the openVignette help page.
> data("ALL")
If you inspect the resulting ALL variable, it contains 128 samples with 12625 genes, and associated phenotypic data.
> ALL
Expression Set (exprSet) with 
        12625 genes
        128 samples
                 phenoData object with 21 variables and 128 cases
         varLabels
                cod:  Patient ID
                diagnosis:  Date of diagnosis
                sex:  Gender of the patient
                age:  Age of the patient at entry
                BT:  does the patient have B-cell or T-cell ALL
                remission:  Complete remission(CR), refractory(REF) or NA. Derived from CR
                CR:  Original remisson data
                date.cr:  Date complete remission if achieved
                t(4;11):  did the patient have t(4;11) translocation. Derived from citog
                t(9;22):  did the patient have t(9;22) translocation. Derived from citog
                cyto.normal:  Was cytogenetic test normal? Derived from citog
                citog:  original citogenetics data, deletions or t(4;11), t(9;22) status
                mol.biol:  molecular biology
                fusion protein:  which of p190, p210 or p190/210 for bcr/able
                mdr:  multi-drug resistant
                kinet:  ploidy: either diploid or hyperd.
                ccr:  Continuous complete remission? Derived from f.u
                relapse:  Relapse? Derived from f.u
                transplant:  did the patient receive a bone marrow transplant? Derived from f.u
                f.u:  follow up data available
                date last seen:  date patient was last seen
We can look at the results of molecular biology testing for the 128 samples:
> ALL$mol.biol
  [1] BCR/ABL  NEG      BCR/ABL  ALL1/AF4 NEG      NEG      NEG      NEG      NEG     
 [10] BCR/ABL  BCR/ABL  NEG      E2A/PBX1 NEG      BCR/ABL  NEG      BCR/ABL  BCR/ABL 
 [19] BCR/ABL  BCR/ABL  NEG      BCR/ABL  BCR/ABL  NEG      ALL1/AF4 BCR/ABL  ALL1/AF4
      ...  
[127] NEG      NEG     
Levels: ALL1/AF4 BCR/ABL E2A/PBX1 NEG NUP-98 p15/p16
Ignoring the samples which came back negative on this test (NEG), most have been classified as having a translocation between chromosomes 9 and 22 (BCR/ABL), or a translocation between chromosomes 4 and 11 (ALL1/AF4).
For the purposes of this example, we are only interested in these two subgroups, so we will create a filtered version of the dataset using this as the selection criterion:
> eset <- ALL[, ALL$mol.biol %in% c("BCR/ABL", "ALL1/AF4")] 
The resulting variable, eset, contains just 47 samples - each with the full 12,625 gene expression levels.
This is far too much data to draw a heatmap with, but we can do one for the first 100 genes as follows:
> heatmap(exprs(eset[1:100,])) 
According to the BioConductor paper we are following, the next step in the analysis was to use the lmFit function (from the limma package) to look for genes differentially expressed between the two groups. The fitted model object is further processed by the eBayes function to produce empirical Bayes test statistics for each gene, including moderated t-statistics, p-values and log-odds of differential expression.
> library(limma)
> f <- factor(as.character(eset$mol.biol))
> design <- model.matrix(~f)
> fit <- eBayes(lmFit(eset,design))
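model.matrix(~f) produces a treatment-coded design matrix: an intercept column plus a 0/1 indicator column for each non-reference level of the factor. A minimal Python sketch of that coding (not limma itself):

```python
def design_matrix(labels):
    """Treatment coding like R's model.matrix(~f): an intercept column plus
    one 0/1 indicator column per non-reference factor level."""
    levels = sorted(set(labels))          # R sorts factor levels alphabetically
    others = levels[1:]                   # first level is the reference
    return [[1] + [1 if lab == lev else 0 for lev in others]
            for lab in labels]

rows = design_matrix(["ALL1/AF4", "BCR/ABL", "ALL1/AF4", "BCR/ABL"])
# first column is the intercept; second is the BCR/ABL indicator
```

The coefficient fitted for that indicator column is the group difference that topTable reports below.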
If the limma package isn't installed, you'll need to install it first:
> source("http://www.bioconductor.org/biocLite.R")
> biocLite("limma")

Running bioCLite version 0.1  with R version  2.1.1 
... 
We can now reproduce Figure 1 from the paper.
> topTable(fit, coef=2)
              ID         M        A         t      P.Value        B
1016     1914_at -3.076231 4.611284 -27.49860 5.887581e-27 56.32653
7884    37809_at -3.971906 4.864721 -19.75478 1.304570e-20 44.23832
6939    36873_at -3.391662 4.284529 -19.61497 1.768670e-20 43.97298
10865   40763_at -3.086992 3.474092 -17.00739 7.188381e-18 38.64615
4250    34210_at  3.618194 8.438482  15.45655 3.545401e-16 35.10692
11556   41448_at -2.500488 3.733012 -14.83924 1.802456e-15 33.61391
3389    33358_at -2.269730 5.191015 -12.96398 3.329289e-13 28.76471
8054    37978_at -1.036051 6.937965 -10.48777 6.468996e-10 21.60216
10579 40480_s_at  1.844998 7.826900  10.38214 9.092033e-10 21.27732
330      1307_at  1.583904 4.638885  10.25731 1.361875e-09 20.89145
The leftmost numbers are row indices, ID is the Affymetrix HGU95av2 accession number, M is the log ratio of expression, A is the log average expression, t the moderated t-statistic, and B is the log odds of differential expression.
Next, we select those genes that have adjusted p-values below 0.05, using a very stringent Holm method to select a small number (165) of genes.
> selected <- p.adjust(fit$p.value[, 2]) < 0.05
> esetSel <- eset[selected, ]
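p.adjust defaults to the Holm step-down method: sort the p-values, multiply the i-th smallest by (m - i + 1), enforce monotonicity, and cap at 1. A Python sketch of that procedure:

```python
def holm_adjust(pvalues):
    """Holm step-down adjustment, mirroring R's p.adjust(method = "holm")."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted

# selection then mirrors the R code: keep entries with adjusted p < 0.05
adjusted = holm_adjust([0.001, 0.30, 0.04])
selected = [p < 0.05 for p in adjusted]
```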
The variable esetSel has data on (only) 165 genes for all 47 samples. We can easily produce a heatmap as follows (in R-2.1.1 this defaults to a yellow/red "heat" colour scheme):
> heatmap(exprs(esetSel))
[Heatmap picture, default colours]
If you have the topographical colours installed (yellow-green-blue), you can do this:
> heatmap(exprs(esetSel), col=topo.colors(100)) 
[Heatmap figure]
This is getting very close to Gentleman et al.'s Figure 2, except they have added a red/blue banner across the top to really emphasize how the hierarchical clustering has correctly split the data into the two groups (10 and 37 patients).
To do that, we can use the heatmap function's optional argument ColSideColors. I created a small function to map the esetSel$mol.biol values to red (#FF0000) and blue (#0000FF), which we can apply to each of the molecular biology results to get a matching list of colours for our columns:
> color.map <- function(mol.biol) { if (mol.biol=="ALL1/AF4") "#FF0000" else "#0000FF" }
> patientcolors <- unlist(lapply(esetSel$mol.biol, color.map))
> heatmap(exprs(esetSel), col=topo.colors(100), ColSideColors=patientcolors)
[Heatmap figure]
Looks pretty close now, doesn't it?
[Heatmap figure]
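Incidentally, the color.map step is just a value-to-colour lookup, and the same idea reads almost identically in Python (sample labels invented for illustration):

```python
def color_map(mol_biol):
    """Red for ALL1/AF4, blue for everything else (here, BCR/ABL)."""
    return "#FF0000" if mol_biol == "ALL1/AF4" else "#0000FF"

labels = ["BCR/ABL", "ALL1/AF4", "BCR/ABL"]
patient_colors = [color_map(lab) for lab in labels]
```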
To recap, this is "all" we needed to type into R to achieve this:
library("ALL")
data("ALL")
eset <- ALL[, ALL$mol.biol %in% c("BCR/ABL", "ALL1/AF4")]
library("limma")
f <- factor(as.character(eset$mol.biol))
design <- model.matrix(~f)
fit <- eBayes(lmFit(eset,design))
selected <- p.adjust(fit$p.value[, 2]) < 0.05
esetSel <- eset[selected, ]
color.map <- function(mol.biol) { if (mol.biol=="ALL1/AF4") "#FF0000" else "#0000FF" }
patientcolors <- unlist(lapply(esetSel$mol.biol, color.map))
heatmap(exprs(esetSel), col=topo.colors(100), ColSideColors=patientcolors)

Heatmaps in R - More Options
One subtle point in the previous examples is that the heatmap function has automatically scaled the colours for each row (i.e. each gene has been individually normalised across patients). This can be disabled using scale="none", which you might want to do if you have already done your own normalisation (or if this scaling is not appropriate for your data):
heatmap(exprs(esetSel), col=topo.colors(75), scale="none", ColSideColors=patientcolors, cexRow=0.5)
[Heatmap figure]
You might also have noticed in the above snippet that I have shrunk the row captions, which were so big that they overlapped each other. The relevant options are cexRow and cexCol.
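For reference, the per-row scaling that heatmap performs (centre each gene and divide by its standard deviation across patients) can be sketched in Python:

```python
def scale_rows(matrix):
    """Centre each row and divide by its sample standard deviation,
    mimicking heatmap's default scale = "row" behaviour."""
    scaled = []
    for row in matrix:
        n = len(row)
        mean = sum(row) / n
        variance = sum((x - mean) ** 2 for x in row) / (n - 1)
        sd = variance ** 0.5
        scaled.append([(x - mean) / sd for x in row])
    return scaled

scaled = scale_rows([[1.0, 2.0, 3.0],
                     [10.0, 20.0, 30.0]])
# both rows now have mean 0 and unit variance
```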
So far so good - but what if you wanted a key to the colours shown? The heatmap function doesn't offer this, but the good news is that heatmap.2 from the gplots library does. In fact, it offers a lot of other features, many of which I deliberately turn off in the following example:
library("gplots")
heatmap.2(exprs(esetSel), col=topo.colors(75), scale="none", ColSideColors=patientcolors,
          key=TRUE, symkey=FALSE, density.info="none", trace="none", cexRow=0.5)
[Heatmap picture, topographical colours WITHOUT scaling, with patient type colour bar and color key]
By default, heatmap.2 will also show a trace on each data point (remove this with trace="none"). If you ask for a key (using key=TRUE), this function will actually give you a combined "color key and histogram", but that can be overridden (with density.info="none").
Don't like the colour scheme? Try using the functions bluered/redblue for a red-white-blue spread, or redgreen/greenred for the red-black-green colour scheme often used with two-colour microarrays:
library("gplots")
heatmap.2(exprs(esetSel), col=redgreen(75), scale="row", ColSideColors=patientcolors,
           key=TRUE, symkey=FALSE, density.info="none", trace="none", cexRow=0.5)
[Heatmap figure]

Heatmaps from Python
So, how can we do that from within Python? One way is using RPy (R from Python), and this is discussed on this page.
P.S. If you want to use heatmap.2 from within python using RPy, use the syntax heatmap_2 due to the differences in how R and Python handle full stops and underscores.

What about other microarray data?
Well, I have also documented how you can load NCBI GEO SOFT files into R as a BioConductor expression set object. As long as you can get your data into R as a matrix or data frame, converting it into an exprSet shouldn't be too hard.

Wednesday, December 19, 2012

Python: sort a list and change another one consequently


You could zip the lists and sort the result. Sorting a list of tuples sorts on the first member by default (ties are broken by the subsequent members).
>>> xs = [3,2,1]
>>> ys = [1,2,3]
>>> points = zip(xs,ys)
>>> points
[(3, 1), (2, 2), (1, 3)]
>>> sorted(points)
[(1, 3), (2, 2), (3, 1)]
And then to unpack them again:
>>> sorted_points = sorted(points)
>>> new_xs = [point[0] for point in sorted_points]
>>> new_ys = [point[1] for point in sorted_points]
>>> new_xs
[1, 2, 3]
>>> new_ys
[3, 2, 1]
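One caveat: the transcript above is from Python 2, where zip() returns a list; in Python 3 it returns an iterator, so wrap it in list() if you want to inspect it. The unpacking step can also be written more compactly with the zip(*...) idiom:

```python
xs = [3, 2, 1]
ys = [1, 2, 3]

# pair the lists up and sort by the first member of each tuple
sorted_points = sorted(zip(xs, ys))

# zip(*...) transposes the list of pairs back into two tuples
new_xs, new_ys = zip(*sorted_points)

print(list(new_xs))  # [1, 2, 3]
print(list(new_ys))  # [3, 2, 1]
```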

Monday, December 17, 2012

python_course




http://www.lynda.com/Python-3-tutorials/essential-training/62226-2.html?utm_source=msn&utm_medium=cpc&utm_campaign=Search-Dev-Python&utm_content=1743940313&utm_term=python%20course


http://mlg.ucd.ie/courses/comp50050.html

Friday, December 14, 2012

Thursday, December 13, 2012

python3_learning_1


Getting started

To write a new Python program, create a new text file in your favourite editor, type (or copy & paste) some Python code, then save it with a filename ending in .py. Open a terminal window, navigate to the directory where you saved the file, and type the name of the Python executable, followed by the name of your file. On most systems, the Python 3 executable is called python3, so we will type:
python3 myscript.py
Note: I will sometimes refer in these notes to a ‘script’, which is pretty much a synonym for ‘program’.
Wherever you see a box in the notes that is prefixed with In[] we are looking at Python code, which can be copied and pasted into a text file to run it.

Printing and comments

To print some output to the screen in Python we use the print() function:
print("hello!")
Notice that
  • print is a function
  • we write the name of the function, followed by a pair of brackets
  • the stuff inside the brackets is the arguments to the function (sometimes called parameters)
  • in this case we have one argument – the thing we want to print
  • the text we want to print is inside double quotes (this is called a string)
  • the line ends with a newline (carriage return, enter)
  • the text shows up in different colours (syntax highlighting)
  • a complete line like this is called a statement
If we want to leave a note for ourselves, we start the line with a #. This is called a comment. We will make extensive use of comments in the code samples throughout these notes. Often, we write a comment which describes what the next line will do when it is executed:
# this next line will print a greeting
print("hello!")
This is a good habit to get into!
Here are a few code samples that will cause stuff to be printed to the screen. Copy and paste them into text files and try running the code. The output for each code sample is displayed below the box.
In [26]:
print("hello!")
hello!
In [27]:
#this is a comment - it is for humans to read
print("hello again")
hello again
In [28]:
#we can prevent a statement from being run by putting a # at the start of the line
#print("hello again")
print("goodbye")
goodbye
In [31]:
#we can use single quotes as well - sometimes this is easier
print('you say goodbye, I say hello')
#for instance, if we want to include a double quote in the output
print('you say "goodbye", I say "hello"')
you say goodbye, I say hello
you say "goodbye", I say "hello"
In [32]:
#if we forget the quotes we will get an error
print(hello)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/home/martin/Dropbox/projects/python_course/<ipython-input-32-d74c59043d06> in <module>()
      1 #if we forget the quotes we will get an error

----> 2 print(hello)

NameError: name 'hello' is not defined
In [33]:
#if we forget the brackets we will also get an error
print 'Hello'
  File "<ipython-input-33-dcd43251734a>", line 2
    print 'Hello'
                ^
SyntaxError: invalid syntax

Variables

We can give a name to a string. The name is called a variable
name = 'Martin'
We can concatenate two strings (stick them together) with +
name = 'Mar' + 'tin'
In [34]:
# once we have created a variable, we can use it in print
name = 'Martin'
print(name)
Martin
In [35]:
# we can change a variable after we have made it
name = 'Martin'
print(name)
name = 'John'
print(name)
Martin
John
In [36]:
# we can use concatenation to make new strings which we can print
print('hello ' + 'world')
# we are not allowed to use spaces in variable names so use an underscore
first_name = 'Martin'
print('hello ' +  first_name)
last_name = 'Jones'
# we can use a mixture of strings and variables 
print('hello ' + first_name + last_name)
# that is not quite right, we need a space between first and last names
print('hello ' + first_name + ' ' + last_name)
hello world
hello Martin
hello MartinJones
hello Martin Jones
In [37]:
# we can also store a new string in a variable and use it later
first_name = 'Martin'
last_name = 'Jones'
full_name = first_name + ' ' + last_name
print('hello ' + full_name)
hello Martin Jones

Strings and functions

To get the length of a string we use
len('hello Martin')
we can also use a variable
len(first_name)
len() is a function and the string that we give it is a parameter.
Another useful function is str(), which turns a number into a string.
str(42)
In [38]:
# we can print the output of a function
print(len('hello'))
# or we can store it
name_length = len('Martin')
print(name_length)
5
6
In [39]:
# the output from len() is not a string - it is a number
# so python will complain if we try to use it like a string
print('your name has this many letters: ' + len('martin'))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/martin/Dropbox/projects/python_course/<ipython-input-39-4694ed1b9d25> in <module>()
      1 # the output from len() is not a string - it is a number

      2 # so python will complain if we try to use it like a string

----> 3 print('your name has this many letters: ' + len('martin'))

TypeError: Can't convert 'int' object to str implicitly
In [40]:
# we have to use str() to turn it into a string
name_length = len('Martin')
print('your name has this many letters: ' + str(name_length))
your name has this many letters: 6
In [41]:
# we can do everything in one go like this
# we use the output of len() as the input for str()
print('your name has this many letters: ' + str(len('martin')))
# when we write it this way we have three consecutive close brackets at the end of the line
# make sure you have the same number of ( and )
your name has this many letters: 6

Reading files

To open a text file, we use the open() function.
file = open('myfile.name')
open() takes one argument – the filename.
The output (or return value) is an object that represents the file.
To read the contents of a file, use the read() method on the file object.
contents = file.read()
This is slightly different to using a function. The read method belongs to the variable which holds the file object, so we write the name of the variable first, then a dot and the method name.
In [42]:
# open a file called input.txt in the current directory
file = open('input.txt')
# read the contents of the file and store them in a variable called contents
contents = file.read()
# print the contents
print(contents)
Martin Jones
In [43]:
# note that we have to print the variable that holds the contents of the file
# if we try to print the file we don't get what we want

file = open('input.txt')
print(file)
<_io.TextIOWrapper name='input.txt' mode='r' encoding='UTF-8'>
In [44]:
# however, we can call read() and give it to print() right away
file = open('input.txt')
print(file.read())
Martin Jones
In [45]:
# we will get an error if we try to open a file that doesn't exist
file = open('missingfile.txt')
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
/home/martin/Dropbox/projects/python_course/<ipython-input-45-eed697387c19> in <module>()
      1 # we will get an error if we try to open a file that doesn't exist

----> 2 file = open('missingfile.txt')

IOError: [Errno 2] No such file or directory: 'missingfile.txt'

Writing to files

The open() function is actually more complicated than described above.
We can use a second argument, which is a string describing what we want to do to the file.
In this case, we want to write to the file, and we want it to be a text file, so we use the string "wt":
file = open('output.txt', 'wt')
There is a write() method that is similar to the read() method
file.write('Hello Martin')
When we have finished writing to a file, we call the close() method
file.close()
The close() method is interesting because it does not take any arguments, and we do not store the result. For small scripts, it’s not strictly necessary, but it is a good habit to get into.
In [46]:
# open an output file
file = open('output.txt', 'wt')
# write some text to it
file.write('Hello Martin')
# close the file
file.close()
In [47]:
# we can also append to an existing file by using a instead of w
file = open('output2.txt', 'at')
# every time we run this bit of code we will add a new line to the file
file.write("Hello again\n")
# the \n is a special combination of characters - it means start a new line
file.close()
In [48]:
# we can do all the string stuff we already know about in write()
first_name = 'Martin'
last_name = 'Jones'
file = open('output3.txt', 'wt')
file.write('hello ' + first_name + ' ' + last_name)
file.close()

A small program

Here is a program that reads a first name from one file and a last name from a second file. It then opens a third file and writes a greeting to it.
We need to use one more method in this program – rstrip(). This method removes characters from the right-hand end of a string. The argument is the character that we want to remove.
In this case we want to remove a newline from the end of the string, so we call first_name.rstrip('\n'). Remember that '\n' means a newline.
If we left out the rstrip then we would have an unwanted newline in the middle of our output.
Rather than storing the result in a variable, we use it straight away as part of the write() call.
Notice that
  • we use a different variable name for each file that we open
  • we have put in some blank lines to separate the different parts of the program
  • there is no connection between the name of the variable first_name_file and the filename firstname.txt. We could replace all instances of first_name_file with banana and the script would work just as well (but it would be harder to read!)
The line that does the writing is a little bit difficult to read; we will learn a better way of doing this in a future session.
In [49]:
# here we read a first and last name from two different files
# then write a greeting to a third file
first_name_file = open('firstname.txt')
first_name = first_name_file.read()

last_name_file = open('lastname.txt')
last_name = last_name_file.read()

output_file = open('fullname.txt', 'wt')
output_file.write('hello ' + first_name.rstrip('\n') + ' ' + last_name + "\n")
output_file.close()


Friday, December 7, 2012

draw a fig with R


> calpoints <- locator(n=4,type='p',pch=16,col='white',lwd=0.1)
> calpoints
$x
[1] 1.405103 2.325401 2.007167 2.927466

$y
[1]  1.9251696  1.9184308 -0.7501436 -0.7366660
> setEPS()
> postscript(file="aa.eps", onefile=FALSE, horizontal=FALSE)
> boxplot(golub[1042,] ~  gol.fac, xlim = c(0,5))
> arrows(1.405103,1.925169,2.325401,1.9184308,angle=30,code=1,col="red")
> arrows(2.007167,-0.7501436,2.927466,-0.7366660,angle=30,code=1,col="red")
> text(3.0,1.9184308,"Median")
> text(3.3,-0.7366660,"Outlier")
> dev.off()



par(mfrow=c(2,1)) divides the plotting device into 2 rows and one column.

locator() function in R


Calibration

So far, so good. The next step is to calibrate the graphic, by adding four calibration points of known coordinates. Because it is not always easy to know both coordinates of a point, we will use four calibration points. For the first pair, we will know the x position, and for the second pair, the y position. That allows us to place the points directly on the axis.
calpoints <- locator(n=4,type='p',pch=4,col='blue',lwd=2)
We can see the current values of the calibration points:
as.data.frame(calpoints)
          x        y
1 139.66429  73.8975
2 336.38388  73.8975
3  58.72237 167.0254
4  58.72237 328.1680

Point'n'click

The third step is to click each point, individually, in order to get the data.
data <- locator(type='p',pch=1,col='red',lwd=1.2,cex=1.2)
After clicking all the points, you should have the following graph:
digit.png
Our data are, so far:
as.data.frame(data)
          x         y
1  104.8285  78.08303
2  138.6397 114.70636
3  171.4263 119.93826
4  205.2375 266.43158
5  238.0241 267.47796
6  270.8107 275.84901
7  302.5727 282.12729
8  336.3839 298.86939
9  370.1951 306.19405
10 401.9571 352.23481
OK, this is nearly what we want. What is left is just to write a function that will convert our data into the true coordinates.
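That conversion is just linear interpolation between the calibration points. A Python sketch, assuming (hypothetically) that the two x-axis calibration clicks correspond to axis values 0 and 100:

```python
def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping device coordinates to axis values,
    given two calibration points of known axis value."""
    slope = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + slope * (pixel - pixel_a)

# the first two calibration clicks from the text, with assumed axis values
to_x = make_axis_map(139.66429, 0.0, 336.38388, 100.0)

x_data = [to_x(p) for p in (139.66429, 238.02414, 336.38388)]
```

The same construction with the two y-axis calibration points gives the y conversion, and applying both maps to the clicked data recovers the plotted values.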