evolution & bioinformatics: Meng's Notes: Simple Enrichment Test -- calculate hypergeometric...


Wednesday, December 19, 2012

Simple Enrichment Test -- calculate hypergeometric p-values in R

Hypergeometric tests are useful for enrichment analysis. For example,
given a gene list, you might want to know which functions (GO terms)
are enriched among those genes. The hypergeometric test (or its
equivalent, the one-tailed Fisher's exact test) expresses the
statistical confidence as a p-value.

R provides the functions phyper and fisher.test for the hypergeometric
test and Fisher's exact test, respectively. However, it is tricky to
get the arguments right. I spent some time working them out.

Here is a simple example:

Five cards are drawn from a well-shuffled deck.

x = the number of diamonds selected.

We use a 2x2 table to represent the case:

              Diamond    Non-Diamond    Total
selected      x          5-x            5 sampled cards
left          13-x       34+x           47 cards left after sampling
total         13         39             52 cards

We are asking whether diamonds are enriched or depleted among the selected cards, compared with the background (the whole deck).
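To make the distribution behind this table concrete, here is a minimal sketch that writes out the probability of drawing exactly x diamonds with binomial coefficients and compares it with R's built-in dhyper. The variable names (diamonds, nonDiamonds, drawn) are my own and only used for illustration.

```r
# Deck parameters from the card example.
diamonds    <- 13   # successes in the background
nonDiamonds <- 39   # failures in the background
drawn       <- 5    # sample size

x <- 0:drawn        # every possible number of diamonds in the hand

# Probability of exactly x diamonds, written out with binomial coefficients:
# choose(13, x) * choose(39, 5 - x) / choose(52, 5)
pmfByHand <- choose(diamonds, x) * choose(nonDiamonds, drawn - x) /
  choose(diamonds + nonDiamonds, drawn)

# The same probabilities from the built-in hypergeometric density.
pmfBuiltIn <- dhyper(x, diamonds, nonDiamonds, drawn)

print(round(data.frame(x, pmfByHand, pmfBuiltIn), 4))
```

The two columns should agree, and the probabilities should sum to 1 over x = 0..5.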

Here are the different parameters used by phyper and fisher.test:

phyper(x, 13, 39, 5, lower.tail=TRUE);

# Numerical parameters in order:

# (success-in-sample, success-in-bkgd, failure-in-bkgd, sample-size).

fisher.test(matrix(c(x, 13-x, 5-x, 34+x), 2, 2), alternative='less');

# Numerical parameters in order:

# (success-in-sample, success-in-left-part, failure-in-sample, failure-in-left-part).

Note the difference: phyper is parameterized by the sample and the
whole background, while fisher.test is parameterized by the sample and
the part of the background left after sampling without replacement.
They give the same p-value because they assume the same distribution.
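The agreement between the two parameterizations can be checked for every possible x. This is a small sketch for the depletion direction only; compareTests is a hypothetical helper name, not something from R itself.

```r
hitInPop   <- 13   # diamonds in the deck
failInPop  <- 39   # non-diamonds in the deck
sampleSize <- 5    # cards drawn

# One-sided (depletion) p-values from both functions for x diamonds in hand.
compareTests <- function(x) {
  # Columns: diamond / non-diamond. Rows: selected / left.
  tab <- matrix(c(x, hitInPop - x,
                  sampleSize - x, failInPop - sampleSize + x), 2, 2)
  c(x = x,
    phyper = phyper(x, hitInPop, failInPop, sampleSize, lower.tail = TRUE),
    fisher = fisher.test(tab, alternative = "less")$p.value)
}

res <- t(sapply(0:sampleSize, compareTests))
print(res)
```

Each row should show the same value in the phyper and fisher columns.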

Here are the results:

x=1; # x could be 0~5

hitInSample = 1 # could be 0~5

hitInPop = 13

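failInPop = 52-hitInPop # a deck has 52 cards, so 39 non-diamonds

sampleSize = 5

Test for under-representation (depletion)

# P[X <= hitInSample]: the observed count is passed unchanged here.

phyper(hitInSample, hitInPop, failInPop, sampleSize, lower.tail= TRUE);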

## [1] 0.6329532

fisher.test(matrix(c(hitInSample, hitInPop-hitInSample, sampleSize-hitInSample, failInPop-sampleSize +hitInSample), 2, 2), alternative='less')$p.value;

## [1] 0.6329532

Test for over-representation (enrichment)

phyper(hitInSample-1, hitInPop, failInPop, sampleSize, lower.tail= FALSE);

## [1] 0.7784664

fisher.test(matrix(c(hitInSample, hitInPop-hitInSample, sampleSize-hitInSample, failInPop-sampleSize +hitInSample), 2, 2), alternative='greater')$p.value;

## [1] 0.7784664

Why hitInSample-1 when testing over-representation?

Because phyper returns P[X ≤ x] when lower.tail is TRUE (the default)
and P[X > x] otherwise. Enrichment needs P[X ≥ x], which equals
P[X > x-1], so we subtract 1 from x.
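The off-by-one can be seen by summing the point probabilities directly. This is a sketch using the card example with x = 1; the variable names upperBySum, upperByPhyper and upperNoShift are mine.

```r
hitInPop   <- 13
failInPop  <- 39
sampleSize <- 5
x          <- 1

# P[X >= x], summed directly from the point probabilities.
upperBySum <- sum(dhyper(x:sampleSize, hitInPop, failInPop, sampleSize))

# lower.tail = FALSE gives P[X > q], so q = x - 1 yields P[X >= x].
upperByPhyper <- phyper(x - 1, hitInPop, failInPop, sampleSize, lower.tail = FALSE)

# Without the shift the observed value x itself is excluded: this is P[X > x].
upperNoShift <- phyper(x, hitInPop, failInPop, sampleSize, lower.tail = FALSE)

print(c(bySum = upperBySum, shifted = upperByPhyper, unshifted = upperNoShift))
```

The first two numbers should match; the third is smaller because it leaves out P[X = x].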

So does fisher.test have any advantage over phyper, given that they give the same p-values?

Yes, fisher.test can do two other jobs: a two-sided test, and
confidence intervals for the odds ratio. Please refer to its manual for
details. For one-sided p-values, the two functions give identical
results as long as the correct parameters are used.
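As a rough illustration of those two extra jobs, here is a sketch that runs a two-sided fisher.test on the same card table with x = 1. The dimnames are only labels I added for readability; conf.level = 0.95 is the function's default, written out explicitly.

```r
# 2x2 table for x = 1 diamond among 5 drawn cards.
tab <- matrix(c(1, 12, 4, 35), 2, 2,
              dimnames = list(c("selected", "left"),
                              c("diamond", "nonDiamond")))

res <- fisher.test(tab, alternative = "two.sided", conf.level = 0.95)

print(res$p.value)   # two-sided p-value
print(res$estimate)  # conditional maximum likelihood estimate of the odds ratio
print(res$conf.int)  # confidence interval for the odds ratio
```

phyper has no equivalent of the estimate and conf.int components; it only returns tail probabilities.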

evolution & bioinformatics

Thursday, May 14, 2015
