Thursday, May 14, 2015

Meng's Notes: Simple Enrichment Test -- calculate hypergeometric...

Wednesday, December 19, 2012


Simple Enrichment Test -- calculate hypergeometric p-values in R


Hypergeometric tests are useful for enrichment analysis. For example,
given a gene list, you might want to know which functions
(GO terms) are enriched among those genes. The hypergeometric test (or its
equivalent, the one-tailed
Fisher's exact test) quantifies that enrichment as a p-value.



R provides the functions phyper and fisher.test
for the hypergeometric test and Fisher's exact test, respectively. However, it is
tricky to get the parameters right, so I spent some time working out the details.





Here is a simple example:

Five cards are drawn from a well-shuffled deck.

x = the number of diamonds among the selected cards.

We use a 2x2 table to represent the case:



                Diamond     Non-Diamond     Total
selected        x           5-x             5 cards sampled
left            13-x        34+x            47 cards left after sampling
total           13          39              52 cards



We are asking whether diamonds are enriched or depleted among the selected cards, compared to the background (the whole deck).
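
Before turning to p-values, it may help to see the distribution the test is built on. The snippet below is a minimal sketch, assuming the standard 52-card deck from the table; the variable names (n_diamond, n_other, n_drawn) are mine, chosen for illustration. It computes the probability of drawing exactly x diamonds with dhyper and checks it against the counting formula.

```r
# Deck parameters taken from the 2x2 table (names are illustrative)
n_diamond <- 13   # diamonds in the deck
n_other   <- 39   # non-diamonds in the deck
n_drawn   <- 5    # cards drawn

x <- 0:n_drawn

# dhyper(x, m, n, k): probability of exactly x diamonds among the drawn cards
pmf <- dhyper(x, n_diamond, n_other, n_drawn)

# The same probabilities from the counting formula:
# C(13, x) * C(39, 5 - x) / C(52, 5)
pmf_manual <- choose(n_diamond, x) * choose(n_other, n_drawn - x) /
  choose(n_diamond + n_other, n_drawn)

print(data.frame(x = x, prob = round(pmf, 4)))
print(all.equal(pmf, pmf_manual))
```

The one-tailed p-values discussed in this post are sums of these probabilities over one tail of x.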




Here are the different parameters used by phyper and fisher.test:


phyper(x, 13, 39, 5, lower.tail=TRUE);
# Numerical parameters in order:
# (success-in-sample, success-in-bkgd, failure-in-bkgd, sample-size).
fisher.test(matrix(c(x, 13-x, 5-x, 34+x), 2, 2), alternative='less');
# Numerical parameters in order:
# (success-in-sample, success-in-left-part, failure-in-sample, failure-in-left-part).
Note that the hypergeometric test compares the sample to the background, while
Fisher's exact test compares the sample to the part of the background left after
sampling without replacement. They give the same p-value because
they assume the same distribution.
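
The claim that both calls agree can be checked for every possible value of x. This is a small sketch using only the two calls shown above; the names x_values, p_phyper and p_fisher are introduced here for illustration.

```r
x_values <- 0:5   # every possible number of diamonds in a 5-card draw

# Lower-tail p-value from phyper: P[X <= x]
p_phyper <- sapply(x_values, function(x) {
  phyper(x, 13, 39, 5, lower.tail = TRUE)
})

# Same p-value from the one-tailed Fisher's exact test on the 2x2 table
p_fisher <- sapply(x_values, function(x) {
  tab <- matrix(c(x, 13 - x, 5 - x, 34 + x), 2, 2)
  fisher.test(tab, alternative = "less")$p.value
})

print(cbind(x = x_values, phyper = p_phyper, fisher = p_fisher))
print(all.equal(p_phyper, p_fisher))
```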


Here are the results:

x = 1;                        # x could be 0~5
hitInSample = x               # diamonds among the selected cards
hitInPop = 13                 # diamonds in the whole deck
failInPop = 52 - hitInPop     # non-diamonds in the whole deck
sampleSize = 5                # number of cards drawn
  • Test for under-representation (depletion)
phyper(hitInSample, hitInPop, failInPop, sampleSize, lower.tail=TRUE);
## [1] 0.6329532
fisher.test(matrix(c(hitInSample, hitInPop-hitInSample, sampleSize-hitInSample, failInPop-sampleSize+hitInSample), 2, 2), alternative='less')$p.value;
## [1] 0.6329532
  • Test for over-representation (enrichment)
phyper(hitInSample-1, hitInPop, failInPop, sampleSize, lower.tail=FALSE);
## [1] 0.7784664
fisher.test(matrix(c(hitInSample, hitInPop-hitInSample, sampleSize-hitInSample, failInPop-sampleSize+hitInSample), 2, 2), alternative='greater')$p.value;
## [1] 0.7784664
  •  Why hitInSample-1 when testing over-representation?
Because if lower.tail is TRUE (the default), phyper returns
P[X ≤ x]; otherwise it returns P[X > x]. Enrichment needs P[X ≥ x], which equals P[X > x-1], so we subtract 1 from x.
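
The off-by-one can be made concrete by summing the point probabilities directly. A short sketch for the x = 1 case; the names p_upper, p_sum and p_strict are illustrative.

```r
x <- 1

# P[X >= x], obtained as P[X > x - 1]
p_upper <- phyper(x - 1, 13, 39, 5, lower.tail = FALSE)

# The same tail, summed explicitly over x, x+1, ..., 5
p_sum <- sum(dhyper(x:5, 13, 39, 5))

# Without the "- 1" the observed value itself is left out: P[X > x]
p_strict <- phyper(x, 13, 39, 5, lower.tail = FALSE)

print(c(p_upper = p_upper, p_sum = p_sum, p_strict = p_strict))
```

Here p_upper and p_sum agree, while p_strict is smaller because it leaves out P[X = x]. Forgetting the subtraction therefore overstates the enrichment.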





So does fisher.test have any advantage over phyper, given that they give the same p-values?

Yes, fisher.test can do two other jobs: a two-sided test, and
confidence intervals for the odds ratio. Please refer to its manual for
details. For one-sided p-values, the two functions give identical
results as long as the correct parameters are used.
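
As a rough illustration of those two extra jobs, the sketch below runs the two-sided test on the x = 1 table and pulls out the odds ratio estimate and its confidence interval. The dimnames and the 0.95 confidence level are my own choices for readability, not something fixed by the example.

```r
# 2x2 table for x = 1 (filled column by column, as in the calls above)
tab <- matrix(c(1, 12, 4, 35), nrow = 2,
              dimnames = list(c("selected", "left"),
                              c("Diamond", "NonDiamond")))

res <- fisher.test(tab, alternative = "two.sided", conf.level = 0.95)

print(res$p.value)    # two-sided p-value
print(res$estimate)   # conditional maximum likelihood estimate of the odds ratio
print(res$conf.int)   # confidence interval for the odds ratio
```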

