Wednesday, December 19, 2012
Simple Enrichment Test -- calculate hypergeometric p-values in R
Having a gene list in hand, people often want to know which functions
(GO terms) are enriched among these genes. The hypergeometric test (or its
equivalent, the one-tailed Fisher's exact test) gives you a measure of
statistical confidence.
R provides the functions phyper and fisher.test
for the hypergeometric test and Fisher's exact test, respectively. However, it is
tricky to get the parameters right, so I spent some time working it out.
Here is a simple example:
Five cards are drawn from a well-shuffled deck of 52 cards, and
x = the number of diamonds selected.
We use a 2x2 table to represent the case:
            Diamond   Non-Diamond
selected    x         5-x           total: 5 sampled cards
left        13-x      34+x          total: 47 cards left after sampling
total       13        39            total: 52 cards
We are asking whether diamonds are enriched or depleted in our selected cards compared to the background.
Here are the different parameters used by phyper and fisher.test:
phyper(x, 13, 39, 5, lower.tail=TRUE);
# Numerical parameters in order:
# (success-in-sample, success-in-bkgd, failure-in-bkgd, sample-size).
fisher.test(matrix(c(x, 13-x, 5-x, 34+x), 2, 2), alternative='less');
# Numerical parameters in order:
# (success-in-sample, success-in-left-part, failure-in-sample, failure-in-left-part).
The hypergeometric test compares the sample to the whole background, while
Fisher's exact test compares the sample to what is left of the background after
sampling without replacement. They give the same p-value because
they assume the same distribution.
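To check that claim rather than take it on faith, here is a small sketch that runs both functions for every possible x in the card example and compares the lower-tail p-values. The variable names (xs, pHyper, pFisher) are my own choices for illustration.

```r
# Deck parameters from the card example
hitInPop   <- 13   # diamonds in the deck
failInPop  <- 39   # non-diamonds in the deck
sampleSize <- 5    # cards drawn

xs <- 0:sampleSize   # every possible number of diamonds in the sample

# Lower tail P[X <= x] from the hypergeometric distribution
pHyper <- phyper(xs, hitInPop, failInPop, sampleSize, lower.tail = TRUE)

# Same tail from Fisher's exact test; matrix() fills the 2x2 table column by column
pFisher <- sapply(xs, function(x) {
  tab <- matrix(c(x, hitInPop - x,
                  sampleSize - x, failInPop - sampleSize + x), 2, 2)
  fisher.test(tab, alternative = "less")$p.value
})

print(data.frame(x = xs, phyper = pHyper, fisher.test = pFisher))

# Stops with an error if the two columns differ beyond numerical tolerance
stopifnot(isTRUE(all.equal(pHyper, pFisher)))
```

The two columns should agree for every x, up to floating-point tolerance.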
Here are the results:
hitInSample = 1  # x in the table above; could be 0~5
hitInPop = 13
failInPop = 52-hitInPop  # 39 non-diamonds in a 52-card deck
sampleSize = 5
- Test for under-representation (depletion)
# P[X <= x]: no -1 here, because the lower tail already includes x itself
phyper(hitInSample, hitInPop, failInPop, sampleSize, lower.tail= TRUE);
## [1] 0.6329532
fisher.test(matrix(c(hitInSample, hitInPop-hitInSample, sampleSize-hitInSample, failInPop-sampleSize +hitInSample), 2, 2), alternative='less')$p.value;
## [1] 0.6329532
- Test for over-representation (enrichment)
phyper(hitInSample-1, hitInPop, failInPop, sampleSize, lower.tail= FALSE);
## [1] 0.7784664
fisher.test(matrix(c(hitInSample, hitInPop-hitInSample, sampleSize-hitInSample, failInPop-sampleSize +hitInSample), 2, 2), alternative='greater')$p.value;
## [1] 0.7784664
- Why hitInSample-1 when testing over-representation?
Because if lower.tail is TRUE (the default), the probabilities are
P[X ≤ x]; otherwise, P[X > x]. We subtract 1 from x when P[X ≥ x] is needed.
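The off-by-one is easy to see by summing the point probabilities with dhyper and comparing against phyper with and without the shift. This is only a sketch on the card example; the names upperDirect, upperShifted and upperUnshifted are mine.

```r
hitInSample <- 1
hitInPop    <- 13
failInPop   <- 39
sampleSize  <- 5

# P[X >= x] computed directly: add up P[X = x], P[X = x+1], ..., P[X = sampleSize]
upperDirect <- sum(dhyper(hitInSample:sampleSize, hitInPop, failInPop, sampleSize))

# lower.tail = FALSE returns P[X > q], so q = x - 1 gives P[X >= x]
upperShifted <- phyper(hitInSample - 1, hitInPop, failInPop, sampleSize,
                       lower.tail = FALSE)

# Without the shift we get P[X > x], which leaves out P[X = x]
upperUnshifted <- phyper(hitInSample, hitInPop, failInPop, sampleSize,
                         lower.tail = FALSE)

print(c(direct = upperDirect, shifted = upperShifted, unshifted = upperUnshifted))
```

The direct sum should match the shifted call, while the unshifted call comes out smaller by exactly P[X = x].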
So does fisher.test have any advantage over phyper, given that they give the same p-values?
Yes, fisher.test can do two other jobs: a two-sided test, and
confidence intervals for the odds ratio. Please refer to its manual for
details. For one-sided p-values, there is no difference between them
as long as the correct parameters are used.
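As a rough illustration of those two extra jobs, the sketch below runs fisher.test on the same card table with alternative = "two.sided" and reads the odds ratio estimate and its confidence interval from the result. The dimnames and the 95% level are my own choices for readability, not anything the test requires.

```r
hitInSample <- 1
hitInPop    <- 13
failInPop   <- 39
sampleSize  <- 5

# Same 2x2 table as before, labelled for readability (filled column by column)
tab <- matrix(c(hitInSample, hitInPop - hitInSample,
                sampleSize - hitInSample, failInPop - sampleSize + hitInSample),
              2, 2,
              dimnames = list(c("selected", "left"), c("Diamond", "Non-Diamond")))

res <- fisher.test(tab, alternative = "two.sided", conf.level = 0.95)

res$p.value    # two-sided p-value
res$estimate   # conditional maximum likelihood estimate of the odds ratio
res$conf.int   # 95% confidence interval for the odds ratio
```

phyper has no counterpart for the last two values, so this is where fisher.test is the more convenient choice.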