From Wikipedia, the free encyclopedia
An example of coincidence produced by data dredging (uncorrected multiple comparisons) showing a correlation between the number of letters in a spelling bee's winning word and the number of people in the United States killed by venomous spiders. Given a large enough pool of variables for the same time period, it is possible to find a pair of graphs that show a spurious correlation.
The multiple comparisons, multiplicity or multiple testing problem occurs in statistics when one considers a set of statistical inferences simultaneously[1] or estimates a subset of parameters selected based on the observed values.[2]

The larger the number of inferences made, the more likely erroneous inferences become. Several statistical techniques have been developed to address this problem, for example, by requiring a stricter significance threshold for individual comparisons, so as to compensate for the number of inferences being made. Methods that control the family-wise error rate bound the probability of making at least one false positive across the whole set of comparisons.

History

The problem of multiple comparisons received increased attention in the 1950s with the work of statisticians such as Tukey and Scheffé. Over the ensuing decades, many procedures were developed to address the problem. In 1996, the first international conference on multiple comparison procedures took place in Tel Aviv.[3] This remains an active research area, with work being done by, for example, Emmanuel Candès and Vladimir Vovk.

Definition

Production of a small p-value by multiple testing.
30 samples of 10 dots of random color (blue or red) are observed. On each sample, a two-tailed binomial test of the null hypothesis that blue and red are equally probable is performed. The first row shows the possible p-values as a function of the number of blue and red dots in the sample.
Although the 30 samples were all simulated under the null, one of the resulting p-values is small enough to produce a false rejection at the typical level 0.05 in the absence of correction.
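
The setup in this caption can be checked exactly rather than by simulation. The following is a minimal sketch, assuming the two-tailed p-value is computed by summing the null probabilities of all counts at least as far from 5 as the observed count; the function and variable names are illustrative and not taken from the figure.

```python
from math import comb


def two_tailed_binomial_p(k: int, n: int) -> float:
    """Exact two-tailed p-value for k blue dots out of n when blue and red are equally likely."""
    # Under the fair null the distribution is symmetric around n/2, so the
    # p-value sums every outcome at least as far from n/2 as the observed k.
    distance = abs(k - n / 2)
    extreme = sum(comb(n, j) for j in range(n + 1) if abs(j - n / 2) >= distance)
    return extreme / 2 ** n


n_dots, n_samples, level = 10, 30, 0.05

# Possible p-values, indexed by the number of blue dots (0..10).
p_values = [two_tailed_binomial_p(k, n_dots) for k in range(n_dots + 1)]

# Probability that a single null sample is (falsely) rejected at the given level.
reject_prob = sum(
    comb(n_dots, k) for k in range(n_dots + 1) if p_values[k] <= level
) / 2 ** n_dots

# Probability that at least one of the 30 independent null samples is rejected.
any_rejection = 1 - (1 - reject_prob) ** n_samples
```

With 10 dots per sample, only the outcomes with 0, 1, 9 or 10 blue dots give a p-value below 0.05, so each sample is rejected with probability 22/1024, and the chance of at least one false rejection among 30 samples comes out at roughly 0.48.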

Multiple comparisons arise when a statistical analysis involves multiple simultaneous statistical tests, each of which has a potential to produce a "discovery". A stated confidence level generally applies only to each test considered individually, but often it is desirable to have a confidence level for the whole family of simultaneous tests.[4] Failure to compensate for multiple comparisons can have important real-world consequences, as illustrated by the following examples:

  • Suppose the treatment is a new way of teaching writing to students, and the control is the standard way of teaching writing. Students in the two groups can be compared in terms of grammar, spelling, organization, content, and so on. As more attributes are compared, it becomes increasingly likely that the treatment and control groups will appear to differ on at least one attribute due to random sampling error alone.
  • Suppose we consider the efficacy of a drug in terms of the reduction of any one of a number of disease symptoms. As more symptoms are considered, it becomes increasingly likely that the drug will appear to be an improvement over existing drugs in terms of at least one symptom.

In both examples, as the number of comparisons increases, it becomes more likely that the groups being compared will appear to differ in terms of at least one attribute. Our confidence that a result will generalize to independent data should generally be weaker if it is observed as part of an analysis that involves multiple comparisons, rather than an analysis that involves only a single comparison.

For example, if one test is performed at the 5% level and the corresponding null hypothesis is true, there is only a 5% risk of incorrectly rejecting the null hypothesis. However, if 100 tests are each conducted at the 5% level and all corresponding null hypotheses are true, the expected number of incorrect rejections (also known as false positives or Type I errors) is 5. If the tests are statistically independent from each other (i.e. are performed on independent samples), the probability of at least one incorrect rejection is approximately 99.4%.
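
The arithmetic in this paragraph can be reproduced with a short sketch. The helper names below are illustrative, and the second function assumes the tests are independent, as the paragraph does.

```python
def expected_false_positives(m: int, alpha: float) -> float:
    """Expected number of incorrect rejections when all m null hypotheses are true."""
    return m * alpha


def prob_at_least_one_false_positive(m: int, alpha: float) -> float:
    """Probability of one or more incorrect rejections among m independent tests."""
    # Each test avoids a false rejection with probability (1 - alpha);
    # independence lets these probabilities multiply.
    return 1 - (1 - alpha) ** m


expected = expected_false_positives(100, 0.05)           # 5 false positives on average
at_least_one = prob_at_least_one_false_positive(100, 0.05)  # about 0.994
```

The same two functions apply unchanged to the confidence-interval version of the problem, with alpha read as the non-coverage probability of a single interval.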

The multiple comparisons problem also applies to confidence intervals. A single confidence interval with a 95% coverage probability level will contain the true value of the parameter in 95% of samples. However, if one considers 100 confidence intervals simultaneously, each with 95% coverage probability, the expected number of non-covering intervals is 5. If the intervals are statistically independent from each other, the probability that at least one interval does not contain the population parameter is 99.4%.

Techniques have been developed to prevent the inflation of false positive rates and non-coverage rates that occur with multiple statistical tests.

Classification of multiple hypothesis tests

The following table defines the possible outcomes when testing multiple null hypotheses. Suppose we have a number m of null hypotheses, denoted by: H1, H2, ..., Hm. Using a statistical test, we reject the null hypothesis if the test is declared significant. We do not reject the null hypothesis if the test is non-significant. Summing each type of outcome over all Hi yields the following random variables:

                                  Null hypothesis is true (H0)   Alternative hypothesis is true (HA)   Total
Test is declared significant      V                              S                                     R
Test is declared non-significant  U                              T                                     m − R
Total                             m0                             m − m0                                m

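In m hypothesis tests of which m0 are true null hypotheses, R is an observable random variable, and S, T, U, and V are unobservable random variables. Here V counts false positives (Type I errors), S true positives, U true negatives, and T false negatives (Type II errors), so R = V + S is the total number of rejections.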
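
In a simulation, where the truth of each null hypothesis is known, these counts can be tallied directly. The sketch below is illustrative only: the `Outcomes` type and the `tally` function are hypothetical names, and the ground-truth labels they take as input are exactly what is unavailable in a real analysis.

```python
from typing import NamedTuple, Sequence


class Outcomes(NamedTuple):
    V: int  # true nulls declared significant (false positives)
    S: int  # false nulls declared significant (true positives)
    U: int  # true nulls declared non-significant (true negatives)
    T: int  # false nulls declared non-significant (false negatives)

    @property
    def R(self) -> int:
        """Total number of rejections, the only observable count."""
        return self.V + self.S

    @property
    def m(self) -> int:
        """Total number of hypotheses tested."""
        return self.V + self.S + self.U + self.T


def tally(null_is_true: Sequence[bool], rejected: Sequence[bool]) -> Outcomes:
    """Cross-tabulate ground truth against test decisions."""
    v = s = u = t = 0
    for is_null, is_rejected in zip(null_is_true, rejected):
        if is_rejected and is_null:
            v += 1
        elif is_rejected:
            s += 1
        elif is_null:
            u += 1
        else:
            t += 1
    return Outcomes(v, s, u, t)


outcome = tally(
    null_is_true=[True, True, False, False, True],
    rejected=[True, False, True, False, False],
)
```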

Controlling procedures

Probability that at least one null hypothesis is wrongly rejected, for a given per-comparison significance level, as a function of the number of independent tests m.

Multiple testing correction

Multiple testing correction refers to making statistical tests more stringent in order to counteract the problem of multiple testing. The best known such adjustment is the Bonferroni correction, but other methods have been developed. Such methods are typically designed to control the family-wise error rate or the false discovery rate.

If m independent comparisons are performed, the family-wise error rate (FWER) is given by

$$\bar{\alpha} = 1 - \left(1 - \alpha_{\text{per comparison}}\right)^{m}.$$

Hence, unless the tests are perfectly positively dependent (i.e., identical), $\bar{\alpha}$ increases as the number of comparisons increases. If we do not assume that the comparisons are independent, then we can still say:

$$\bar{\alpha} \le m \cdot \alpha_{\text{per comparison}},$$

which follows from Boole's inequality. Example: $0.2649 = 1 - (1 - 0.05)^{6} \le 0.05 \times 6 = 0.3$.

There are different ways to assure that the family-wise error rate is at most $\alpha$. The most conservative method, which is free of dependence and distributional assumptions, is the Bonferroni correction $\alpha_{\text{per comparison}} = \alpha / m$. A marginally less conservative correction can be obtained by solving the equation for the family-wise error rate of $m$ independent comparisons for $\alpha_{\text{per comparison}}$. This yields $\alpha_{\text{per comparison}} = 1 - (1 - \alpha)^{1/m}$, which is known as the Šidák correction. Another procedure is the Holm–Bonferroni method, which uniformly delivers more power than the simple Bonferroni correction, by testing only the lowest p-value ($i = 1$) against the strictest criterion, and the higher p-values ($i > 1$) against progressively less strict criteria: $\alpha_{\text{per comparison}} = \alpha / (m - i + 1)$.[5]
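
The three corrections named above can be sketched in a few lines. This is a minimal illustration using the standard textbook forms (Bonferroni level α/m, Šidák level 1 − (1 − α)^(1/m), and Holm's step-down thresholds α/(m − i + 1) for the i-th smallest p-value); the function names are illustrative and not drawn from any particular library.

```python
from typing import List, Sequence


def bonferroni_level(alpha: float, m: int) -> float:
    """Per-comparison level that keeps the FWER at most alpha under any dependence."""
    return alpha / m


def sidak_level(alpha: float, m: int) -> float:
    """Per-comparison level that gives FWER exactly alpha for m independent tests."""
    return 1 - (1 - alpha) ** (1 / m)


def holm_reject(p_values: Sequence[float], alpha: float) -> List[bool]:
    """Holm-Bonferroni step-down procedure; returns one decision per input p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank 0 is the smallest p-value (i = 1)
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one hypothesis is retained, all larger p-values are retained
    return reject


p = [0.01, 0.04, 0.02, 0.005]
bonferroni_decisions = [pv <= bonferroni_level(0.05, len(p)) for pv in p]
holm_decisions = holm_reject(p, 0.05)
```

On these four illustrative p-values, Bonferroni rejects only the two smallest, while Holm rejects all four, because each successive threshold is relaxed after an earlier rejection.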

For continuous problems, one can employ Bayesian logic to compute m from the prior-to-posterior volume ratio. Continuous generalizations of the Bonferroni and Šidák corrections are presented by Bayer and Seljak.[6]

Large-scale multiple testing

Traditional methods for multiple comparisons adjustments focus on correcting for modest numbers of comparisons, often in an analysis of variance. A different set of techniques has been developed for "large-scale multiple testing", in which thousands or even greater numbers of tests are performed. For example, in genomics, when using technologies such as microarrays, expression levels of tens of thousands of genes can be measured, and genotypes for millions of genetic markers can be measured. Particularly in the field of genetic association studies, there has been a serious problem with non-replication, in which a result is strongly statistically significant in one study but fails to be replicated in a follow-up study. Such non-replication can have many causes, but it is widely considered that failure to fully account for the consequences of making multiple comparisons is one of them.[7] It has been argued that advances in measurement and information technology have made it far easier to generate large datasets for exploratory analysis, often leading to the testing of large numbers of hypotheses with no prior basis for expecting many of the hypotheses to be true. In this situation, very high false positive rates are expected unless multiple comparisons adjustments are made.

For large-scale testing problems where the goal is to provide definitive results, the family-wise error rate remains the most accepted parameter for ascribing significance levels to statistical tests. Alternatively, if a study is viewed as exploratory, or if significant results can be easily re-tested in an independent study, control of the false discovery rate (FDR)[8][9][10] is often preferred. The FDR, loosely defined as the expected proportion of false positives among all significant tests, allows researchers to identify a set of "candidate positives" that can be more rigorously evaluated in a follow-up study.[11]
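
One widely used way to control the FDR is the step-up procedure of Benjamini and Hochberg (reference 8). The sketch below is a minimal illustration of that procedure, assuming independent (or suitably positively dependent) tests; the function name and the target level `q` are illustrative.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float) -> List[bool]:
    """Benjamini-Hochberg step-up procedure; returns one decision per input p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value lies at or below the line k * q / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank

    # Reject the k smallest p-values, including any that individually
    # exceed their own threshold but sit below rank k.
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


candidates = benjamini_hochberg([0.039, 0.001, 0.035, 0.5, 0.012], q=0.05)
```

In this illustrative input the p-value 0.035 exceeds its own threshold (3 × 0.05 / 5 = 0.03) but is still rejected, because the next p-value, 0.039, falls below its threshold of 0.04. That step-up behaviour is what distinguishes the procedure from a step-down method such as Holm's.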

The practice of trying many unadjusted comparisons in the hope of finding a significant one, whether applied unintentionally or deliberately, is a known problem that is sometimes called "p-hacking".[12][13]

Assessing whether any alternative hypotheses are true

A normal quantile plot for a simulated set of test statistics that have been standardized to be Z-scores under the null hypothesis. The departure of the upper tail of the distribution from the expected trend along the diagonal is due to the presence of substantially more large test statistic values than would be expected if all null hypotheses were true. The red point corresponds to the fourth largest observed test statistic, which is 3.13, versus an expected value of 2.06. The blue point corresponds to the fifth smallest test statistic, which is -1.75, versus an expected value of -1.96. The graph suggests that it is unlikely that all the null hypotheses are true, and that most or all instances of a true alternative hypothesis result from deviations in the positive direction.

A basic question faced at the outset of analyzing a large set of testing results is whether there is evidence that any of the alternative hypotheses are true. One simple meta-test that can be applied when it is assumed that the tests are independent of each other is to use the Poisson distribution as a model for the number of significant results at a given level α that would be found when all null hypotheses are true.[citation needed] If the observed number of positives is substantially greater than what should be expected, this suggests that there are likely to be some true positives among the significant results.

For example, if 1000 independent tests are performed, each at level α = 0.05, we expect 0.05 × 1000 = 50 significant tests to occur when all null hypotheses are true. Based on the Poisson distribution with mean 50, the probability of observing more than 62 significant tests is less than 0.05, so if more than 62 significant results are observed, it is very likely that some of them correspond to situations where the alternative hypothesis holds. A drawback of this approach is that it overstates the evidence that some of the alternative hypotheses are true when the test statistics are positively correlated, which commonly occurs in practice.[citation needed] On the other hand, the approach remains valid even in the presence of correlation among the test statistics, as long as the Poisson distribution can be shown to provide a good approximation for the number of significant results. This scenario arises, for instance, when mining significant frequent itemsets from transactional datasets. Furthermore, a careful two-stage analysis can bound the FDR at a pre-specified level.[14]
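
The tail probability for any candidate cutoff can be computed directly from the Poisson probability mass function, which makes it easy to check where the upper tail falls below the chosen level. The sketch below uses only the standard library; `poisson_upper_tail` is an illustrative name, and the Poisson model itself is the assumption stated in the text.

```python
from math import exp


def poisson_upper_tail(mean: float, cutoff: int) -> float:
    """P(X > cutoff) for X ~ Poisson(mean), with cutoff a non-negative integer."""
    term = exp(-mean)  # P(X = 0)
    cdf = term
    for k in range(1, cutoff + 1):
        term *= mean / k  # recurrence: P(X = k) = P(X = k - 1) * mean / k
        cdf += term
    return 1 - cdf


n_tests, alpha = 1000, 0.05
expected_positives = n_tests * alpha  # mean of the Poisson model under the global null

# Tail probabilities for a few candidate cutoffs around the expected count.
tails = {c: poisson_upper_tail(expected_positives, c) for c in range(58, 66)}
```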

Another common approach that can be used in situations where the test statistics can be standardized to Z-scores is to make a normal quantile plot of the test statistics. If the observed quantiles are markedly more dispersed than the normal quantiles, this suggests that some of the significant results may be true positives.[citation needed]
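
The coordinates of such a plot can be computed without a plotting library. The sketch below pairs each sorted test statistic with the standard normal quantile expected at its rank; the plotting positions (i − 0.5)/n are one common convention and are an assumption here, as are the function names.

```python
from statistics import NormalDist
from typing import List, Sequence, Tuple


def expected_normal_quantiles(n: int) -> List[float]:
    """Standard normal quantiles at the plotting positions (i - 0.5) / n, i = 1..n."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def quantile_pairs(z_scores: Sequence[float]) -> List[Tuple[float, float]]:
    """(expected, observed) pairs; points far above the diagonal in the upper tail
    indicate more large statistics than the global null would produce."""
    observed = sorted(z_scores)
    return list(zip(expected_normal_quantiles(len(observed)), observed))


pairs = quantile_pairs([0.3, -1.2, 2.9, 0.8, -0.4, 3.4, 1.1])
```

Plotting the pairs as a scatter plot with the line y = x as a reference gives the normal quantile plot described above.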

References

  1. ^ Miller, R.G. (1981). Simultaneous Statistical Inference 2nd Ed. Springer Verlag New York. ISBN 978-0-387-90548-8.
  2. ^ Benjamini, Y. (2010). "Simultaneous and selective inference: Current successes and future challenges". Biometrical Journal. 52 (6): 708–721. doi:10.1002/bimj.200900299. PMID 21154895. S2CID 8806192.
  3. ^ "Home". mcp-conference.org.
  4. ^ Kutner, Michael; Nachtsheim, Christopher; Neter, John; Li, William (2005). Applied Linear Statistical Models. McGraw-Hill Irwin. pp. 744–745. ISBN 9780072386882.
  5. ^ Aickin, M; Gensler, H (May 1996). "Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods". Am J Public Health. 86 (5): 726–728. doi:10.2105/ajph.86.5.726. PMC 1380484. PMID 8629727.
  6. ^ Bayer, Adrian E.; Seljak, Uroš (2020). "The look-elsewhere effect from a unified Bayesian and frequentist perspective". Journal of Cosmology and Astroparticle Physics. 2020 (10): 009. arXiv:2007.13821. Bibcode:2020JCAP...10..009B. doi:10.1088/1475-7516/2020/10/009. S2CID 220830693.
  7. ^ Qu, Hui-Qi; Tien, Matthew; Polychronakos, Constantin (2010). "Statistical significance in genetic association studies". Clinical and Investigative Medicine. 33 (5): E266–E270. ISSN 0147-958X. PMC 3270946. PMID 20926032.
  8. ^ Benjamini, Yoav; Hochberg, Yosef (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing". Journal of the Royal Statistical Society, Series B. 57 (1): 125–133. JSTOR 2346101.
  9. ^ Storey, JD; Tibshirani, Robert (2003). "Statistical significance for genome-wide studies". PNAS. 100 (16): 9440–9445. Bibcode:2003PNAS..100.9440S. doi:10.1073/pnas.1530509100. JSTOR 3144228. PMC 170937. PMID 12883005.
  10. ^ Efron, Bradley; Tibshirani, Robert; Storey, John D.; Tusher, Virginia (2001). "Empirical Bayes analysis of a microarray experiment". Journal of the American Statistical Association. 96 (456): 1151–1160. doi:10.1198/016214501753382129. JSTOR 3085878. S2CID 9076863.
  11. ^ Noble, William S. (2009). "How does multiple testing correction work?". Nature Biotechnology. 27 (12): 1135–1137. doi:10.1038/nbt1209-1135. ISSN 1087-0156. PMC 2907892. PMID 20010596.
  12. ^ Young, S. S.; Karr, A. (2011). "Deming, data and observational studies" (PDF). Significance. 8 (3): 116–120. doi:10.1111/j.1740-9713.2011.00506.x.
  13. ^ Smith, G. D.; Shah, E. (2002). "Data dredging, bias, or confounding". BMJ. 325 (7378): 1437–1438. doi:10.1136/bmj.325.7378.1437. PMC 1124898. PMID 12493654.
  14. ^ Kirsch, A; Mitzenmacher, M; Pietracaprina, A; Pucci, G; Upfal, E; Vandin, F (June 2012). "An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets". Journal of the ACM. 59 (3): 12:1–12:22. arXiv:1002.1104. doi:10.1145/2220357.2220359.

Further reading

  • F. Bretz, T. Hothorn, P. Westfall (2010), Multiple Comparisons Using R, CRC Press
  • S. Dudoit and M. J. van der Laan (2008), Multiple Testing Procedures with Application to Genomics, Springer
  • Farcomeni, A. (2008). "A Review of Modern Multiple Hypothesis Testing, with particular attention to the false discovery proportion". Statistical Methods in Medical Research. 17 (4): 347–388. doi:10.1177/0962280206079046. hdl:11573/142139. PMID 17698936. S2CID 12777404.
  • Phipson, B.; Smyth, G. K. (2010). "Permutation P-values Should Never Be Zero: Calculating Exact P-values when Permutations are Randomly Drawn". Statistical Applications in Genetics and Molecular Biology. 9: Article39. arXiv:1603.05766. doi:10.2202/1544-6115.1585. PMID 21044043. S2CID 10735784.
  • P. H. Westfall and S. S. Young (1993), Resampling-based Multiple Testing: Examples and Methods for p-Value Adjustment, Wiley
  • P. Westfall, R. Tobias, R. Wolfinger (2011) Multiple comparisons and multiple testing using SAS, 2nd edn, SAS Institute
  • A gallery of examples of implausible correlations sourced by data dredging
  • An xkcd comic about the multiple comparisons problem, using jelly beans and acne as an example