Statistical Comparisons of Classifiers over Multiple Data Sets, Janez Demšar (Journal of Machine Learning Research 7, 2006)
https://www.jmlr.org/papers/volume7/demsar06a/demsar06a.pdf
References
E. Alpaydın. Combined 5 × 2 F test for comparing supervised classification learning algorithms. Neural Computation, 11:1885–1892, 1999.
J. R. Beck and E. K. Schultz. The use of ROC curves in test performance evaluation. Arch Pathol Lab Med, 110:13–20, 1986.
R. Bellazzi and B. Zupan. Intelligent data analysis in medicine and pharmacology: a position statement. In IDAMAP Workshop Notes at the 13th European Conference on Artificial Intelligence, ECAI-98, Brighton, UK, 1998.
Y. Bengio and Y. Grandvalet. No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research, 5:1089–1105, 2004.
C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
R. R. Bouckaert. Choosing between two learning algorithms based on calibrated tests. In T. Fawcett and N. Mishra, editors, Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA. AAAI Press, 2003.
R. R. Bouckaert. Estimating replicability of classifier learning experiments. In C Brodley, editor, Machine Learning, Proceedings of the Twenty-First International Conference (ICML 2004). AAAI Press, 2004.
R. R. Bouckaert and E. Frank. Evaluating the replicability of significance tests for comparing learning algorithms. In D. Honghua, R. Srikant, and C. Zhang, editors, Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26-28, 2004, Proceedings. Springer, 2004.
P. B. Brazdil and C. Soares. A comparison of ranking methods for classification algorithm selection. In Proceedings of 11th European Conference on Machine Learning. Springer Verlag, 2000.
W. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74:329–336, 1979.
J. Cohen. The earth is round (p < .05). American Psychologist, 49:997–1003, 1994.
J. Demšar and B. Zupan. Orange: From Experimental Machine Learning to Interactive Data Mining, A White Paper. Faculty of Computer and Information Science, Ljubljana, Slovenia, 2004.
T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895–1924, 1998.
O. J. Dunn. Multiple comparisons among means. Journal of the American Statistical Association, 56:52–64, 1961.
C. W. Dunnett. A multiple comparison procedure for comparing several treatments with a control. Journal of the American Statistical Association, 50:1096–1121, 1955.
U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022–1029, Chambery, France, 1993. Morgan-Kaufmann.
R. A. Fisher. Statistical methods and scientific inference (2nd edition). Hafner Publishing Co., New York, 1959.
M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32:675–701, 1937.
M. Friedman. A comparison of alternative tests of significance for the problem of m rankings. Annals of Mathematical Statistics, 11:86–92, 1940.
L. C. Hamilton. Modern Data Analysis: A First Course in Applied Statistics. Wadsworth, Belmont, California, 1990.
L. L. Harlow and S. A. Mulaik, editors. What If There Were No Significance Tests? Lawrence Erlbaum Associates, July 1997.
Y. Hochberg. A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75:800–803, 1988.
B. Holland. On the application of three modified Bonferroni procedures to pairwise multiple comparisons in balanced repeated measures designs. Computational Statistics Quarterly, 6:219–231, 1991.
S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70, 1979.
G. Hommel. A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75:383–386, 1988.
D. A. Hull. Information Retrieval Using Statistical Classification. PhD thesis, Stanford University, November 1994.
R. L. Iman and J. M. Davenport. Approximations of the critical region of the Friedman statistic. Communications in Statistics, pages 571–595, 1980.
P. Langley. Crafting papers on machine learning. In Proc. of Seventeenth International Conference on Machine Learning (ICML-2000), 2000.
D. Mladenić and M. Grobelnik. Feature selection for unbalanced class distribution and naive Bayes. In I. Bratko and S. Džeroski, editors, Machine Learning, Proceedings of the Sixteenth International Conference (ICML 1999), June 27-30, 1999, Bled, Slovenia, pages 258–267. Morgan Kaufmann, 1999.
C. Nadeau and Y. Bengio. Inference for the generalization error. Advances in Neural Information Processing Systems, 12:239–281, 2000.
P. B. Nemenyi. Distribution-free multiple comparisons. PhD thesis, Princeton University, 1963.
J. Pizarro, E. Guerrero, and P. L. Galindo. Multiple comparison procedures applied to model selection. Neurocomputing, 48:155–173, 2002.
F. Provost, T. Fawcett, and R. Kohavi. The case against accuracy estimation for comparing induction algorithms. In J. Shavlik, editor, Proceedings of the Fifteenth International Conference on Machine Learning (ICML-1998), pages 445–453, San Francisco, CA, 1998. Morgan Kaufmann Publishers.
J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. Thirteenth National Conference on Artificial Intelligence, pages 725–730, Portland, OR, 1996. AAAI Press.
S. L. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1:317–328, 1997.
F. L. Schmidt. Statistical significance testing and cumulative knowledge in psychology. Psychological Methods, 1:115–129, 1996.
H. Schütze, D. A. Hull, and J. O. Pedersen. A comparison of classifiers and document representations for the routing problem. In E. A. Fox, P. Ingwersen, and R. Fidel, editors, SIGIR'95, Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 229–237. ACM Press, 1995.
J. P. Shaffer. Multiple hypothesis testing. Annual Review of Psychology, 46:561–584, 1995.
D. J. Sheskin. Handbook of parametric and nonparametric statistical procedures. Chapman & Hall/CRC, 2000.
J. W. Tukey. Comparing individual means in the analysis of variance. Biometrics, 5:99–114, 1949.
E. G. Vázquez, A. Y. Escolano, P. G. Riaño, and J. P. Junquera. Repeated measures multiple comparison procedures applied to model selection in neural networks. In Proc. of the 6th Intl. Conf. On Artificial and Natural Neural Networks (IWANN 2001), pages 88–95, 2001.
G. I. Webb. Multiboosting: A technique for combining boosting and wagging. Machine Learning, 40:159–197, 2000.
F. Wilcoxon. Individual comparisons by ranking methods. Biometrics, 1:80–83, 1945.
J. H. Zar. Biostatistical Analysis (4th Edition). Prentice Hall, Englewood Cliffs, New Jersey, 1998.
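The references above include the two tests Demšar recommends for comparing classifiers over multiple data sets: the Wilcoxon signed-ranks test (Wilcoxon, 1945) for two classifiers, and the Friedman test (Friedman, 1937; 1940) as the omnibus test for three or more. A minimal sketch of both, using `scipy.stats` on synthetic accuracy scores (the numbers are invented for illustration, not results from the paper):

```python
# Sketch of the paper's recommended tests using scipy.stats.
# The accuracy matrix below is synthetic, for illustration only.
import numpy as np
from scipy import stats

# Accuracies of 3 classifiers (columns) on 8 data sets (rows).
rng = np.random.default_rng(0)
acc = np.clip(rng.normal([0.80, 0.82, 0.78], 0.03, size=(8, 3)), 0.0, 1.0)

# Wilcoxon signed-ranks test: compares TWO classifiers over data sets,
# ranking the absolute per-data-set differences.
w_stat, w_p = stats.wilcoxon(acc[:, 0], acc[:, 1])

# Friedman test: omnibus test for THREE OR MORE classifiers, based on
# each classifier's rank within every data set.
f_stat, f_p = stats.friedmanchisquare(acc[:, 0], acc[:, 1], acc[:, 2])

print(f"Wilcoxon p = {w_p:.3f}, Friedman p = {f_p:.3f}")
```

If the Friedman test rejects, the paper follows up with a post-hoc test such as Nemenyi's (Nemenyi, 1963) to decide which pairs of classifiers differ; `scipy` has no built-in Nemenyi test, so that step is omitted here.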
Related Notes
A strategy for summarizing papers and making proposals to AUTOSAR, SDV, and Software Factory.
https://qiita.com/kaizen_nagoya/items/d36a9eba629022276918
AUTOSAR paper on arXiv
https://qiita.com/kaizen_nagoya/items/7c2355275ec53dbdc7a2
AUTOSAR on arXiv reference paper summary
https://qiita.com/kaizen_nagoya/items/aa639a268aaa56c5a501
Yet another strategy for papers on Software Engineering.
https://qiita.com/kaizen_nagoya/items/d186dfcb4d4f7cd1dcf1