Many experiments in the life sciences yield data arranged in a two-way contingency table. In some cases a categorized trait is in fact determined by an unobservable continuous variable, called the liability. In this paper a known measure of dependency for such data, based on the entropy concept, is applied. It is of interest to know how the Pearson correlation coefficient of the underlying continuous variables relates to the entropy-based measure of the corresponding dependence in the categorized data. Using many simulation trials, a linear regression was estimated between the Pearson correlation coefficient and the normalized mutual information (both on a logarithmic scale). The estimated regression coefficients did not depend on either the number of observations classified on the categorical scale or the continuous distribution used for the latent variable, but they were influenced by the number of columns in the contingency table.
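As an illustration only, the kind of simulation described above could be sketched as follows. This is a minimal sketch, not the procedure used in the paper: it assumes bivariate normal latent variables, fixed hypothetical thresholds for categorization, and mutual information normalized by the smaller of the two marginal entropies. The function names (`normalized_mutual_information`, `simulate_table`) and all parameter values are assumptions introduced for this example.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector; zero cells are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def normalized_mutual_information(table):
    """Normalized mutual information of a two-way table given as rows of counts.

    Assumption: mutual information is divided by min(H(rows), H(cols));
    the paper may use a different normalization.
    """
    n = sum(sum(row) for row in table)
    joint = [[c / n for c in row] for row in table]
    row_p = [sum(row) for row in joint]
    col_p = [sum(col) for col in zip(*joint)]
    h_rows, h_cols = entropy(row_p), entropy(col_p)
    h_joint = entropy([p for row in joint for p in row])
    mi = h_rows + h_cols - h_joint
    denom = min(h_rows, h_cols)
    return mi / denom if denom > 0 else 0.0


def simulate_table(rho, n, cuts_x, cuts_y, rng):
    """Draw n standard-normal pairs with correlation rho and cross-classify them.

    cuts_x and cuts_y are hypothetical thresholds on the latent scale.
    """
    table = [[0] * (len(cuts_y) + 1) for _ in range(len(cuts_x) + 1)]
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = z1
        y = rho * z1 + math.sqrt(1 - rho ** 2) * z2
        i = sum(x > c for c in cuts_x)  # row category of x
        j = sum(y > c for c in cuts_y)  # column category of y
        table[i][j] += 1
    return table


rng = random.Random(0)
for rho in (0.2, 0.5, 0.8):
    t = simulate_table(rho, 5000, [0.0], [-0.5, 0.5], rng)
    print(rho, round(normalized_mutual_information(t), 4))
```

Repeating such trials over a grid of correlations and table sizes, and then regressing the logarithm of the normalized mutual information on the logarithm of the correlation, would give regression coefficients of the kind discussed above. The thresholds here produce a 2 x 3 table; changing the length of `cuts_y` changes the number of columns.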