Impact of sample size on principal component analysis ordination of an environmental data set: effects on eigenstructure

S. Shahid Shaukat 1, Toqeer Ahmed Rao 2 and Moazzam A. Khan 1
  • 1 Institute of Environmental Studies, University of Karachi, Karachi-75270, Pakistan
  • 2 Department of Botany, Federal Urdu University of Arts, Sciences & Technology, Karachi-75300, Pakistan

Abstract

In this study, we used bootstrap simulation of a real data set to investigate the impact of sample size (N = 20, 30, 40 and 50) on the eigenvalues and eigenvectors resulting from principal component analysis (PCA). For each sample size, 100 bootstrap samples were drawn from an environmental data matrix of water quality variables (p = 22) belonging to a small data set comprising 55 samples (the stations from which water samples were collected). Because data sets in ecology and environmental sciences are invariably small, owing to the high cost of collecting and analysing samples, we restricted our study to relatively small sample sizes. We focused on comparing the first 6 eigenvectors and the first 10 eigenvalues. The data sets were compared by agglomerative cluster analysis with Ward’s method, which does not require any stringent distributional assumptions.
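The resampling design described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the 55 × 22 matrix is filled with synthetic random numbers as a stand-in for the water quality data, PCA is run on the correlation matrix (the abstract does not say whether a correlation or covariance matrix was used), and the function name `bootstrap_pca_eigenvalues` and its parameters are hypothetical. The Ward's clustering step is not shown.

```python
import numpy as np


def bootstrap_pca_eigenvalues(data, sample_size, n_boot=100, n_keep=10, seed=0):
    """Return an (n_boot, n_keep) array of leading PCA eigenvalues.

    Hypothetical helper: each row holds the n_keep largest eigenvalues
    obtained from one bootstrap sample of `sample_size` rows.
    """
    rng = np.random.default_rng(seed)
    n_rows = data.shape[0]
    leading = np.empty((n_boot, n_keep))
    for b in range(n_boot):
        # Ordinary bootstrap: draw row indices with replacement.
        idx = rng.integers(0, n_rows, size=sample_size)
        sample = data[idx]
        # Assumption: PCA on the correlation matrix of the variables.
        corr = np.corrcoef(sample, rowvar=False)
        # eigvalsh returns eigenvalues in ascending order; reverse them.
        eigvals = np.linalg.eigvalsh(corr)[::-1]
        leading[b] = eigvals[:n_keep]
    return leading


# Synthetic stand-in for the real data: 55 stations, 22 variables.
rng = np.random.default_rng(42)
data = rng.normal(size=(55, 22))

# One set of 100 bootstrap replicates per sample size from the abstract.
results = {n: bootstrap_pca_eigenvalues(data, n) for n in (20, 30, 40, 50)}

for n, eigvals in results.items():
    # Mean and spread of the first eigenvalue across bootstrap replicates.
    print(n, eigvals[:, 0].mean().round(3), eigvals[:, 0].std().round(3))
```

With real data, the spread of each eigenvalue across replicates could then be compared between sample sizes. Note that for N = 20 the sample has fewer rows than variables, so the correlation matrix is rank deficient and its trailing eigenvalues are effectively zero.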


