Impact of sample size on principal component analysis ordination of an environmental data set: effects on eigenstructure

Open access

Abstract

In this study, we used bootstrap simulation of a real data set to investigate the impact of sample size (N = 20, 30, 40 and 50) on the eigenvalues and eigenvectors resulting from principal component analysis (PCA). For each sample size, 100 bootstrap samples were drawn from an environmental data matrix of water quality variables (p = 22) comprising 55 samples (stations from which water samples were collected). Because data sets in ecology and environmental sciences are invariably small owing to the high cost of collecting and analysing samples, we restricted our study to relatively small sample sizes. We focused on comparing the first 6 eigenvectors and the first 10 eigenvalues. Data sets were compared using agglomerative cluster analysis with Ward’s method, which does not require any stringent distributional assumptions.
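The resampling design described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes PCA on the correlation matrix (the abstract does not say whether the correlation or the covariance matrix was used), and it uses a synthetic 55 × 22 matrix in place of the water quality data. The function name `bootstrap_eigenvalues` and all its parameters are hypothetical, and the Ward clustering step is not shown.

```python
import numpy as np


def bootstrap_eigenvalues(data, sample_size, n_boot=100, n_keep=10, seed=0):
    """Leading correlation-matrix eigenvalues for n_boot bootstrap samples.

    Returns an array of shape (n_boot, n_keep), each row sorted in
    descending order. Hypothetical helper, for illustration only.
    """
    rng = np.random.default_rng(seed)
    n_rows = data.shape[0]
    out = np.empty((n_boot, n_keep))
    for b in range(n_boot):
        # Draw sample_size rows (stations) with replacement.
        idx = rng.integers(0, n_rows, size=sample_size)
        # Assumption: PCA on the correlation matrix of the variables.
        corr = np.corrcoef(data[idx], rowvar=False)
        # eigvalsh returns ascending eigenvalues; reverse to descending.
        eigvals = np.linalg.eigvalsh(corr)[::-1]
        out[b] = eigvals[:n_keep]
    return out


# Synthetic stand-in for the 55 stations x 22 variables matrix (assumption).
rng = np.random.default_rng(42)
latent = rng.normal(size=(55, 3))
loadings = rng.normal(size=(3, 22))
data = latent @ loadings + rng.normal(size=(55, 22))

# One set of 100 bootstrap replicates per sample size, as in the abstract.
results = {n: bootstrap_eigenvalues(data, n) for n in (20, 30, 40, 50)}

for n, eig in results.items():
    mean_first = eig[:, 0].mean()
    sd_first = eig[:, 0].std(ddof=1)
    print(f"N={n}: first eigenvalue mean={mean_first:.3f}, sd={sd_first:.3f}")
```

Comparing the spread of each eigenvalue across the four sample sizes gives a rough picture of how stable the eigenstructure is at each N. The same loop could collect eigenvectors with `np.linalg.eigh`, although their signs are arbitrary and would need aligning before any comparison.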
