This study focuses on the progressive vs. non-progressive alternation to revisit the debate on the ENL-ESL-EFL continuum (i.e. whether native (ENL) and non-native (ESL/EFL) Englishes are dichotomous types of English or form a gradient continuum). While progressive marking is traditionally studied independently of its unmarked counterpart, we examine (i) how the grammatical contexts of both constructions systematically affect speakers’ constructional choices in ENL (American, British), ESL (Indian, Nigerian and Singaporean) and EFL (Finnish, French and Polish learner Englishes) and (ii) what light speakers’ varying constructional choices shed on the continuum debate. Methodologically, we use a clustering technique to group individual varieties of English together (i.e. to identify similarities and differences between those varieties) on the basis of linguistic contextual features such as AKTIONSART, ANIMACY, SEMANTIC DOMAIN (of the aspect-bearing lexical verb), TENSE, MODALITY and VOICE, in order to assess the validity of the ENL-ESL-EFL classification for our data. We then conduct a logistic regression analysis (based on lemmas attested in both progressive and non-progressive constructions) to explore how grammatical contexts influence speakers’ constructional choices differently across English types. While, overall, our cluster analysis supports the ENL-ESL-EFL classification as a useful theoretical framework for exploring cross-variety variation, the regression shows that, once we dig into the specific linguistic contexts of (non-)progressive constructions, this classification is not uniformly reflected in the data. Ultimately, by incorporating more than one statistical technique into their exploration of the continuum, scholars can avoid potential methodological biases.
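The two-step design described above can be sketched in code. The following is a minimal illustration with entirely synthetic data (the variety profiles, feature names and numbers are invented for the example and are not the study’s data): varieties are first clustered hierarchically by their feature profiles, and a logistic regression then models the progressive vs. non-progressive choice as a function of one contextual predictor.

```python
# Minimal sketch with SYNTHETIC data: cluster varieties by (invented)
# feature profiles, then model the (non-)progressive choice with a
# logistic regression on one illustrative contextual predictor.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical per-variety proportions of, e.g., durative Aktionsart,
# animate subjects and present tense with the progressive (columns).
varieties = ["AmE", "BrE", "IndE", "NigE", "SingE", "FinE", "FrE", "PolE"]
profiles = np.array([
    [0.62, 0.55, 0.48],  # AmE   (ENL)
    [0.60, 0.54, 0.47],  # BrE   (ENL)
    [0.71, 0.50, 0.58],  # IndE  (ESL)
    [0.70, 0.49, 0.57],  # NigE  (ESL)
    [0.69, 0.51, 0.59],  # SingE (ESL)
    [0.55, 0.44, 0.40],  # FinE  (EFL)
    [0.54, 0.43, 0.41],  # FrE   (EFL)
    [0.53, 0.45, 0.39],  # PolE  (EFL)
])

# Agglomerative clustering (Ward linkage) over the variety profiles.
Z = linkage(profiles, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")
print(dict(zip(varieties, labels)))

# Logistic regression: does a durative-Aktionsart context favour the
# progressive? (Effect sizes below are made up for the sketch.)
X = rng.integers(0, 2, size=(500, 1)).astype(float)  # 1 = durative context
p = np.where(X[:, 0] == 1, 0.8, 0.3)                 # progressive likelier when durative
y = rng.binomial(1, p)                               # 1 = progressive chosen
model = LogisticRegression().fit(X, y)
print("coefficient for durative context:", model.coef_[0][0])  # expect positive
```

With profiles this cleanly separated, the three-cluster solution recovers the ENL/ESL/EFL grouping; with real corpus counts the interest lies precisely in whether it does.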
Corpus-based studies of learner language and (especially) English varieties have become more quantitative in nature and increasingly use regression-based methods and classifiers such as classification trees, random forests, etc. One recent development that is seeing wider use is the MuPDAR (Multifactorial Prediction and Deviation Analysis using Regressions) approach of Gries and Deshors (2014) and Gries and Adelman (2014). This approach attempts to improve on traditional regression- or tree-based approaches by, first, training a model on the reference speakers (often native speakers (NS) in learner corpus studies, or British English speakers in variety studies) and, second, using this model to predict what such a reference speaker would produce in the situation the target speaker is in (the target speakers often being non-native speakers (NNS) or indigenized-variety speakers). Crucially, the third step then consists of determining whether the target speakers made the canonical choice or not, and of exploring that variability with a second regression model or classifier.
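The three MuPDAR steps can be sketched as follows. This is a toy illustration on synthetic data, not a reimplementation of the cited studies; the single contextual predictor and the `proficiency` variable in the deviation model are hypothetical placeholders.

```python
# Toy sketch of the MuPDAR logic on SYNTHETIC data; predictor names are
# hypothetical illustrations, not those of any published study.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Step 1: train a model of the alternation on the reference (NS) speakers.
ns_X = rng.integers(0, 2, size=(800, 1)).astype(float)       # one contextual predictor
ns_y = rng.binomial(1, np.where(ns_X[:, 0] == 1, 0.85, 0.2))  # NS choice (1 vs. 0)
ns_model = LogisticRegression().fit(ns_X, ns_y)

# Step 2: predict what a NS would produce in each target (NNS) speaker's context.
nns_X = rng.integers(0, 2, size=(400, 1)).astype(float)
nns_y = rng.binomial(1, np.where(nns_X[:, 0] == 1, 0.6, 0.4))  # NNS less categorical
predicted_ns_choice = ns_model.predict(nns_X)

# Step 3: did the target speaker make the choice the reference model predicts?
# Model that (non-)canonicity with a second regression.
canonical = (nns_y == predicted_ns_choice).astype(int)
proficiency = rng.integers(0, 2, size=(400, 1)).astype(float)  # hypothetical L2 predictor
dev_model = LogisticRegression().fit(proficiency, canonical)
print("rate of canonical choices:", canonical.mean())
```

In an actual MuPDAR analysis both models would of course be multifactorial; the sketch only makes the train-predict-compare logic of the three steps concrete.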
Both regression-based modeling in general and MuPDAR in particular have led to many interesting results, but we want to propose two changes in perspective on the results they produce. First, we want to focus attention on the middle ground of the prediction space, i.e. the predictions of a regression/classifier that are essentially made non-confidently and translate into a statement such as ‘in this context, both/all alternants would be fine’. Second, we want to argue for greater attention to misclassifications/mispredictions, propose a method to identify them, and discuss what we can learn from studying them. We exemplify our two suggestions with a brief case study of the dative alternation in native and learner corpus data.
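Both suggestions amount to looking at predicted probabilities rather than only at hard classifications. A minimal sketch on synthetic data (the 0.4–0.6 band is an illustrative assumption, not a proposed threshold):

```python
# Sketch: flag the "middle ground" of the prediction space (probabilities
# near 0.5) and the misclassified cases. Data and band are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2))
true_p = 1 / (1 + np.exp(-1.5 * X[:, 0]))   # only the first predictor matters
y = rng.binomial(1, true_p)
model = LogisticRegression().fit(X, y)

probs = model.predict_proba(X)[:, 1]
middle_ground = (probs > 0.4) & (probs < 0.6)     # 'either alternant would be fine'
misclassified = (probs >= 0.5).astype(int) != y   # prediction does not match the choice

print(f"{middle_ground.mean():.0%} non-confident, {misclassified.mean():.0%} misclassified")

# Cases the model gets wrong *confidently* are the ones that most merit
# closer qualitative inspection.
confident_misses = misclassified & ~middle_ground
```

Separating the confident misses from the non-confident region is one simple way to operationalize the two changes in perspective proposed above.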