We will begin by reporting on two of our previous projects, the Simple English Wikipedia Corpus (2017) and the American Educational Research Association (AERA) corpus (2019), as examples of corpora created from freely available text databases. Both projects collect examples of authentic language that can be used for research and pedagogical purposes. We briefly introduce the two corpora below.
The Simple English Wikipedia Corpus was created in 2016 from the user-contributed online encyclopedia Simple English Wikipedia (SEW). The SEW is written in simplified language and is intended to be an accessible reference for learners of English. We analyzed the vocabulary demands of the SEW with AntConc and Lextutor, drawing on established vocabulary lists. We found that the vocabulary requirements of the SEW are similar to those of the standard English Wikipedia (Hendry & Sheepy, 2017).
The American Educational Research Association (AERA) corpus was created in 2018 from the AERA open access repository, which collects conference papers submitted to the AERA annual conference. We used word lists such as the Academic Word List (Browne, Culligan, & Phillips, 2013) to assess the vocabulary required to read submissions in each division of the AERA annual conference.
For our workshop, we will first invite discussion of sources of authentic texts that participants could collect to build their own corpora. We will then demonstrate how to use the tools available on Lextutor to clean and compile a small corpus.
We will apply simple analytical techniques to the Simple English Wikipedia Corpus using AntConc to:
generate frequency lists,
determine the most frequent vocabulary items in a given corpus, and
use stop lists on the AntConc website to remove function words, comparatively common items, and academic vocabulary in the form of the Academic Word List (AWL).
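For participants who want to see what happens behind these steps, the following is a minimal Python sketch of a frequency list filtered by a stop list. It is not AntConc's implementation: the tokenizer, the tiny `STOP_WORDS` set, and the sample sentence are all assumptions made for illustration, and a real stop list would be loaded from a file.

```python
import re
from collections import Counter

# Hypothetical stop list for illustration only; a real list of function
# words would be much longer and loaded from a file.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(text, stop_words=frozenset()):
    """Return (word, count) pairs, most frequent first, skipping stop words."""
    counts = Counter(t for t in tokenize(text) if t not in stop_words)
    # Sort by descending count, then alphabetically, so ties are stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample text standing in for a corpus file.
sample = "The cat sat on the mat. The dog and the cat sat in the sun."
top_words = frequency_list(sample, STOP_WORDS)
print(top_words[:3])
```

Passing a different set as `stop_words` (for example, the items of an academic word list) removes those items in the same way, which leaves the less common vocabulary visible at the top of the list.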
Next, we will produce a vocabulary profile of the corpus using tools available on Lextutor, estimate its readability, and then compare two different texts from within the corpus to determine which vocabulary items they have in common.
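The two ideas in this step, lexical coverage and shared vocabulary, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Lextutor's method: it matches surface word forms rather than word families, the `HIGH_FREQUENCY` set is a hypothetical stand-in for a real frequency band, the two texts are invented, and readability estimation is not covered.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def coverage(tokens, word_list):
    """Return the proportion of running tokens that appear in word_list."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in word_list) / len(tokens)


def shared_vocabulary(text_a, text_b):
    """Return the set of word types that occur in both texts."""
    return set(tokenize(text_a)) & set(tokenize(text_b))


# Hypothetical word list standing in for a high-frequency band.
HIGH_FREQUENCY = {"the", "is", "a", "of", "in", "water", "and"}

# Invented sample texts standing in for two corpus files.
text_a = "Water is a liquid. The water in the sea is salty."
text_b = "Ice is frozen water. The ice melts in the sun."

tokens_a = tokenize(text_a)
profile_a = coverage(tokens_a, HIGH_FREQUENCY)
common = shared_vocabulary(text_a, text_b)

print(f"Coverage of text A by the word list: {profile_a:.1%}")
print("Shared vocabulary:", sorted(common))
```

Running `coverage` once per word list (for instance, one list per frequency band) gives a rough profile of how much of a text each band accounts for.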
Last, we will explore some of the tools available in #LancsBox, such as the keyword, collocation, and colligation tools, to show how one can look beyond vocabulary demands.
Participants will be invited to explore both the Simple English Wikipedia Corpus and their own corpora. Each section will also include relevant research and examples of how to use these techniques in the classroom. The end of the workshop will be open for participants to share their own experiences and ideas about how to better use corpora for research and pedagogical purposes.