Corpus linguistics

Exploring patterns of language use

Corpus methodology – the investigation of collections of text to explore patterns of language usage – is commonly employed in linguistics and unites a wide range of subdisciplines.

We have a strong tradition in this area of language research. A range of corpora are hosted at Macquarie, many of which have been built in our department.

Depending on the nature of the corpus, it’s possible to do research into topics as diverse as:

the characteristics of different spoken and written registers
the development of child language
language change over time
variation across regions.

Areas of interest

Within the Department of Linguistics, various research areas make use of corpora, including:

Corpora of children’s spontaneous speech production and of the input that they hear are essential to research on how children learn language.

We use existing corpora in the CHILDES database as well as purpose-built corpora to inform and extend our experimental work on children’s acquisition of sound structure, morphology, syntax and interaction.

Three of the audio (and/or video) corpora available on the CHILDES database were developed by researchers now at Macquarie:

The Providence (English) Database
The Lyon (French) Database
The Demuth Sesotho Corpus.

We use natural language corpora to study many kinds of social contexts, including:

media and political discourse
clinical consultations in medicine and psychotherapy
literary texts.

We draw on both specialised register-specific corpora (where the data comes from one kind of social context) and large multi-generic corpora, such as the British National Corpus.

A number of researchers working in the focus area of language variation and change use synchronic and diachronic corpora to investigate how languages vary across settings and over time.

Corpora can be used to investigate variation not just in what people say but also in how they say it.

AusTalk is a large state-of-the-art database of spoken Australian English from all around the country. Between 2011 and 2016, almost a thousand adults aged 18 to 83 were recorded at 15 different locations across all states and territories.

AusTalk represents regional and social diversity and linguistic variation of Australian English, including Australian Aboriginal English. Each speaker was audio and video recorded on three separate occasions to sample their voice in a range of scripted and spontaneous speech situations at various times.

Access AusTalk through Alveo.

We use different corpora of student writing to search for and investigate the micro- (lexico-grammatical) and macro-level (generic and rhetorical) features of discipline-specific genres.

The outcomes of the student writing corpus research will help different stakeholders in academia and beyond to deal with issues related to academic communication and literacy.

We also intend to develop local student writing corpora to complement ones such as the British Academic Written English (BAWE) corpus.

We use electronic corpora and quantitative corpus linguistic methods to analyse the linguistic features that set translated language apart from non-translated language.

We try to 'fingerprint' what makes translated language different from language that has not been translated, and develop hypotheses about the cognitive and social constraints that give rise to these features.

We also use corpus methods to investigate a variety of other research questions in translation and interpreting, including translation style and ideology in translation.

ICE (the International Corpus of English) has greatly facilitated the study of convergence and divergence among Englishes around the world. It currently contains 23 equivalent one-million-word corpora of both spoken and written English for regions including:

Australia
Great Britain
Hong Kong
India
Jamaica
New Zealand
the Philippines
South Africa.

For features that require larger amounts of data, the GloWbE (Global Web-based English) corpus provides multi-million-word collections of written text.

Our projects

See some of the corpus linguistics projects our researchers are working on.

The following corpora were collected at Macquarie and are available to researchers via the Figshare repository:

ACE (Australian Corpus of English): https://doi.org/10.25949/24629712.v1
ICE-AUS, the Australian component of ICE: https://doi.org/10.25949/24769173.v1
ART (Australian Radio Talkback corpus): https://doi.org/10.25949/24769434.v1

This project uses newly compiled comparable historical corpora of the British, Australian and South African Hansard to investigate how written English usage changes over time in three varieties of English.

This project is funded under a Macquarie University Research Development Grant (MQRDG 2017-2018).

This project investigates how regional varieties develop their local features while in contact with neighbouring varieties and 'super varieties' (such as American and British English).

The research will examine written, spoken and online discussion data from corpus collections of varieties of English such as Australian, Indian, New Zealand and Sri Lankan, so as to test whether more formal registers of writing (eg parliamentary records, newspapers) are more or less receptive to international English than informal conversation or online interaction.

This project is funded by a Universities Australia/DAAD grant (2018-2019) in partnership with the Justus Liebig University Giessen.

This project, initiated in 2006 and still very active, uses specialised corpora to find headwords and provide definitions for online termbanks focusing on:

academic areas for first-year students on TermFinder
the needs of the general public, in the areas of family law on LawTermFinder and cancer treatment on HealthTermFinder.

Our people

Meet some of the academics and students involved in this research.

Our current and recent research students are:

Ibrahim Alasmri
- PhD thesis title: The features of translated language across register and time: A corpus-based study of translation from English to Arabic
Hayyan Al-Roussan
- PhD thesis title: Translation of cultural references in the Arabic subtitling of feature films: A parallel corpus-based
- Supervisors: Professor Jan-Louis Kruger and Dr Nick Wilson (Macquarie), and Associate Professor Ashraf Fattah (Hamad Bin Khalifa University, Doha)
Eisa Asiri
- PhD thesis title: Translation strategies for culture-specific items in the Qur’an: A corpus-based descriptive study
Emi Iwasaki
- PhD thesis title: Medical Terms and Conceptualisation of Chest Pain: Differences in Scope for Healthcare Professionals
- Supervisors: Emeritus Professor Pamela Peters and Dr Adam Smith
Mi Gyeong Kim
- MRes thesis title: A corpus-based approach to community interpreting
- Supervisors: Dr Adam Smith and Dr Helen Slatyer
Yousef Sahari
- PhD thesis title: A corpus-based study of taboo language in Arabic subtitles
- Supervisors: Professor Jan-Louis Kruger and Dr Nick Wilson (Macquarie), and Associate Professor Ashraf Fattah (Hamad Bin Khalifa University, Doha)
Angela Turzynski-Azimi
- PhD thesis title: The representation of foreigners in Japanese newspaper discourse
- Supervisors: Dr Chavalin Svetanant and Dr Adam Smith
Xiaomin Zhang
- PhD thesis title: Investigating explicitation in children’s literature translated between English and Chinese
- Supervisors: Dr Jing Fang and Professor Haidee Kotze