Using Corpus Analysis Software to Analyse Specialised Texts
What is a corpus?
In corpus linguistics, a corpus (plural “corpora”) can be generally defined as ‘a collection of naturally-occurring texts in a computer-readable format which can be retrieved and analyzed using corpus analysis software’ (Kennedy, 1998; McEnery & Wilson, 2001; O’Keeffe, McCarthy, & Carter, 2007; Teubert & Cermakova, 2007).
Sources of language corpora
· Subscribe to a large corpus provider such as the British National Corpus (BNC)
· Use web concordancing
- http://corpus.leeds.ac.uk/protected/query.html (general corpus; English)
- http://corpus.byu.edu/ (general corpus; American/British English)
- http://lextutor.ca/conc/eng/ (general and specialized corpora; English)
· Compile your own corpora and analyse the data using corpus analysis software
- ‘AntConc’ (http://www.antlab.sci.waseda.ac.jp/software.html) (for monolingual corpora)
- ‘WordSmith’ (http://www.lexically.net/wordsmith/) (for monolingual corpora)
Designing a specialized corpus
Corpus size
· There are no fixed rules; corpus size depends on research purposes, availability of data, and time.
· Large, general corpora may be less useful than small, focused corpora if searches are made on context-specific terms.
· Corpora that are ‘too small’ have limitations, e.g. they may not contain enough of the concepts, terms, or patterns under investigation.
· It is preferable to create a ‘monitor’ or ‘open’ corpus (one that can be updated with new texts over time) because specialized words and usage are dynamic.
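To judge whether a corpus is ‘too small’, it helps to know how many running words (tokens) and distinct words (types) it contains. The following is a minimal Python sketch of that count, not the method used by any particular corpus tool. The tokenizer is a deliberately simple assumption (lowercased alphabetic words), and the function names and sample sentences are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def corpus_size(texts):
    """Return (token count, type count, frequency list) for a list of texts."""
    freq = Counter()
    for text in texts:
        freq.update(tokenize(text))
    tokens = sum(freq.values())  # running words
    types = len(freq)            # distinct word forms
    return tokens, types, freq


# Two invented sentences standing in for a tiny specialized corpus.
sample_texts = [
    "The patient received the vaccine.",
    "The vaccine protects the patient.",
]
tokens, types, freq = corpus_size(sample_texts)
print(tokens, types)
print(freq.most_common(3))
```

If a term of interest appears only a handful of times in the frequency list, the corpus is probably too small for studying that term's patterns.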
Text extracts vs. full texts
· The choice depends on the aim of corpus compilation.
· Whole texts offer more coverage because the words or terms of interest may be randomly distributed throughout the text.
· Specific sections may be helpful if we are looking for words or phrases in particular content areas or want to create purposeful sub-corpora.
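One way to check whether a term is spread across a whole text or concentrated in one section is to cut the text into equal slices and count the term in each slice. The Python sketch below illustrates this under simple assumptions: the tokenizer keeps only alphabetic words, the number of slices is arbitrary, and the function name and sample text are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def dispersion(text, term, parts=4):
    """Count occurrences of `term` in each of `parts` roughly equal slices."""
    tokens = tokenize(text)
    counts = [0] * parts
    for i, tok in enumerate(tokens):
        if tok == term:
            # Map the token position to the slice it falls in.
            counts[i * parts // len(tokens)] += 1
    return counts


# Invented 12-word text with the term at the start, middle, and end.
sample = "corpus one two three four five six corpus eight nine ten corpus"
print(dispersion(sample, "corpus", parts=4))
```

Counts that are spread over most slices suggest that full texts are needed; counts that cluster in one slice suggest that an extract of that section may be enough.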
Number of texts
· A choice can be made between collecting a few large texts and collecting a greater number of smaller texts.
· A choice can also be made between selecting texts written by one or two key writers or sources and selecting texts retrieved from different sources or written by different authors.
· The choice depends on your research focus, e.g. studying overall language use or studying the idiosyncrasies or linguistic choices preferred by particular writers.
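When the focus is on the preferences of particular writers, word counts need to be normalized, because writers contribute different amounts of text. The Python sketch below computes a rate per 1,000 tokens for each writer. It is only an illustration: the writer labels, the sample sentences, and the simple alphabetic tokenizer are all assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relative_freq(texts_by_writer, word, per=1000):
    """Return the frequency of `word` per `per` tokens for each writer."""
    rates = {}
    for writer, texts in texts_by_writer.items():
        tokens = [tok for text in texts for tok in tokenize(text)]
        count = tokens.count(word)
        # Avoid dividing by zero when a writer has no text.
        rates[writer] = per * count / len(tokens) if tokens else 0.0
    return rates


# Invented sentences for two hypothetical writers.
sample = {
    "writer_a": ["However, the results vary. However, the trend holds."],
    "writer_b": ["The results vary, but the trend holds."],
}
rates = relative_freq(sample, "however")
print(rates)
```

With real data, each writer's list would hold the full texts collected for that writer, so the rates can be compared even when the sub-corpora differ in size.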
Medium
· Texts can be spoken, written, or mixed.
· The choice depends on the research questions.
· Some practical factors should also be considered, e.g. compiling spoken corpora can be time-consuming and needs special types of tagging (= giving codes to the data, e.g. for turn-taking and paralinguistic features).
Subject and text type
· The corpus should mainly focus on the specialized texts under investigation, although this is less clear-cut in multidisciplinary subjects.
· Texts may come from different subjects if the research focus is on the study of particular language features rather than term extraction.
· Text types within a specialized subject field may vary from ‘expert-to-expert’ texts to ‘expert-to-non-expert’ texts, or in other words, from technical to popular texts.
Other considerations
· Authorship: Texts written by experts in a field tend to present more reliable and authentic examples of specialized language.
· Language: Specialised texts can be stored and retrieved in the form of monolingual, comparable, or parallel corpora.
· Publication date: Texts should come from recent publications unless queries are made in relation to particular periods of time.
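Criteria such as authorship, language, and publication date are easier to apply if each text is stored with a small metadata record. The Python sketch below shows one possible way to do this. The record fields, file names, and the `select` helper are hypothetical and are not part of any corpus tool mentioned here.

```python
from dataclasses import dataclass


@dataclass
class TextRecord:
    """Metadata for one text in the corpus (all fields are illustrative)."""
    filename: str
    author: str
    language: str
    year: int
    expert_author: bool


def select(records, language=None, min_year=None, experts_only=False):
    """Return the records that satisfy every criterion that was given."""
    chosen = []
    for record in records:
        if language is not None and record.language != language:
            continue
        if min_year is not None and record.year < min_year:
            continue
        if experts_only and not record.expert_author:
            continue
        chosen.append(record)
    return chosen


# Invented records for three texts.
records = [
    TextRecord("a.txt", "Author A", "en", 2005, True),
    TextRecord("b.txt", "Author B", "en", 1990, True),
    TextRecord("c.txt", "Author C", "th", 2006, False),
]
recent_english = select(records, language="en", min_year=2000, experts_only=True)
print([record.filename for record in recent_english])
```

Leaving a criterion unset keeps it out of the filter, so the same records can serve a study of a particular period simply by changing the year condition.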
Sources of specialized texts
· Printed materials
· Word documents
· CD-ROMs
· Texts on the Web
· Online databases
Getting started with AntConc
Download the latest version of AntConc and watch the YouTube tutorials at http://www.antlab.sci.waseda.ac.jp/antconc_index.html
1. Run the program.
2. Choose Open Files (to browse and select the target files) or Open Dir (to select a target folder).
3. Choose the function you want to use.
4. Choose Clear All Tools and Files before opening new files.
5. Choose Save Output to Text File to save output, e.g. concordance lines.
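To make clear what a concordance line is, the following Python sketch builds simple keyword-in-context (KWIC) lines from a text. It is not how the program itself works internally; it is a minimal illustration in which the function name, the context width, and the sample sentence are all assumptions.

```python
import re


def kwic(text, keyword, width=2):
    """Return (left context, keyword, right context) for each hit in the text."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Invented sample text.
sample = "A corpus is a collection of texts. Each corpus can be analysed with software."
lines = kwic(sample, "corpus", width=2)
for left, node, right in lines:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20}  {node}  {right}")
```

Aligning the keyword in a central column is what lets recurring patterns to its left and right stand out when many lines are read together.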