Corpus of Spoken Greek

The Institute's Corpus of Spoken Greek is part of the Greek Talk-in-interaction and Conversation Analysis research project, directed by Professor Th.-S. Pavlidou. It was originally designed for the qualitative analysis of language and linguistic communication, especially from the perspective of Conversation Analysis, which accounts for its special features. Part of the Corpus, however, is available online and can also be used for quantitative analysis.

Features of the Corpus of Spoken Greek

The need for a corpus of spoken Greek arises primarily from the priority that modern linguistics gives to spoken over written discourse. The findings of sociolinguistics, however, indicate that the study of spoken discourse ought to be grounded in language material drawn from naturally occurring communication, that is, from circumstances that allow for its spontaneous and unconstrained production. As a consequence, compiling a corpus of spoken discourse poses a series of challenges for researchers, ranging from overcoming the so-called 'observer's paradox' to securing the participants' consent to the tape-/video-recording. These challenges do not arise in the case of corpora of written discourse, especially corpora of published texts.

The Corpus of Spoken Greek was originally designed for the qualitative analysis of language and linguistic communication, especially from the perspective of Conversation Analysis. Consequently, particular emphasis is placed on the transcription of tape-/video-recorded material as the faithful representation of sound reality.

For Conversation Analysis, transcription is neither a mechanical procedure (cf. the transcription software available on the market) nor restricted to rendering content (cf. interviews as printed in newspapers). On the contrary, the 'translation' of sound into writing presupposes theoretical processing and analysis as well as relevant training, and requires multiple rounds of 'correction' by different individuals.

As a result, the transcribed texts of the Institute's Corpus of Spoken Greek depart from the standard orthographic representation of spoken discourse in that additional symbols are used to mark overlaps, pauses, intonational and other features of spoken discourse (see Transcription symbols). The texts also differ in the degree of precision with which they have been transcribed.

Size and discourse types of the Corpus of Spoken Greek

The Corpus of Spoken Greek is a set of digital files that is updated and enriched according to the research project’s resources and needs. The Corpus consists of three parts (see Pavlidou 2016: 41-68):

1. Audiovisual material: It comprises tape-/video-recordings of naturally-occurring communication.

2. Transcribed material: A subset of the audiovisual recordings (cf. 1.) has been transcribed according to the conventions of Conversation Analysis. This material is drawn from different discourse types, with varying degrees of formality, more specifically:

  • everyday conversations among friends and relatives (sample)
  • telephone calls (sample)
  • classroom interaction (sample)
  • television news bulletins (sample)
  • television interviews with politicians (sample)
  • interviews/discussions with Greeks of the diaspora (sample)
  • other

The transcribed material exceeds 2.0 million words. Transcriptions vary in detail and quality.

3. Online material: Part of the transcribed material is available at corpus-ins.lit.auth.gr/corpus/index.html and can be used freely online. It currently consists of:

  • 40 everyday conversations among family and friends
  • 145 telephone calls
  • 17 television interviews with politicians

Access and conditions of use

The Corpus of Spoken Greek was originally compiled for the qualitative analysis of language and linguistic communication, mainly from the perspective of Conversation Analysis. Part of the (transcribed) Corpus, however, can be used for quantitative analysis as well, and is freely available online (corpus-ins.lit.auth.gr/corpus/index.html).

For qualitative analysis, if additional material (besides that available online) is required, the Institute of Modern Greek Studies can provide access to further files, the project’s resources permitting.

Access is
a) contingent on the detailed explanation of the reasons that necessitate particular types/quantity/form of material, 
b) subject to the Institute’s discretion. 

To request additional material, please complete the request form and send it either by post to the following address:

    Professor Th.-S. Pavlidou
    Corpus of Spoken Greek
    Institute of Modern Greek Studies [Manolis Triandaphyllidis Foundation]
    Aristotle University of Thessaloniki
    GR-541 24 Thessaloniki, Greece 

or electronically by e-mail.

Once the signed form has been received, a CD with the requested material will be mailed to you. The CD must be returned to the Institute of Modern Greek Studies (IMGS) when your research is completed.

Related bibliography

Pavlidou, Th.-S. (ed.). 2016. [in Greek] Making a Record of the Greek Language. Thessaloniki: Institute of Modern Greek Studies.

Pavlidou, Th.-S. 2012. The Corpus of Spoken Greek: Goals, challenges, perspectives. LREC Proceedings, Workshop 18 (Best Practices for Speech Corpora in Linguistic Research), 23-28.

Pavlidou, Th.-S., Ch. Kapellidi & E. Karafoti. 2014. The Corpus of Spoken Greek (CSG). In: Best Practices for Spoken Corpora in Linguistic Research, Ş. Ruhi, M. Haugh, T. Schmidt & K. Wörner (eds), 56-74. Newcastle upon Tyne: Cambridge Scholars Publishing.