Repeated-observations design in language documentation

The methodological approach of this project is based on a repeated-observations design in the development of documentation materials. This decision rests on the assumption that linguistic properties vary in at least two dimensions: (a) the variation between speakers, which is pervasive in an endangered language, especially since all speakers are at least bilingual; (b) the variation between linguistic objects, i.e., different words, different lexicalizations of the same syntactic structure, different types of communicative events, etc.

Since these sources of variation are taken for granted, any linguistic description, i.e., any generalization about data, should be checked against the variation in these two dimensions. Based on these considerations, our vision is to create a database that systematically involves repeated observations on different linguistic objects and different speakers, allowing for comparisons between these variables.

In this spirit, our LEXICON and SENTENCE collections contain parallel datasets by four native speakers. Our TEXT corpus contains five text types that were produced by sixteen speakers (with the same instruction).
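As an illustration of what a repeated-observations design implies for the database, the following minimal sketch checks whether every linguistic object has been observed with every speaker. It is not the project's actual tooling: the function names, the speaker labels, and the item labels are hypothetical, and observations are assumed to be stored as simple (speaker, item) pairs.

```python
from collections import defaultdict


def coverage_gaps(observations, speakers, items):
    """Return the (speaker, item) cells that have no observation.

    `observations` is assumed to be an iterable of (speaker, item) pairs.
    In a fully crossed design the returned list is empty.
    """
    seen = set(observations)
    return [(s, i) for s in speakers for i in items if (s, i) not in seen]


def speakers_per_item(observations):
    """Group observations by item, so that variation between speakers
    can be inspected for each linguistic object separately."""
    by_item = defaultdict(set)
    for speaker, item in observations:
        by_item[item].add(speaker)
    return dict(by_item)


# Hypothetical placeholder data: two speakers, two items.
observations = [("A", "item1"), ("B", "item1"), ("A", "item2")]
gaps = coverage_gaps(observations, ["A", "B"], ["item1", "item2"])
grouped = speakers_per_item(observations)
```

An empty result from `coverage_gaps` would indicate that a generalization can be checked against both dimensions of variation for every item in the collection.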

Primary data for language documentation

The aim of a language documentation project is to collect and preserve the primary data that linguists gather in the field. These data comprise utterances in the documented language and the native speakers' intuitions about the interpretation of these utterances. In order to achieve this aim, our project archives:

- the audio file of Urum utterances,

- a native transcription according to a convention devised by the project (since the community does not have a writing system): this transcription does not necessarily meet the requirements of phonological accuracy, since it is made by a native speaker without linguistic training. It may contain interference from orthography (especially in Russian words) or influence from the idiolect of the transcriber. The value of this transcription is that it provides direct evidence of how a native speaker perceives the sounds.

- a normed word-by-word translation: it is crucial that we do not provide glossed texts (i.e., a morpheme-to-morpheme translation) of the object language. The segmentation into minimal units of meaning and the morphemic translation (gloss) constitute an abstract level of analysis that necessarily depends on further assumptions, which may be established within a linguistic framework. This is not the point of our documentation: what we want to archive in our database is the intuition of the native speaker about the translational equivalent of any potentially free unit (which roughly corresponds to a word).

- a free translation: a free translation is a global translation of the utterance in a particular context. It may contain information arising through pragmatic inferences. What we are interested in is the speaker's intuition about the global meaning of the sentence in the particular context; the composition of the sentence meaning out of the meanings of the individual units is a further target that depends on our analysis of the primary data.
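The four archived layers listed above could be represented as a single record per utterance. The sketch below is only one possible representation, not the project's actual database schema: the class name, the field names, and the file name are hypothetical, and it assumes that the native transcription and the word-by-word translation are stored as word lists of equal length.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ArchivedUtterance:
    """One archived utterance with its four layers (hypothetical schema)."""

    audio_file: str               # path or identifier of the recording
    transcription: List[str]      # native transcription, one entry per word
    word_translation: List[str]   # word-by-word translation, one entry per word
    free_translation: str         # global translation in context

    def aligned_words(self) -> List[Tuple[str, str]]:
        """Pair each transcribed word with its translational equivalent.

        Assumes a one-to-one alignment between the two word lists;
        raises ValueError if their lengths differ.
        """
        if len(self.transcription) != len(self.word_translation):
            raise ValueError("transcription and word translation differ in length")
        return list(zip(self.transcription, self.word_translation))


# Placeholder values only; these are not actual Urum data.
record = ArchivedUtterance(
    audio_file="utterance_001.wav",
    transcription=["word1", "word2"],
    word_translation=["equivalent1", "equivalent2"],
    free_translation="free translation of the utterance",
)
pairs = record.aligned_words()
```

Keeping the free translation as an unanalyzed string, separate from the word-level pairs, reflects the point made above that the composition of sentence meaning is a matter of later analysis rather than of the archived primary data.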