Grant Recipient: Univ.-Prof. Dr. Lars Johanson
Seminar für Orientkunde, Johannes Gutenberg-Universität Mainz.
Proposal Author: Dr. Arienne M. Dwyer
Proposal Date: 26 November 1999
This year-long project aims to develop an accessible multimedia data management strategy for archiving and analyzing spoken language. Using our own data on two unrelated unwritten endangered languages of Inner Asia as a basis, we plan to produce a demonstration archival multimedia CD and test an interactive, query-able Web site to be eventually available to the general public. These test products will be developed with cross-platform, software-independent functionality in mind, as much as the current state of the art allows: the basis will be a multilingual database with text markup in XML.
Salar (Turkic) and Monguor (Mongolic) provide two rigorous test cases of endangered cultural documentation: audio and video files will be accompanied by levels of transcription (phonemic, phonetic, orthographic), as well as samples with morphological markup and musical notation.
We are an interdisciplinary team of anthropologists, a linguist, and a humanities computing expert, with local researchers at our core. Existing digital sound and video recordings will be supplemented by a short fieldwork period.
Salar and Monguor belong to the greater Sprachbund of Amdo Tibet, today part of China’s northwestern Qinghai province. Both are unwritten languages with rapidly diminishing numbers of fluent native speakers, particularly in the younger generation. Oral art forms central to both cultures are only known to the oldest generation; neither schooling nor media are available in these languages. Both languages are under intense pressure from the dominant languages, Chinese and Tibetan. Several plans by native speakers to develop orthographies and document their own languages have foundered due to a lack of equipment, computer know-how, and publication possibilities.
The value of documenting these languages extends far beyond their “mere” preservation for speakers, semi-speakers, and specialists toiling away on isolated and obscure languages. Both languages constitute crucial “missing links” in the study of Central Asian linguistic development in their preservation of archaic features of their respective language families (Turkic and Mongolic). They also provide two sets of crucial data on an under-investigated linguistic area, and are thus of value for language contact and creolization studies.
This pilot project uses data from two previous collaborative research projects on the two languages as the basis for a multimedia endangered-language database prototype. The database will have both an archival and a research function: it will archive audio, textual, and visual data, and can be searched for linguistic or cultural data. Such a generic database can serve as the foundation for a variety of future end products, depending on the project: archival CDs, interactive multimedia CDs and/or Web sites, dictionaries, grammars, a book/CD of texts, of songs, rituals, etc. The goal is to use the widest possible variety of data from two unrelated languages to develop a model which can be used and adapted for research on other languages and cultures. Target end-users are native speakers – also potentially children – as well as researchers in fields such as linguistics, folklore, ethnography, history, literature, and music.
Our research team brings together not only the data but also the personnel of two previous successful collaborative research projects on Salar and Monguor, respectively. We have many hours of high-quality digital audio recordings of a variety of oral art genres, as well as extensive still and video images. Much of this material has been transcribed; some has been published. It awaits, however, systematization into a central database, which we would develop and test in Mainz. The database system would be implemented and tested on our second data set in Xining, China, with the research team. Supplementary data would be collected as needed, and the database functionality would be evaluated, particularly by native-speaker-researchers. From this basis, a queryable user interface will be developed in Mainz to make the material available on CD-ROM and on the Web.
In the archiving phase (the first four months of the pilot project), existing Salar audio and transcribed audio data will be used to develop the database prototype: digital audio “texts” (e.g. a conversation or story) will be transferred to computer, and catalogued with metadata on the production and recording. Each audio text will be linked to a series of transcriptions: phonetic and phonemic (here, t- and p-transcripts in the International Phonetic Alphabet), orthographic (o-transcripts in a practical orthography) and, if applicable, a transcript in musical notation. From each phonemic transcript, a morphemic m-transcript will be created. The four textual transcription systems allow the text to be searched for phonological, morphosyntactic, or lexical information; they also allow the entire text to be perused in a single transcription system, or in translation. This archiving phase can be schematized as follows:

In the next phase, the data is prepared for eventual dissemination (via e.g. CD or the Internet). This Tagging Phase involves marking up the data for morphological and grammatical information in a semi-automated process, first in a database (with XML proto-tags), to be later extracted to a stand-alone XML format.

In the Information Retrieval Phase (below), the linked relational database will be converted to an XML database file, consisting of a sequence of database records in XML form. Once accomplished, a user interface will be developed to allow full querying of the data in an interactive CD and beta Web site.
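The record-to-XML conversion described here can be illustrated with a minimal sketch in Python, using only the standard library. The record layout is an assumption made for illustration: the field names (`text_id`, `utterance_no`, `phonemic`, `translation`), the element names, and the sample values are hypothetical placeholders, not the project's actual schema or data.

```python
import xml.etree.ElementTree as ET

# Hypothetical records as they might be exported from the relational database.
# Field names and values are placeholders, not actual project data.
records = [
    {"text_id": "T001", "utterance_no": 1,
     "phonemic": "placeholder-1", "translation": "gloss 1"},
    {"text_id": "T001", "utterance_no": 2,
     "phonemic": "placeholder-2", "translation": "gloss 2"},
]

def records_to_xml(rows):
    """Build one <database> element holding a sequence of <record> elements."""
    root = ET.Element("database")
    for row in rows:
        record_id = f"{row['text_id']}.{row['utterance_no']}"
        rec = ET.SubElement(root, "record", id=record_id)
        # Each database field becomes one child element of the record.
        for field_name, value in row.items():
            ET.SubElement(rec, field_name).text = str(value)
    return root

xml_root = records_to_xml(records)
xml_text = ET.tostring(xml_root, encoding="unicode")
```

The resulting `xml_text` is a single document in which every database record appears in sequence, which is the shape a later query interface could read without access to the original database software.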

It is expected that Phases I & II would be implemented initially in Mainz (months 1-4); this data system prototype would be brought to our field site in Qinghai, China for testing on our second data set with native-speaker-researchers, continued tagging, and model refinement (months 5-6); and the latter half of the year (months 7-12) would be devoted to developing front-end user delivery systems (i.e. the basis for a multimedia CD and a Web site). Concomitantly, the teams in China and Germany would continue to mark up data in order to enlarge our test database.
The emphasis of this pilot project is on developing a data systems prototype that can be used for any language-documentation research. It is designed with collaborative (native-speaker and outsider) research in less-than-ideal circumstances in mind, hence (1) a two-month field consultation phase is essential, and (2) in data system design every effort has been made to make a variety of end products (archival text-only CDs, stand-alone multimedia CDs, a Web site) which are as accessible as possible (e.g. avoiding where possible expensive hardware or software).
For the last eight years, our team has been (separately and together) collecting and analyzing a broad spectrum of high-quality audio and visual data on the two languages in question. We also bring to the project years of experience in field-testing realistic, workable data-management systems for linguists.
i. Salar and Monguor: Genetic affiliation and language-contact situation.
· Genetic affiliation (some scholars claim both languages belong to a common Altaic family)
Salar: Family: Oghuz (SW) Turkic Main varieties: Eastern, Western
The Salars are in origin Oghuz Turkic[1] from Central Asia (Transoxiana), who settled in their present homeland in Northern Tibet (now Qinghai) over six centuries ago; small Salar populations are found in other parts of Qinghai, neighboring Gansu, and in the Xinjiang Uigur Autonomous Region. The Turkic component of the language preserves many important Old Turkic features no longer found in the other Turkic languages of the region (Dwyer 2000); Salar remains one of the least-investigated Turkic languages. Salar has two dialects, Eastern (the main Salar dialect, spoken in Xunhua, Hualong, and Gansu) and Western (in Ili, Xinjiang); their considerable differences are due almost entirely to language contact (see below).
Monguor: Family: SE Mongolic[2] Main varieties: Minhe, Huzhu, Niandehu/Baoan, Wutun
Middle Mongolic (13th-14th c.) phonological and lexical features in Monguor indicate that its speakers were geographically separated from the central Mongolian groups and settled in their present locale during this period, if not earlier. Chinese historical records suggest that Monguor ethnogenesis, like that of the Salars, began with their arrival in Amdo Tibet in the early 13th c. as a part of the invading Mongolian army; certain oral accounts suggest an earlier migration from Mongolia proper.
Within what we here call “Monguor” (known in Chinese scientific literature as Tu), four main language varieties can be identified: Huzhu, Minhe, Niandehu/Baoan, and Wutun. The latter two are highly divergent language communities in three Tongren county townships; Wutun has aroused a good deal of interest in recent creole and language contact research, but remains woefully under-investigated. Variation between the four varieties is also due to the intensity of language contact, especially with Tibetan.
· Language contact
The language contact situation in Inner Asia/Northwest China is as complex as in the Balkans, and invites typological comparison with contact environments in other parts of the world. This project is part of a broader effort to develop a clear and detailed picture of the individual diachronic developments and language interactions of this particular region.
Both Salar and Monguor have been spoken for at least the last 600 years in the epicenter of a Tibetan-Mongolic-Chinese contact zone. Bi- and trilingualism (in the native language plus Northwestern Chinese and/or Amdo Tibetan) is the norm for much of the population, particularly for males. Education and media are available only in Chinese and, depending on the region, to a lesser extent also in Tibetan. Academics have created an orthography for each language, but they are not used among the population.[3]
The significant variation between certain sub-varieties of Salar and Monguor is due to the length and intensity of contact with the two dominant languages. Within Eastern Salar, Xunhua exhibits Chinese-type innovations, while Hualong county across the river shows the most Tibetan contact-induced variation. The geographically-distant Western Salar, with Uyghur and Kazakh as contact languages, is mostly unintelligible to Eastern Salar speakers. Huzhu Monguor exhibits significant variation (Limusishiden & Stuart 1998), likely due to Tibetan influence; other areas, such as Minhe, show less variation.
ii. Degree of Endangerment
The number of fully-fluent native speakers of both languages is decreasing rapidly.[4] The high official population statistics (163,800 Monguor, 77,300 Salars; 1996 Qinghai Statistical Yearbook) belie the low number of actual speakers: in many areas, only 30 percent of the population has an active command of the language.[5] Generational differences are particularly acute for certain areas (for Monguor, in all areas; for Salar, in Xunhua): in both groups, speakers over 60, particularly women, are fully fluent and competent in oral art forms and are native-language-dominant; speakers over 35 have a passable-to-fluent command of their native language, but are multilingual and have no command of oral art forms; children may grow up with one of the dominant languages as their native language.
Those areas where Salar and Monguor language and folklore are best preserved are characterized by remoteness, extreme poverty, and lack of education. These include Munda and Ashnu (Eastern Salar), and villages in Huzhu and Minhe (Monguor).
The lack of native-language schooling and of a writing system available to Salar and Monguor children makes the future of these languages bleak.
iii. Project goals
Our overall goal is to establish a prototype for linguistic data management which we and other linguists can use to (1) archive, (2) annotate, (3) analyze, and (4) publicly disseminate the data. The prototype will allow for linguistic annotation at various levels, such as phonological, lexical, part-of-speech and morphological analysis, phrase-structure analysis, syntax, semantics, pragmatics, stylistics, etymology; it will also be developed with accessible, cross-platform computing applications in mind.
Our specific goals for the pilot year are as follows:
iv. Methodology
Archiving phase: sound data backup
All existing tape recordings (analog and digital = 25 hours) will be catalogued and copied onto CDs in Germany and China, and archived at the University of Mainz and the Qinghai Minorities College, respectively. One set of analog copies must be made for on-going transcription work in China. Existing digitized text transcripts of Salar, which already amount to ca. 400 pages of phonetic transcripts, will be continuously archived on CD.
Archiving phase: Database setup
The Mainz team will first establish the structure of the database, including laying the groundwork for later XML tags: what we here call XML proto-tags will be incorporated into the database.
A selected body of digitized audio recordings, their (phonetic) text transcripts, and all associated metadata (information about the speaker(s), locales, and circumstances of recording) will then be incorporated into this database. Phonetic texts will continue to be prepared and checked; Perl scripts will be written for text conversion (phonetic to phonemic, phonemic to orthographic); and practical orthographies will be established for the two languages (with native-speaker-scholar feedback in China). Texts will be indexed (linked) to their corresponding audio recordings at the utterance level.
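The text conversion step (phonetic to phonemic, phonemic to orthographic) amounts to applying ordered rewrite rules. The sketch below shows the idea in Python rather than in the scripting language named above, purely for illustration. The rules and the sample string are invented placeholders and make no claim about Salar or Monguor phonology or about the orthographies to be established.

```python
import re

# Hypothetical rewrite rules, each a (regex pattern, replacement) pair.
# Real rules would come from the phonological analysis of each language.
PHONETIC_TO_PHONEMIC = [
    (r"\u031a", ""),   # assumed: drop the "no audible release" diacritic
    (r"[ˈˌ]", ""),     # assumed: drop stress marks (prosodic marking)
]
PHONEMIC_TO_ORTHOGRAPHIC = [
    (r"ʃ", "sh"),      # assumed ASCII digraph for a lower-ASCII keyboard
    (r"ŋ", "ng"),      # assumed ASCII digraph
]

def apply_rules(text, rules):
    """Apply each rewrite rule in order to the whole text."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

def convert(phonetic):
    """Derive the phonemic and orthographic versions from a phonetic text."""
    phonemic = apply_rules(phonetic, PHONETIC_TO_PHONEMIC)
    orthographic = apply_rules(phonemic, PHONEMIC_TO_ORTHOGRAPHIC)
    return phonemic, orthographic

# Invented sample string, not attested data.
phonemic, orthographic = convert("ˈtak\u031a ʃaŋ")
```

Because each format is derived from the previous one, a correction to the phonetic text only needs to be made once; the derived texts can be regenerated and then proofread.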
Through this first test of the database with “live” data, we will be able to refine the data structures, proofread the texts, and make adjustments to the tag set as needed. Preliminary morphological tagging will be developed and tested during this phase; it is expected that a semi-automated tagger such as PC-Kimmo will be used. Details of the file annotation and text markup are as follows:
Archiving phase: file annotation
Text markup
A research assistant in Mainz will concurrently be entering text data, and comparing sister transcript versions for accuracy. In Xining, China, the Monguor researchers on our team will evaluate orthography recommendations and transcribe collected data for digitization.
Collaboration and Fieldwork Phase
Dwyer will meet the Qinghai team in Xining, China; an existing desktop computer will be upgraded and the database system and associated software and scripts will be installed. Dwyer will then train the team on audio digitization and the data management system, including creating and proofreading of the parallel text transcriptions. Selected existing videos will be digitized and incorporated into the database and discussed.
A one-month fieldwork period will directly follow, principally to collect audio data on Wutun, Datong, and Tianzhu Monguor, likely conducted primarily by Li and Zhu. The target fieldwork sites will be further determined and investigated as needed. Dwyer will likely collect supplementary audio and visual material on the Salars, especially if video opportunities are present. After fieldwork, both existing and new data will be integrated into the database, with structural adjustments made as needed. With training, data entry, and intensive collaboration, the phase will take a month. At the end of the collaborative period, archival CDs will be made of the database and a number of selected audio and video recordings.
Information Retrieval Phase
Back in Mainz, new and existing Salar still and video images will be incorporated into the database, in accordance with the MPEG-4 standard. This database will then be prepared for platform-independent applications: its markup, which will already include XML proto-tags, will be converted to full XML markup.
This implies the conversion of database records (fields) into one large sequenced XML format file. This XML database file, together with the user query interface being developed, will form the basis for our queryable multimedia beta products. For the interactive multimedia CD, our goal is a stand-alone product that an interested client in China or elsewhere can use without special software on a computer that is not state-of-the-art. For this reason, the client interface (and continued database work) will be done in Filemaker Developer. The pilot CDs would include a short video with accompanying transcription, short sample annotated texts and linked audio files, and a dictionary-like interface (with audio and some visual images) queryable for phonetic, morphosyntactic, and metatextual information. We expect to make use of the international standard for multimedia, SMIL (Synchronized Multimedia Integration Language), which is under development by the W3C (World Wide Web Consortium).
The queryable endangered-language Web site will be developed much on the same principles (accessibility and cross-platform XML emphasis) and with much the same content, as far as can be developed in our very limited time. Existing shareware authoring and browsing tools will be used where appropriate. We anticipate laying the groundwork for a beta XML Web site and for a Web site which allows access to our Filemaker database.[6]
Concurrently, in China text transcription and audio/video capture of Monguor materials will continue. In Mainz, the research assistant will continue to enter and proofread data and XML markup.
i. An intelligent balance of Multimedia and Text Applications
In recent years, there has been an enormous amount of hype for multimedia applications. Having worked on such projects ourselves, we would caution that one should carefully weigh the considerable time investment involved in multimedia productions against their advantages. Well-designed and carefully constrained multimedia projects support basic linguistic research beautifully, by providing the original data on which transcriptions are based, and drawing in native-speaker or generalist-outsider users who might find a pure text-based application too abstract. But even small multimedia applications take enormous amounts of resources[7]; worse, we have seen far too many projects that are so heavy on multimedia that linguistic analysis is lacking. We would do well to remember that our principal focus is the delivery of data for long-term scientific reference and research. That is best accomplished, we believe, with searchable text transcriptions (linked to audio files) at the core, combined with an intelligent selection of still and moving images. Our goal is to deliver a completed product: to make a data management prototype that is of realistic size and scope, and is useful to other linguists.
ii. Parallel text formats are necessary to reach the broadest audience
Transcribed data from unwritten languages adds several layers of complexity to decisions about data markup, storage, and querying. Each transcription and subsequent text format is an act of interpretation; thus, the original recording should always be made available. To make such recordings available to the widest audience, however, a variety of transcriptions are imperative. Each of the text file formats proposed here addresses a particular need: the phonetic text – a narrow International Phonetic Alphabet transcription with additional prosodic marking – constitutes the transcription on which all other text formats are based. It and the phonemic text are primarily of interest to phonologists. The phonemic text (essentially, the phonetic text without redundant linguistic features) is derived semi-automatically by rule application from the phonetic text. The morphemic text is the phonemic text annotated with morphosyntactic information, of interest to grammarians, typologists, etc. All of these parallel texts (parallel in the sense that they stem from and are linked to a unique audio recording) also contain a header with metatextual information on the speaker(s), locale(s), and recording circumstances. This is of broad interest to folklorists, sociolinguists, and anthropologists, for it allows material to be isolated according to gender, age, education level, or region.
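To illustrate how a shared metatextual header lets material be isolated by speaker or locale, here is a minimal Python sketch. The class names, field names, file names, and sample values are all hypothetical; the actual header fields would be fixed during database design.

```python
from dataclasses import dataclass, field

@dataclass
class Header:
    """Metatextual information shared by all transcripts of one recording."""
    speaker_gender: str
    speaker_age: int
    locale: str

@dataclass
class ParallelText:
    """One audio recording with its header and linked transcripts."""
    audio_file: str
    header: Header
    # e.g. {"phonetic": "...", "phonemic": "...", "morphemic": "..."}
    transcripts: dict = field(default_factory=dict)

def select(texts, **criteria):
    """Return the texts whose header matches every attribute=value pair."""
    return [
        t for t in texts
        if all(getattr(t.header, key) == value for key, value in criteria.items())
    ]

# Placeholder corpus; not actual recordings or speakers.
corpus = [
    ParallelText("rec01.wav", Header("f", 67, "Locale A")),
    ParallelText("rec02.wav", Header("m", 41, "Locale A")),
    ParallelText("rec03.wav", Header("f", 38, "Locale B")),
]

female_texts = select(corpus, speaker_gender="f")
```

A call such as `select(corpus, speaker_gender="f", locale="Locale A")` narrows the selection further, since every criterion must match.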
These three text formats – phonetic, phonemic, and morphemic – constitute the core transcriptions.
To address a wider audience, there are three additional linked text formats: orthographic, translation, and where appropriate, musical transcription. The orthographic texts – essentially phonemic texts modified to accommodate a lower-ASCII keyboard – are designed to be read by interested native speakers and generalist readers. Translations are keyed on an utterance-by-utterance basis to the core transcriptions, and are as such an interpretation aid for linguists, as well as a stand-alone source for folklorists. Finally, musical transcriptions can be of immense value in areal-typological ethnomusicology.
A selection of these marked-up texts and their associated audio constitute the groundwork for interactive multimedia CDs and an endangered-language Web site. Dwyer has worked on a similar project aimed at the revival of a highly-endangered Native American Indian language; she found that such projects can excite interest not only among academics, but among native-speaker (or semi-speaker) children as well. Suddenly being able to write and type in the native language can rouse great enthusiasm among teenagers; having a CD with language tools and learning games can be a source of community pride.
We will thus lay the basis for interactive multimedia CDs for both languages and for a Web site, with the following components: (1) an interactive sound-text-image dictionary, where any word can be clicked on to hear it pronounced, both in isolation and in the context of a sample utterance, and where speaker information can be accessed; (2) videos of rituals with side-by-side scrolling transcription and translation; (3) an interactive grammatical description.
Our linked parallel transcripts could, in a later phase of the project, be made available in a parallel-corpus-like format, and we would build on the experience of other linguistic corpora available on the Web.[8]
iii. On the choice of XML as a text markup language
XML (eXtensible Markup Language) is a markup language expressly designed for use on the Web. It approximates a combination of HTML (HyperText Markup Language, the markup used on the Web), with a subset of SGML (Standard Generalized Markup Language). The latter is a more comprehensive standard that is, however, not nearly so widely used and has little available software. XML, as the coming standard for Web markup, has in contrast ever more available compatible applications. XML thus offers the comprehensiveness of SGML and the accessibility of HTML. Public-domain tools are used to validate all documents.
An advantage of marking up text in XML is that, through the linking of content and tagging, one can have both text and meta-information in one place. By contrast, in a database each tagged item must be duplicated redundantly in a linked database. Our choice of tagging tools (at present, EditTime or XMetal) is dictated by the following concerns: that tags and/or edited text can be viewed, and that the tools are both Unicode- and SGML-compatible.[9] We can thus make use of the Text Encoding Initiative guidelines.
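A small illustration of text and meta-information held in one place: the fragment below keeps the header, the audio link, and the morpheme-level annotation in a single XML document, which Python's standard library can then read. The element and attribute names (`text`, `header`, `u`, `m`, `pos`) and all values are hypothetical placeholders; they are not the project's tag set and are not taken from the Text Encoding Initiative guidelines.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one text with its header, one utterance linked to an
# audio file, and two annotated morphemes. All values are placeholders.
SAMPLE = """
<text id="T001">
  <header>
    <speaker gender="f" age="67"/>
    <locale>Locale A</locale>
  </header>
  <u n="1" audio="rec01.wav" start="0.0" end="2.4">
    <m pos="N">stem</m><m pos="CASE">suffix</m>
  </u>
</text>
""".strip()

doc = ET.fromstring(SAMPLE)
utterance = doc.find("u")

# The running text and its annotation are read from the same elements.
morphemes = [(m.text, m.get("pos")) for m in utterance.findall("m")]
plain_text = "".join(m.text for m in utterance.findall("m"))
speaker_age = int(doc.find("header/speaker").get("age"))
```

Here the same `<m>` elements yield both the plain text and the part-of-speech annotation, so nothing has to be duplicated in a separate linked table.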
iv. On the choice of a one-time intermediary database phase
We do not propose to go directly from raw data to texts marked up in XML, because the complexities of the hierarchical markup of spoken-language data are such that the markup process would ultimately be slower than creating an intermediate database. Our ultimate goal, however, remains an XML-based application on the Web. Of the presently-available options for managing raw data into a widely-accessible, queryable form, the best solution we see is the use of a database to accelerate the otherwise long process of developing and testing a tagging system to get the data into a queryable system. It is our experience that it is only during the course of compiling a database that the final set of necessary data structures come to light. This rapid prototyping would be, we should stress, a one-time measure for the development of the database model that will be thereafter ported to XML. Future projects based on this model would be directly marked up in XML.
This procedure is not unusual; Web applications usually originate from non-Web projects, even when the opposite is claimed. The advantages to an intermediary-database approach are: (1) data structures can be easily created without having to define a formal data model within the text, (2) fields can be defined and altered in proofreading passes much more easily than adding and altering XML tags, and (3) database applications generally run more quickly. In any case, it is crucial that the planned XML markup be incorporated into the Filemaker database system, in order to facilitate the conversion of data.
We chose Filemaker because it allows combining different fonts in one field, which is useful for multilingual work. Importantly, this facilitates intermediary work in non-Unicode (8 bit) fonts for older computers – crucial for our collaborative work in China. In short, it will be a learning database that is powerful enough to accomplish our goals.
A query system for linguists should allow for the following information to be searched for: (1) partial text searches, of lexical information (e.g. headword, variants, part of speech, etymology, with examples in textual context) and grammatical information, and (2) full text searches, of realia (things, people, and events), and of macrolinguistic information (such as pragmatic and discourse information).
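A partial-text search over lexical information, of the kind listed under (1), might look like the following minimal Python sketch. The entry structure, the function name `search`, and the sample lexicon are hypothetical placeholders; an actual implementation would query the database or the XML file.

```python
import re
from dataclasses import dataclass

@dataclass
class Entry:
    """A simplified lexical entry; real entries would carry more fields."""
    headword: str
    part_of_speech: str
    example: str

def search(entries, pattern, part_of_speech=None):
    """Partial-text search on headwords, optionally restricted by part of speech."""
    rx = re.compile(pattern)
    return [
        e for e in entries
        if rx.search(e.headword)
        and (part_of_speech is None or e.part_of_speech == part_of_speech)
    ]

# Placeholder lexicon; these are not words of Salar or Monguor.
lexicon = [
    Entry("alpha", "N", "example sentence 1"),
    Entry("alphabet", "N", "example sentence 2"),
    Entry("beta", "V", "example sentence 3"),
]

prefix_hits = search(lexicon, "^alph")
```

Using a regular expression as the search pattern allows prefix, suffix, and substring queries through one interface, for instance `search(lexicon, "a$", part_of_speech="V")` for verbs ending in "a".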
The most straightforward approach to make sample data available on a beta Web site and on a multimedia CD is at first a stand-alone application made with Filemaker Developer. We would integrate short videos and audio clips as multimedia files that the user can access with a mouse-click, together with texts searchable for linguistic information.
These would constitute the rapid prototypes. Then a limited part of the data (several texts and their linked audio files) would be converted to XML and would be saved in an XML-compatible database[10], which would provide data to clients over the Web. On the client side, at the present time it is not clear if a special Web browser must be licensed, or if the XML-compatibility of current Web browsers will be improved to the point where communication with the database will be possible.
One may ask why we do not use Java programming immediately in this pilot phase. We find Java a good option for future phases, but we must also plan for use on computers today. Java applications have the advantage of being platform-independent, but they operate on today’s computers painfully slowly – even the top-end ones. When one considers that many of the target users of endangered language CDs and/or Web applications will not have up-to-date computers (and also do not have Web access – especially those native-speaker communities) – then Java is, at present, an unattractive option. In five years, however, we would expect even basic computer systems to be fast enough to run Java applications. Thus, it is useful to plan to eventually convert a proprietary, platform-dependent database system like Filemaker to an XML-based Java application.
Dr. Limusishiden (Li Dechun) has had extensive experience collecting folklore in Huzhu. He is a fluent Monguor speaker, and, through a wide range of relatives and acquaintances, has access to Monguor-speaking areas throughout Huzhu and Ledu counties. He also has had much prior experience in recording folklore on audio recorder and transcribing the results.
Zhu Yongzhong, also Monguor, has been researching folklore topics in Minhe county since 1993. He has, at present, assembled many hours of both audio and video recordings of a wide range of folklore presentations from throughout the Minhe Monguor area. He has also transcribed approximately 500 pages of daola that we eventually hope to be able to publish in the original as well as English translation. Mr. Zhu’s supervision of many small-scale development projects throughout the Minhe Monguor region has provided him with many contacts. This, and his fluency in several local Chinese dialects and Monguor, and the time he has spent on Monguor linguistics, make him a particularly valuable member of our team.
Dr. Stuart, in cooperation with local researchers, has probably done more for endangered languages and cultures in Qinghai, China than anyone else past or present. He regularly encourages local colleagues and students to document their own cultures, and has on numerous occasions provided them with recording equipment to document rituals. He is the co-author of the first book-length collection of Salar texts in IPA transcription, orthography, and translation, based on material collected by his native-speaker co-authors. In addition to Dr. Stuart’s endangered-language work, which brings him no monetary reward, he is also involved in non-governmental small-scale community development projects in the area, and has been an instructor of English since 1984 in Inner Mongolia and Qinghai, China.
Dr. Arienne Dwyer, the project coordinator, anchors the linguistic and computing end of the project. She has over a decade of field experience in Northwest China, including a two-year research stay in Eastern Turkestan and linguistic, folkloric, and ethnomusicological field work on Northwestern Chinese, on modern Uyghur dialects, on Kazakh, and especially Salar. Since 1992, she has initiated two long-term collaborative language-documentation projects on Salar and Uyghur dialects, funded overwhelmingly from her private means. For the latter two projects, Dr. Dwyer has developed a database system for text and audio data.
Since the Salar data for the current pilot project is already the result of earlier collaborative and solo research, to keep our numbers within reasonable bounds a Salar researcher has not been included in the pilot phase. However, we will consult with local Salar researchers as well.
Our computing consultant, Reinhard Hiß, M.A., has been the systems operator at the University of Mainz’s Sonderforschungsbereich project 295 since 1997. His specialty is multilingual, cross-platform computing solutions; he has developed an SGML system for the documentation of Ethiopian languages. He is also the developer of a database system for the Byzantinische Zeitschrift and published an index of the same in 1998.
Months 1–4: Data systems development
Months 5–6: Collaboration and fieldwork
Months 7–12: Data incorporation, realization of archival CD and test Web site
As of: 26 November 1999
Univ.-Prof. Dr. Lars Johanson, Professor (C4) für Turkologie Johannes Gutenberg-Universität Seminar für Orientkunde D-55099 Mainz, Germany johanson@mail.uni-mainz.de
[1] As an Oghuz/Southwestern Turkic language, Salar is most closely related to Turkmen, Azeri, and even Turkish, and less related to the Eastern and Central Turkic languages of Inner Asia, such as Uyghur and Kazakh.
[2] Monguor’s classification as a Southeastern (sometimes known as peripheral) Mongolic language groups it with Baonan, Daghur, Dongxiang, and Enger (Shera Yugur/Eastern Yugur), as distinct from central Mongolic languages such as Khalkha and Buriat.
[3] Both Monguor and Salar Latin-script orthographies were developed in the mid-1980s, largely for planned dictionaries, one of which was published (of Huzhu Monguor, Li 1988); they were never used in education.
[4] We define fluency here as not just conversational fluency, but also competence in other oral forms central to culture: storytelling, oral history, singing, oratory, ritual.
[5] E.g. Gandu Salar and Dahejia Monguor. Estimates based on fieldwork by Dwyer in 1992 and by Zhu, Üyediin Chuluu and Stuart (1995:200), respectively.
[6] Examples of Filemaker-based endangered-language Web sites world-wide include the Comparative Bantu Online Dictionary (http://www.linguistics.berkeley.edu/CBOLD/), Ingush (http://ingush.berkeley.edu:7012/), Maliseet-Passamaquoddy (http://ultratext.hil.unb.ca/Texts/Maliseet/dictionary/index.html), and Mambila (http://lucy.ukc.ac.uk/dz/connell/project.html).
[7] Such multimedia work is so time-consuming that projects are rarely completed: a quick perusal of language-data projects on the Web shows many which have not been worked on for a year or more.
[8] On parallel corpora, see Bonhomme et al. 1997; for an index of projects and tools, see the Linguistic Data Consortium.
[9] Unicode compatibility is essential for multilingual projects, especially those working with 16-bit fonts (we use e.g. Chinese). SGML, as a frequently-used coding in linguistics, can be converted into XML and vice-versa, when needed.
[10] E.g. Poet, Adabas, Sybase for internet applications; the decision on the particular software would be made at project start.