STANDARDIZATION OF SPEECH CORPUS

Li Ai-jun, Yin Zhi-gang

Institute of Linguistics, Chinese Academy of Social Sciences, Beijing, 100732

ABSTRACT

A speech corpus is the basis both for analyzing the characteristics of the speech signal and for developing speech synthesis and recognition systems. As computing power and speech technology advance, corpus content becomes more complex and corpus size grows larger. Chinese speech corpora can be categorized according to content, speaking style, channel property, phonetic coverage, dialectal accent or application area.

In mainland China, speech corpus production has received long-term support from various national funds, such as the 863 Hi-tech Research and Development Program, the 973 Program and the National Science Foundation of China. Moreover, almost all speech research and development institutions are developing their own speech corpora.

Chinese speech corpora are now so numerous and so varied that being able to share them conveniently is important, both to avoid wasting time and money and to make research more efficient. One obstacle to sharing these corpora is the lack of general specifications for corpus collection, annotation and distribution.

The primary goal of this research is to establish a standard procedure for speech corpus production, so that corpora can be built more efficiently and used or shared more easily.

RASC863 (Regional Accent Speech Corpus funded by the National 863 Project), a large speech corpus of Chinese spoken with 10 regional accents, is introduced to illustrate the standardization of speech corpus production.

A Chinese regional accent speech corpus is a corpus of spoken Chinese, a language that comprises many regional variants called dialects. Although these dialects share a common written form, they are to a large extent mutually unintelligible. There are 10 major dialects in China: Mandarin, Jin, Wu, Hui, Xiang, Gan, Hakka, Yue, Min and PingHua. Mandarin, referred to as the common language, covers a very large area from the northeast to the southwest of China and has over 800 million speakers. People from different dialectal areas might not be able to communicate with each other, simply because the differences among the dialects are significant. Most people in China are ‘bilingual’ speakers who can speak both their native dialect and Mandarin. Although many people can speak Mandarin, their accents vary depending on how well they have mastered the common language. The Mandarin they speak is always affected by their native dialect phonologically, lexically and syntactically.

The standardization system of the RASC863 speech corpus consists of two parts: the steps of speech corpus production, and the specifications that should be followed in these steps.

Generally speaking, corpus production can be divided into 9 steps: drawing up the various specifications, preparing for collection, pre-collection, pre-validation, the actual collection, annotation, compiling lexical dictionaries or lexical frequency tables, post-validation and distribution.
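As an illustration only, the nine steps listed above could be modeled as an ordered enumeration. The following is a minimal sketch, not part of the RASC863 tooling: the names `ProductionStep`, `can_start` and `next_step` are hypothetical, and the rule that a step may begin only after all earlier steps are finished is an assumption made for the example.

```python
from enum import IntEnum


class ProductionStep(IntEnum):
    """The nine production steps, numbered in the order the text lists them."""
    SPECIFICATION = 1        # drawing up the various specifications
    PREPARATION = 2          # preparing for collection
    PRE_COLLECTION = 3
    PRE_VALIDATION = 4
    COLLECTION = 5           # the actual collection
    ANNOTATION = 6
    LEXICON_COMPILATION = 7  # lexical dictionaries or frequency tables
    POST_VALIDATION = 8
    DISTRIBUTION = 9


def can_start(step, completed):
    """Assumed rule: a step may start only when every earlier step is done."""
    return all(earlier in completed
               for earlier in ProductionStep if earlier < step)


def next_step(completed):
    """Return the first step that has not been completed, or None if all are."""
    for step in ProductionStep:
        if step not in completed:
            return step
    return None
```

For example, with only `SPECIFICATION` and `PREPARATION` completed, `next_step` returns `ProductionStep.PRE_COLLECTION`, and `can_start(ProductionStep.ANNOTATION, ...)` is false until the collection steps are finished.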

The primary specifications are as follows. The speaker specification describes speaker information such as age, sex, education and accent. The prompt design specification describes the rules of the prompt design process, the speaking type, and the phonetic and linguistic requirements. The recording specification describes the recording software, the recording equipment and the acoustic environment. The data specification describes the format and indexing of the data. The annotation specification describes the annotation system. Legal documents are also included. The validation specification describes how the value of the corpus is evaluated. The distribution specification describes the plan, rules and media (CD/DVD) of distribution.
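To make the speaker specification concrete, a speaker entry might be stored as a small record and checked before collection begins. This is a hypothetical sketch: the field names, the `F`/`M` coding, the age bounds, and the use of the ten major dialect names as accent labels are all assumptions for illustration, not the actual RASC863 format.

```python
from dataclasses import dataclass

# Assumption: accent labels reuse the ten major dialect names from the text.
ACCENT_REGIONS = {"Mandarin", "Jin", "Wu", "Hui", "Xiang",
                  "Gan", "Hakka", "Yue", "Min", "PingHua"}


@dataclass(frozen=True)
class SpeakerRecord:
    """Hypothetical speaker entry with the fields the specification names."""
    speaker_id: str
    age: int
    sex: str        # assumed coding: "F" or "M"
    education: str
    accent: str


def validate_speaker(record):
    """Return a list of problems found in a record (empty if it looks valid)."""
    problems = []
    if not record.speaker_id:
        problems.append("speaker_id is empty")
    if not 0 < record.age < 120:
        problems.append("age out of range")
    if record.sex not in {"F", "M"}:
        problems.append("sex must be 'F' or 'M'")
    if record.accent not in ACCENT_REGIONS:
        problems.append("unknown accent region")
    return problems
```

Returning a list of problems rather than raising on the first error lets a validation pass report every issue in a speaker table at once.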

Key words: phonetics, speech corpus, standardization, production, specification

Name: Li-Aijun

Sex: Female

Birth: Oct. 1966

Title: Professor

Duty: Director of Phonetics Laboratory, Institute of Linguistics, Chinese Academy of Social Sciences; Secretary-general of Phonetics Society of China

Research Interest: Speech prosody, expressive speech, accented speech in L2, speech corpus production and annotation.

E-mail: liaj@cass.org.cn, Tel: +86-10-65237408

Name: Yin-Zhigang

Sex: Male

Birth: Oct. 1977

Title: Research Assistant

Research Interest: speech prosody, speech corpus production and annotation

E-mail: yinzhg@cass.org.cn, Tel: +86-10-65237408

===================================================================

RESUME (paragraph form)

Li-Aijun: Director and professor of the Phonetics Laboratory, Institute of Linguistics, Chinese Academy of Social Sciences. She is also the secretary-general of the Phonetics Society of China. She holds a master's degree from Tianjin University, where she studied in both the Department of Computer Science and Engineering and the Department of Electronic Engineering. Her early research was on phonetics-oriented, Klatt-like speech synthesis for Standard Chinese. Her recent academic interests focus on speech prosody, expressive speech, accented speech in L2, and speech corpus production and annotation.

E-mail: liaj@cass.org.cn, Tel: +86-10-65237408

Yin-Zhigang: Research assistant at the Phonetics Laboratory, Institute of Linguistics, Chinese Academy of Social Sciences. His academic interests focus on speech prosody and on speech corpus production and annotation.

E-mail: yinzhg@cass.org.cn, Tel: +86-10-85195394