UTokyo Repository, The University of Tokyo

To link to this page (thesis), please use the following URL: http://hdl.handle.net/2261/50460

Title: Chinese Dialect-Based Speaker Classification and Pronunciation Assessment Using Structural Representation of Speech
Other title: 音声構造表象を用いた中国語方言に基づく話者分類と発音評価
Author: Ma, Xuebin
Author (other language): 馬, 学彬
Issue date: September 27, 2010
Abstract: In modern speech processing, the segmental features of speech are usually represented acoustically by the spectrum, which carries not only linguistic information but also extra-linguistic information related to age, gender, speaker identity, microphone, and so on. To classify speakers purely by dialect from their utterances, the classification should focus on dialectal differences alone, and the extra-linguistic features should be removed or canceled. Automatic speech recognition faces a very similar problem: it needs linguistic features of speech that are invariant or robust to extra-linguistic factors. The usual approach there is to build so-called speaker-independent models by collecting data from many speakers in an attempt to cover all extra-linguistic variation. In some linguistic studies, normalization techniques are used so that the vowel realizations of different speakers can be compared in linguistically and sociolinguistically meaningful ways. However, these methods may not work well for Chinese dialect-based speaker classification, where linguistic features invariant to extra-linguistic factors must be extracted from the dialect utterances of individual speakers. In our previous work, a structural representation of speech was proposed that extracts speech contrasts, or dynamics, by removing extra-linguistic features from speech; it has already been applied to speech recognition, speech synthesis, and support for Japanese learners of English. In this study, the structural method is further applied to the representation of Chinese dialect pronunciations, and dialect-based speaker classification is achieved by building comparable dialect structures that capture the speaker-invariant, purely linguistic features of Chinese dialects.
First, based on the phonological features of Chinese dialects, utterances of syllable units (characters) are proposed as the reading material for building pronunciation structures. Several lists of written Chinese characters, originally proposed by Chinese dialectologists to check the dialect pronunciation of different speakers, are then adopted as reading material for building dialect-sensitive, comparable dialect pronunciation structures. Next, using each speaker's dialectal utterances of the reading material, a dialect pronunciation structure is built for that speaker by calculating the Bhattacharyya distance (BD) between the distributions of every pair of his or her utterances. Because the Bhattacharyya distance is invariant to affine transformations, and extra-linguistic features act as affine transformations in spectral space, the resulting dialect pronunciation structure is invariant to the extra-linguistic features in speech. Speaker-invariant, dialect-based speaker classification can therefore be achieved by building dialect pronunciation structures for the speakers and calculating the distances between those structures. To verify this proposal, several classification experiments are carried out. The first is a dialect-based speaker classification experiment. Because publicly available Chinese dialect corpora cover only two or three dialects and cannot be used for this problem, a new database of Chinese dialects is built, and the dialect data of 17 speakers are recorded. All the data are labeled manually, and the syllables are segmented and converted into distributions. For every speaker, the BDs between every pair of his or her utterances are then calculated and the pronunciation structure is built. Finally, the speaker classification experiment is carried out by calculating the distances between the speakers' pronunciation structures.
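The construction described above can be sketched in a few lines. The abstract does not state the feature dimensionality, the form of the utterance distributions, or the metric used to compare two structures, so the following is only an illustration under assumptions: each utterance is modeled as a 2-D Gaussian (for brevity), the closed-form Bhattacharyya distance between Gaussians is used, and two structures are compared by the root-mean-square difference of their pairwise distances. All function names and numbers are hypothetical.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between two 2-D Gaussians."""
    s = [[(cov1[i][j] + cov2[i][j]) / 2 for j in range(2)] for i in range(2)]
    ds = det2(s)
    inv = [[s[1][1] / ds, -s[0][1] / ds], [-s[1][0] / ds, s[0][0] / ds]]
    d = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    maha = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return maha / 8 + 0.5 * math.log(ds / math.sqrt(det2(cov1) * det2(cov2)))

def structure(gaussians):
    """Pairwise BDs (upper triangle) among one speaker's utterance distributions."""
    n = len(gaussians)
    return [bhattacharyya(*gaussians[i], *gaussians[j])
            for i in range(n) for j in range(i + 1, n)]

def structure_distance(s1, s2):
    """RMS difference between two structures (an assumed comparison metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s1, s2)) / len(s1))

def affine(gaussian, A, b):
    """Apply x -> A x + b to a Gaussian: mean -> A mu + b, cov -> A cov A^T."""
    mu, cov = gaussian
    new_mu = [A[i][0] * mu[0] + A[i][1] * mu[1] + b[i] for i in range(2)]
    ac = [[sum(A[i][k] * cov[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    new_cov = [[sum(ac[i][k] * A[j][k] for k in range(2)) for j in range(2)]
               for i in range(2)]
    return new_mu, new_cov

# Hypothetical utterance distributions (mean, covariance) for one speaker.
speaker_a = [
    ([0.0, 0.0], [[1.0, 0.2], [0.2, 1.0]]),
    ([2.0, 1.0], [[0.5, 0.0], [0.0, 1.5]]),
    ([-1.0, 3.0], [[2.0, -0.3], [-0.3, 0.8]]),
]
# The same utterances seen through an arbitrary invertible affine map,
# standing in for an extra-linguistic (speaker) difference.
A, b = [[1.5, 0.4], [-0.2, 0.9]], [3.0, -1.0]
transformed = [affine(g, A, b) for g in speaker_a]
# A second hypothetical speaker whose utterances are arranged differently.
speaker_b = [
    ([0.0, 0.0], [[1.0, 0.2], [0.2, 1.0]]),
    ([5.0, 1.0], [[0.5, 0.0], [0.0, 1.5]]),
    ([-1.0, 6.0], [[2.0, -0.3], [-0.3, 0.8]]),
]

print(structure_distance(structure(speaker_a), structure(transformed)))
print(structure_distance(structure(speaker_a), structure(speaker_b)))
```

The first printed value is zero up to floating-point rounding, because every pairwise distance survives the affine map; the second is clearly non-zero, because the arrangement of the utterances themselves differs.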
The result shows that the speakers are well classified by dialect and that the result is independent of extra-linguistic features such as the gender and age of the speakers. The structural method is then verified in a sub-dialect-based speaker classification experiment. A new sub-dialect database is built first, for which the sub-dialect data of 16 speakers from 4 sub-dialect regions of Mandarin are recorded. Using the same method as in the previous experiment, sub-dialect pronunciation structures are built, and the speakers are classified by calculating the distances between their pronunciation structures. The result shows that speakers from the same dialect city are all clustered together and that speakers from the same sub-dialect region are also mostly classified close to one another, with one exception: the 4 speakers from the ZhongYuan sub-dialect region are split across two different sub-trees. Two possible reasons are discussed. First, these speakers are also graduate students in Tianjin, and their sub-dialects may have been affected to different degrees by the sub-dialect spoken there. Second, the traditional linguistic classification of these sub-dialects is based on several different features of the whole syllable, whereas our structural classification focuses only on the acoustic features of the finals. However, neither of these possible reasons can be proved. A new evaluation method is therefore proposed to show that dialect-based speaker classification with our structural method is not affected by the characteristics of the speakers. To show that the method classifies speakers by extracting speaker-invariant linguistic features regardless of which dialect they speak, new comparison experiments are designed using original dialect data and mimicked dialect data with minimum speaker differences.
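The sub-trees mentioned above suggest that speakers are grouped by a tree-style (hierarchical) clustering of the distances between their structures, although the abstract does not name the algorithm. As a rough sketch only, the following assumes average-linkage agglomerative clustering over a precomputed speaker-by-speaker distance matrix. The function name, speaker labels, and distance values are hypothetical and are not taken from the thesis.

```python
def average_linkage(dist, labels):
    """Agglomerative clustering; returns the final tree as nested tuples."""
    clusters = {i: [i] for i in range(len(labels))}   # cluster id -> member indices
    trees = {i: labels[i] for i in range(len(labels))}  # cluster id -> subtree
    next_id = len(labels)
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        # Find the pair of clusters with the smallest average distance.
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair into a new cluster and a new subtree.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        trees[next_id] = (trees.pop(a), trees.pop(b))
        next_id += 1
    return trees[next_id - 1]

# Hypothetical structure distances among four speakers of two dialects.
labels = ["A1", "A2", "B1", "B2"]
dist = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.95],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.95, 0.20, 0.00],
]
tree = average_linkage(dist, labels)
print(tree)
```

With these made-up distances, the two "A" speakers and the two "B" speakers each form their own sub-tree before the final merge, which is the kind of grouping the experiments look for.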
For these experiments, I made new recordings in China, collecting data from speakers of 10 sub-dialects in 5 dialect regions. Every utterance in this data set is then linguistically mimicked by an expert in Chinese dialects, which yields a new data set with a fixed speaker identity (minimum speaker differences). Dialect-based speaker classification experiments are carried out on the original and the mimicked data separately. The two results are almost the same, although one is obtained from dialect data spoken by different speakers and the other from dialect data with a fixed speaker identity. This means that our structural method of classifying speakers by dialect is indeed invariant to speakers. Our structural pronunciation comparison is also compared with conventional spectral comparison, using data sets with maximum speaker differences. First, the original and mimicked dialect data used above are converted so that they sound as if they were pronounced by a very tall speaker and by a very short speaker, which yields new data sets with maximum speaker differences. Classification experiments based on spectral comparison are then carried out on these data. The results show that the classification is strongly affected by speaker characteristics. When the same speakers are classified with our structural method, they are well classified by dialect, and the result is not affected by the speaker differences at all. This again shows that our method classifies speakers by dialect by extracting purely linguistic features and that, unlike conventional spectral comparison, its result is not affected by speaker characteristics. Further, the structural method is applied to estimating the order of utterance similarity between two speakers.
Using the dialect data of 2 Min speakers of different genders and the data of 2 standard Mandarin speakers of different genders, experiments are carried out to estimate the utterance similarity orders among them with our structural method. The results show that very similar similarity orders are obtained for dialect speakers from the same dialect region and that the results are robust to the gender of the speakers. The structural method is also applied to pronunciation assessment of accented Mandarin. First, accented Mandarin pronunciation structures are built and compared with the pronunciation structures of standard Mandarin, and a structural score is obtained for every utterance. These utterances are then evaluated manually, and the manually evaluated scores are compared with the structural scores. Meanwhile, the data are recognized by a newly built Mandarin recognizer, and the recognition results are compared with the above two scores. Although some correlation can be found in the results, the correlation coefficients between these scores are not satisfactory. Sub-structures are therefore built to assess the accented Mandarin pronunciations. By adding or deleting utterances to build sub-structures, the pronunciations of accented Mandarin speakers are compared with those of standard Mandarin speakers, and the best correlation coefficient obtained is about 0.4. The work described above shows that the structural pronunciation representation can extract speaker-invariant, purely linguistic features and classify Chinese dialect speakers by dialect. We plan to apply this approach to drawing a new Chinese dialect atlas by calculating the acoustic distances among Chinese dialects; the result can be further applied to multi-dialect speech processing.
Furthermore, if more standard Mandarin pronunciation data and well-labeled accented Mandarin pronunciation data are obtained, I also want to continue the study of accented Mandarin pronunciation assessment using the sub-structure method.
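The comparison between manual scores and structural scores in the pronunciation assessment experiment is reported as a correlation coefficient. The abstract does not say which coefficient was used; the sketch below assumes the Pearson coefficient. The function name and the score values are made up purely for illustration and are not data from the thesis.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for five utterances (illustration only).
manual_scores = [4.0, 3.0, 5.0, 2.0, 4.0]
structural_scores = [0.8, 0.9, 0.4, 1.1, 0.9]
r = pearson(manual_scores, structural_scores)
print(r)
```

The sign of the coefficient depends on whether the structural score is defined as a similarity or as a distance from the standard Mandarin structure, so the magnitude is the more informative part when comparing scoring schemes.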
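Description: Report number: 甲26432; Date of degree conferral: 2010-09-27; Degree category: Course doctorate (課程博士); Degree type: Doctoral degree in Science (博士(科学)); Diploma number: 博創域第622号; Graduate school and department: Department of Frontier Informatics, Division of Transdisciplinary Sciences, Graduate School of Frontier Sciences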
URI: http://hdl.handle.net/2261/50460
Appears in categories: 021 Doctoral Theses
1221320 Doctoral Theses (Department of Frontier Informatics, Division of Transdisciplinary Sciences)


File          Description   Size      Format
K-02961.pdf   Full text     5.96 MB   Adobe PDF


