{"created":"2021-03-01T06:22:11.937728+00:00","id":5501,"links":{},"metadata":{"_buckets":{"deposit":"77355ec5-3d1d-48ef-9210-49898a2eb293"},"_deposit":{"id":"5501","owners":[],"pid":{"revision_id":0,"type":"depid","value":"5501"},"status":"published"},"_oai":{"id":"oai:repository.dl.itc.u-tokyo.ac.jp:00005501","sets":["6:209:392","9:233:280"]},"item_7_alternative_title_1":{"attribute_name":"その他のタイトル","attribute_value_mlt":[{"subitem_alternative_title":"音響空間からジェスチャ空間への写像に基づくリアルタイム音声生成系におけるジェスチャ設計"}]},"item_7_biblio_info_7":{"attribute_name":"書誌情報","attribute_value_mlt":[{"bibliographicIssueDates":{"bibliographicIssueDate":"2012-03-22","bibliographicIssueDateType":"Issued"},"bibliographic_titles":[{}]}]},"item_7_date_granted_25":{"attribute_name":"学位授与年月日","attribute_value_mlt":[{"subitem_dategranted":"2012-03-22"}]},"item_7_degree_grantor_23":{"attribute_name":"学位授与機関","attribute_value_mlt":[{"subitem_degreegrantor":[{"subitem_degreegrantor_name":"University of Tokyo (東京大学)"}]}]},"item_7_degree_name_20":{"attribute_name":"学位名","attribute_value_mlt":[{"subitem_degreename":"博士(工学)"}]},"item_7_description_5":{"attribute_name":"抄録","attribute_value_mlt":[{"subitem_description":"Nowadays, most of speech synthesizers are those which require symbol inputs, such as TTS (Text-to-Speech) converters. The quality of synthesized speech sample produced by those speech synthesizers is improving. However, it still has some drawbacks, for example, in emotional speech synthesis or in expressive pitch control. On the other hand, synthesis methods which do not require symbol inputs, such as articulatory synthesis, are effective for continuous speech synthesis and pitch control based on dynamic body motion. Therefore they attract research interest and several applications have been proposed. A dysarthric engineer, Ken-ichiro Yabu, developed a unique speech generator by using a pen tablet. The F1-F2 plane is embedded in the tablet. The pen position controls F1 and F2 of vowel sounds and the pen pressure controls their energy. Another example of speech generation from body motions is Glove Talk proposed by Sidney Fels. With two data gloves and some additional devices equipped to the user, body motions are transformed into parameters for a formant speech synthesizer. In this study, we consider the process of speech production as media conversion from body motions to sound motions. Recently, GMM-based speaker conversion techniques have been intensively studied, where the voice spaces of two speakers are mapped to each other and the mapping function is estimated based on a GMM. This technique was directly and successfully applied to estimate a mapping function between a space of tongue gestures and other speech sounds. This result naturally makes us expect that a mapping function between hand gestures and speech can be estimated as well. People usually use tongue gesture transitions to generate a speech stream. But previous works showed that tongue gestures, which are inherently mapped to speech sounds, are not always required to speak. What is needed is a voluntarily movable part of the body whose gestures can be technically mapped to speech sounds. However, Yabu and Fels use classical synthesizers, i.e. formant synthesizers. Partly inspired by the remarkable progress of voice conversion techniques and voice morphing techniques in this decade, we are developing a GMM-based Hand-to-Speech conversion system (H2S system). Unlike the current techniques, our new synthesis method does not limit the input media. 
Therefore, our technique should be useful in assistive technology, where devices are tuned from person to person, and in the performing arts, where people pursue the human capability for expression. In this study, we focus on the design of the system. As an initial trial, a mapping between hand gestures and Japanese vowel sounds was estimated so that the topological features of the selected gestures in a feature space and those of the five Japanese vowels in a cepstrum space become equivalent. Experiments showed that the special glove can generate good Japanese vowel transitions with voluntary control of duration and articulation. We also discussed how to extend this framework to consonants. The challenge here was to find appropriate gestures for consonant sounds when the gesture design for vowels is given. We found that inappropriate gesture designs for consonants result in a lack of smoothness in the transitional segments of synthesized speech. We considered the reasons to be that (1) the positional relation between vowels and consonants in the gesture space and that in the speech space were not equivalent, and (2) the parallel data for the transitions from consonants to vowels did not correspond well. To solve these problems, we developed a Speech-to-Hand conversion system (S2H system, the inverse of the H2S system), trained on parallel data for vowels only, to infer the gestures corresponding to consonants. Listening evaluations showed that an H2S system trained with consonant gesture data derived from an S2H system generates more natural sounds than one trained with a heuristic gesture design for consonants. However, such natural speech was obtained only when the input gestures were identical to those generated by the S2H system. The S2H system sometimes output gestures whose dynamic range was too large or which were not smooth enough, and in those cases it was difficult for users to form the gestures in a realistic time. In this thesis, we compensated for these problems in two ways: (1) reducing the dynamic range by setting an optimal weight for the gesture model, and (2) smoothing the gesture trajectories by considering delta features. Exploiting parallel data for consonants derived from an S2H system, we also implemented a real-time Hand-to-Speech conversion system and evaluated its effectiveness. Subjective user evaluations showed that almost half of the phonemes generated by our H2S system are perceived correctly and that the system is effective enough to generate emotional speech.","subitem_description_type":"Abstract"},{"subitem_description":"Speech synthesis techniques can be broadly divided into methods that take characters or symbols as input, typified by TTS, and methods that do not rely on characters or symbols, typified by articulatory synthesis. Compared with the former, the latter has attracted attention for its effectiveness in generating smooth synthetic speech based on continuous motion and in real-time control of the duration and pitch of synthesized speech, and various applications have been proposed, such as artistic singing-voice generation, educational uses, and support for people with dysarthria. 
In this study, as speech generation that does not rely on characters or symbols, we propose a new system that generates speech directly from body motions other than those of the articulatory organs.","subitem_description_type":"Abstract"}]},"item_7_dissertation_number_26":{"attribute_name":"学位授与番号","attribute_value_mlt":[{"subitem_dissertationnumber":"甲第27932号"}]},"item_7_full_name_3":{"attribute_name":"著者別名","attribute_value_mlt":[{"nameIdentifiers":[{"nameIdentifier":"11460","nameIdentifierScheme":"WEKO"}],"names":[{"name":"國越, 晶"}]}]},"item_7_identifier_registration":{"attribute_name":"ID登録","attribute_value_mlt":[{"subitem_identifier_reg_text":"10.15083/00005492","subitem_identifier_reg_type":"JaLC"}]},"item_7_select_21":{"attribute_name":"学位","attribute_value_mlt":[{"subitem_select_item":"doctoral"}]},"item_7_subject_13":{"attribute_name":"日本十進分類法","attribute_value_mlt":[{"subitem_subject":"547","subitem_subject_scheme":"NDC"}]},"item_7_text_22":{"attribute_name":"学位分野","attribute_value_mlt":[{"subitem_text_value":"Engineering (工学)"}]},"item_7_text_24":{"attribute_name":"研究科・専攻","attribute_value_mlt":[{"subitem_text_value":"Department of Electrical Engineering and Information Systems, Graduate School of Engineering (工学系研究科電気系工学専攻)"}]},"item_7_text_27":{"attribute_name":"学位記番号","attribute_value_mlt":[{"subitem_text_value":"博工第7700号"}]},"item_7_text_4":{"attribute_name":"著者所属","attribute_value_mlt":[{"subitem_text_value":"東京大学大学院工学系研究科電気系工学専攻"},{"subitem_text_value":"Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo"}]},"item_creator":{"attribute_name":"著者","attribute_type":"creator","attribute_value_mlt":[{"creatorNames":[{"creatorName":"Kunikoshi, Aki"}],"nameIdentifiers":[{"nameIdentifier":"11459","nameIdentifierScheme":"WEKO"}]}]},"item_files":{"attribute_name":"ファイル情報","attribute_type":"file","attribute_value_mlt":[{"accessrole":"open_date","date":[{"dateType":"Available","dateValue":"2017-06-01"}],"displaytype":"detail","filename":"37097093.pdf","filesize":[{"value":"11.3 MB"}],"format":"application/pdf","licensetype":"license_note","mimetype":"application/pdf","url":{"label":"37097093.pdf","url":"https://repository.dl.itc.u-tokyo.ac.jp/record/5501/files/37097093.pdf"},"version_id":"ca5fc897-4437-4484-8a91-49fe0be183ef"}]},"item_language":{"attribute_name":"言語","attribute_value_mlt":[{"subitem_language":"eng"}]},"item_resource_type":{"attribute_name":"資源タイプ","attribute_value_mlt":[{"resourcetype":"thesis","resourceuri":"http://purl.org/coar/resource_type/c_46ec"}]},"item_title":"Gesture design for a real-time gesture-to-speech conversion system based on space mapping between a gesture space and an acoustic space","item_titles":{"attribute_name":"タイトル","attribute_value_mlt":[{"subitem_title":"Gesture design for a real-time gesture-to-speech conversion system based on space mapping between a gesture space and an acoustic space"}]},"item_type_id":"7","owner":"1","path":["280","392"],"pubdate":{"attribute_name":"公開日","attribute_value":"2014-02-24"},"publish_date":"2014-02-24","publish_status":"0","recid":"5501","relation_version_is_last":true,"title":["Gesture design for a real-time gesture-to-speech conversion system based on space mapping between a gesture space and an acoustic space"],"weko_creator_id":"1","weko_shared_id":null},"updated":"2022-12-19T03:46:52.727083+00:00"}