{"_buckets": {"deposit": "88decfe4-f921-4e2b-b7da-0961cad8247d"}, "_deposit": {"id": "2385", "owners": [], "pid": {"revision_id": 0, "type": "depid", "value": "2385"}, "status": "published"}, "_oai": {"id": "oai:repository.dl.itc.u-tokyo.ac.jp:00002385", "sets": ["280", "330"]}, "item_7_alternative_title_1": {"attribute_name": "その他のタイトル", "attribute_value_mlt": [{"subitem_alternative_title": "言語情報と映像情報の統合による作業教示映像の構造的理解"}]}, "item_7_biblio_info_7": {"attribute_name": "書誌情報", "attribute_value_mlt": [{"bibliographicIssueDates": {"bibliographicIssueDate": "2007-03-22", "bibliographicIssueDateType": "Issued"}, "bibliographic_titles": [{}]}]}, "item_7_date_granted_25": {"attribute_name": "学位授与年月日", "attribute_value_mlt": [{"subitem_dategranted": "2007-03-22"}]}, "item_7_degree_grantor_23": {"attribute_name": "学位授与機関", "attribute_value_mlt": [{"subitem_degreegrantor": [{"subitem_degreegrantor_name": "University of Tokyo (東京大学)"}]}]}, "item_7_degree_name_20": {"attribute_name": "学位名", "attribute_value_mlt": [{"subitem_degreename": "博士(情報理工学)"}]}, "item_7_description_5": {"attribute_name": "抄録", "attribute_value_mlt": [{"subitem_description": "To perform real-word information processing, such as intelligent robotics, multimodal dialogue system and video processing, it is essential to integrate several media processing technique such as natural language processing, speech recognition and image analysis. From the viewpoint of natural language processing, since language in the real world is strongly depends on the scene, it is important to understand utterances in accordance with the scene. This thesis focuses on handling video contents. Among several types of videos, in which instruction videos (how-to videos) about sports, cooking, D.I.Y., and others are the most valuable, we focus on cooking TV programs. 
In realizing flexible utilization/access of video contents, the crucial point is the structural understanding of their contents, which requires the interpretation of utterances based on wider contexts including the scene. Chapter 2 describes basic linguistic analysis of cooking instruction utterances (closed caption texts). First, we perform anaphora resolution, which is inevitable to detect the discourse structure or correspond linguistic information to visual information. We build an anaphora resolution system based on the large-scale case frame. Next, we detect utterance-type of a clause of each utterance. In cooking instruction utterances, while explanations of actions are dominant, there are several types of utterances such as declaration of beginning of series of actions, tips of actions, notes, etc. We classify cooking instruction utterance and recognize utterance-type by clause-end patterns. Then, we analyze the discourse structure of instruction utterances. This analysis is performed by integrating the anaphora resolution result, utterance-type and generic discourse structure rules, which consider cue phrases and word chaining. Chapter 3 proposes an unsupervised topic identification method integrating linguistic and visual information based on Hidden Markov Models (HMMs). Identified topics lead to video segmentation/summarization and are used for automatically acquiring the object models described in Chapter 4. We employ HMMs for topic identification, wherein a state corresponds to a topic and various features including linguistic, visual and audio information are observed. This study considers a clause as an unit of analysis and the following eight topics as a set of states: preparation, sauteing, frying, baking, simmering, boiling, dishing up, steaming. The basic linguistic feature is a case frame, which is a generalization of utterances referring to an action, such as ``ireru(add)\u0027\u0027and ``kiru(cut)\u0027\u0027. 
Furthermore, we incorporate domain-independent discourse features such as cue phrases, noun/verb chaining, which indicate topic change/persistence, into the case frame. We utilize visual and audio information to achieve robust topic identification. As for visual information, we can utilize background color distribution of the image. As for audio information, silence can be utilized as a clue to a topic shift. Chapter 4 presents a method for automatically acquiring object models from large amounts of video for performing object recognition. We first collect pairs of a close-up image and a keyword. Close-up images are extracted with edge detection and, in the close-up image, region segmentation is performed and the salient region is determined considering the following points: area, center of gravity and variance of pixels in a region. A keyword is extracted from instructor\u0027s utterances when the close-up image appears. In case of cooking, objects (i.e. ingredient) change their shape/color along with the progress of cooking. Consequently, good examples for object acquisition cannot be collected from video segments whose topic is sauteing or dishing up. Therefore, a keyword is extracted only from segments whose topic, which is identified by the proposed method, is preparation. The important score of each word is calculated according to the linguistic analysis result, such as the discourse structure analysis and utterance-type detection, and the word that has the maximum score is extracted as a keyword. After collecting pairs of a close-up image and a keyword, for each keyword, its object model is acquired by summing RGB histograms in the salient region. Next, we perform object recognition based on the acquired object model and the discourse structure. We can acquire the object model of around 100 foods and its accuracy is 0.778, and the accuracy of object recognition is 0.727. Chapter 5 describes our video retrieval system. 
In this system, a user can ask a query in natural language and can enjoy the search result, which is similar to the user\u0027s query. To present the accessible mean to the video, we generate a summary of the video. This analysis is based on topic segmentation, important utterances extraction, topic identification result, object recognition result.", "subitem_description_type": "Abstract"}]}, "item_7_dissertation_number_26": {"attribute_name": "学位授与番号", "attribute_value_mlt": [{"subitem_dissertationnumber": "甲第22808号"}]}, "item_7_full_name_3": {"attribute_name": "著者別名", "attribute_value_mlt": [{"nameIdentifiers": [{"nameIdentifier": "6613", "nameIdentifierScheme": "WEKO"}], "names": [{"name": "柴田, 知秀"}]}]}, "item_7_identifier_registration": {"attribute_name": "ID登録", "attribute_value_mlt": [{"subitem_identifier_reg_text": "10.15083/00002379", "subitem_identifier_reg_type": "JaLC"}]}, "item_7_select_21": {"attribute_name": "学位", "attribute_value_mlt": [{"subitem_select_item": "doctoral"}]}, "item_7_subject_13": {"attribute_name": "日本十進分類法", "attribute_value_mlt": [{"subitem_subject": "007", "subitem_subject_scheme": "NDC"}]}, "item_7_text_22": {"attribute_name": "学位分野", "attribute_value_mlt": [{"subitem_text_value": "Information Science and Technology (情報理工学)"}]}, "item_7_text_24": {"attribute_name": "研究科・専攻", "attribute_value_mlt": [{"subitem_text_value": "Department of Information and Communication Engineering, Graduate School of Information Science and Technology (情報理工学系研究科電子情報学専攻)"}]}, "item_7_text_27": {"attribute_name": "学位記番号", "attribute_value_mlt": [{"subitem_text_value": "博情第138号"}]}, "item_7_text_36": {"attribute_name": "資源タイプ", "attribute_value_mlt": [{"subitem_text_value": "Thesis"}]}, "item_7_text_4": {"attribute_name": "著者所属", "attribute_value_mlt": [{"subitem_text_value": "大学院情報理工学系研究科電子情報学専攻"}]}, "item_creator": {"attribute_name": "著者", "attribute_type": "creator", "attribute_value_mlt": [{"creatorNames": [{"creatorName": "Shibata, Tomohide"}], 
"nameIdentifiers": [{"nameIdentifier": "6612", "nameIdentifierScheme": "WEKO"}]}]}, "item_files": {"attribute_name": "ファイル情報", "attribute_type": "file", "attribute_value_mlt": [{"accessrole": "open_date", "date": [{"dateType": "Available", "dateValue": "2017-05-31"}], "displaytype": "detail", "download_preview_message": "", "file_order": 0, "filename": "shibata.pdf", "filesize": [{"value": "15.2 MB"}], "format": "application/pdf", "future_date_message": "", "is_thumbnail": false, "licensetype": "license_free", "mimetype": "application/pdf", "size": 15200000.0, "url": {"label": "shibata.pdf", "url": "https://repository.dl.itc.u-tokyo.ac.jp/record/2385/files/shibata.pdf"}, "version_id": "fa83ba1e-0d9d-4323-8131-5fe552c30a81"}]}, "item_language": {"attribute_name": "言語", "attribute_value_mlt": [{"subitem_language": "eng"}]}, "item_resource_type": {"attribute_name": "資源タイプ", "attribute_value_mlt": [{"resourcetype": "thesis", "resourceuri": "http://purl.org/coar/resource_type/c_46ec"}]}, "item_title": "Structural Understanding of Instruction Videos by Integrating Linguistic and Visual Information", "item_titles": {"attribute_name": "タイトル", "attribute_value_mlt": [{"subitem_title": "Structural Understanding of Instruction Videos by Integrating Linguistic and Visual Information"}]}, "item_type_id": "7", "owner": "1", "path": ["280", "330"], "permalink_uri": "https://doi.org/10.15083/00002379", "pubdate": {"attribute_name": "公開日", "attribute_value": "2012-03-01"}, "publish_date": "2012-03-01", "publish_status": "0", "recid": "2385", "relation": {}, "relation_version_is_last": true, "title": ["Structural Understanding of Instruction Videos by Integrating Linguistic and Visual Information"], "weko_shared_id": null}
Structural Understanding of Instruction Videos by Integrating Linguistic and Visual Information
https://doi.org/10.15083/00002379
| Name / File | License | Action |
|---|---|---|
| shibata.pdf (15.2 MB) | | |
| Item type | 学位論文 / Thesis or Dissertation |
|---|---|
| Publication date | 2012-03-01 |
| Title | Structural Understanding of Instruction Videos by Integrating Linguistic and Visual Information |
| Language | eng |
| Resource type | thesis (http://purl.org/coar/resource_type/c_46ec) |
| ID registration | 10.15083/00002379 (JaLC) |
| Alternative title | 言語情報と映像情報の統合による作業教示映像の構造的理解 |
| Author | Shibata, Tomohide |
| Author alias | 柴田, 知秀 (identifier: 6613, scheme: WEKO) |
| Author affiliation | Department of Information and Communication Engineering, Graduate School of Information Science and Technology (大学院情報理工学系研究科電子情報学専攻) |
| Abstract | To perform real-world information processing, such as intelligent robotics, multimodal dialogue systems, and video processing, it is essential to integrate several media-processing techniques such as natural language processing, speech recognition, and image analysis. From the viewpoint of natural language processing, since language in the real world strongly depends on the scene, it is important to understand utterances in accordance with the scene. This thesis focuses on handling video content. Among the several types of videos, instruction videos (how-to videos) about sports, cooking, D.I.Y., and so on are among the most valuable; we focus on cooking TV programs. In realizing flexible utilization of and access to video content, the crucial point is the structural understanding of the content, which requires the interpretation of utterances based on wider contexts, including the scene. Chapter 2 describes basic linguistic analysis of cooking instruction utterances (closed-caption texts). First, we perform anaphora resolution, which is indispensable for detecting the discourse structure and for relating linguistic information to visual information. We build an anaphora resolution system based on large-scale case frames. Next, we detect the utterance type of each clause. In cooking instruction utterances, while explanations of actions are dominant, there are several other types of utterances, such as declarations of the beginning of a series of actions, tips about actions, and notes. We classify cooking instruction utterances and recognize utterance types by clause-end patterns. Then, we analyze the discourse structure of the instruction utterances. This analysis is performed by integrating the anaphora resolution results, utterance types, and generic discourse structure rules, which consider cue phrases and word chaining. Chapter 3 proposes an unsupervised topic identification method that integrates linguistic and visual information based on Hidden Markov Models (HMMs). Identified topics lead to video segmentation and summarization and are used for automatically acquiring the object models described in Chapter 4. We employ HMMs for topic identification, wherein a state corresponds to a topic and various features, including linguistic, visual, and audio information, are observed. This study takes a clause as the unit of analysis and the following eight topics as the set of states: preparation, sauteing, frying, baking, simmering, boiling, dishing up, and steaming. The basic linguistic feature is a case frame, which is a generalization of utterances referring to an action, such as "ireru (add)" and "kiru (cut)". Furthermore, in addition to the case frame, we incorporate domain-independent discourse features, such as cue phrases and noun/verb chaining, which indicate topic change or persistence. We utilize visual and audio information to achieve robust topic identification: as visual information, the background color distribution of the image; as audio information, silence, which serves as a clue to a topic shift. Chapter 4 presents a method for automatically acquiring object models from large amounts of video for performing object recognition. We first collect pairs of a close-up image and a keyword. Close-up images are extracted with edge detection; within each close-up image, region segmentation is performed, and the salient region is determined by considering the area, center of gravity, and pixel variance of each region. A keyword is extracted from the instructor's utterances at the time the close-up image appears. In the case of cooking, objects (i.e., ingredients) change their shape and color as cooking progresses. Consequently, good examples for model acquisition cannot be collected from video segments whose topic is sauteing or dishing up. Therefore, keywords are extracted only from segments whose topic, as identified by the proposed method, is preparation. The importance score of each word is calculated according to the linguistic analysis results, such as discourse structure analysis and utterance-type detection, and the word with the maximum score is extracted as the keyword. After collecting pairs of a close-up image and a keyword, the object model for each keyword is acquired by summing the RGB histograms of the salient regions. Next, we perform object recognition based on the acquired object models and the discourse structure. We acquire object models for around 100 foods with an accuracy of 0.778, and the accuracy of object recognition is 0.727. Chapter 5 describes our video retrieval system, in which a user can pose a query in natural language and browse search results similar to the query. To provide an accessible means of viewing a video, we generate a summary of it, based on topic segmentation, important-utterance extraction, topic identification results, and object recognition results. |
| Bibliographic information | Issue date: 2007-03-22 |
| Nippon Decimal Classification (NDC) | 007 |
| Degree name | Doctor of Information Science and Technology (博士(情報理工学)) |
| Degree type | doctoral |
| Degree field | Information Science and Technology (情報理工学) |
| Degree-granting institution | University of Tokyo (東京大学) |
| Graduate school and department | Department of Information and Communication Engineering, Graduate School of Information Science and Technology (情報理工学系研究科電子情報学専攻) |
| Date of degree conferral | 2007-03-22 |
| Degree conferral number | 甲第22808号 |
| Diploma number | 博情第138号 |
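The abstract's Chapter 3 frames topic identification as HMM decoding: each hidden state is a cooking topic and each clause emits a feature symbol. The sketch below is a minimal illustration of that decoding step only. The two-state setup, the symbols, and every probability value are invented for the example; the thesis instead uses eight topic states trained unsupervised over case-frame, discourse, visual, and audio features.

```python
# Viterbi decoding over clause-level feature symbols. Hidden states are
# topics; observations are generalized verbs from case frames. All numbers
# below are toy values for illustration, not the thesis's trained model.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely topic sequence for the observed clause features."""
    # Probability of the best path ending in each state after the first clause.
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 1e-6) for s in states}]
    back = [{}]
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] * trans_p[p][s])
            col[s] = V[-1][prev] * trans_p[prev][s] * emit_p[s].get(o, 1e-6)
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return path[::-1]

states = ["preparation", "simmering"]
start_p = {"preparation": 0.9, "simmering": 0.1}
trans_p = {"preparation": {"preparation": 0.8, "simmering": 0.2},
           "simmering": {"preparation": 0.1, "simmering": 0.9}}
emit_p = {"preparation": {"kiru(cut)": 0.7, "ireru(add)": 0.3},
          "simmering": {"kiru(cut)": 0.1, "niru(simmer)": 0.6, "ireru(add)": 0.3}}
clauses = ["kiru(cut)", "kiru(cut)", "ireru(add)", "niru(simmer)"]
path = viterbi(clauses, states, start_p, trans_p, emit_p)
print(path)  # → ['preparation', 'preparation', 'simmering', 'simmering']
```

Note how the "ireru (add)" clause is pulled into the simmering topic by the transition model even though its emission probability is the same under both topics; this persistence effect is what makes HMM decoding more robust than classifying each clause in isolation.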
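Chapter 4's object-model acquisition (summing RGB histograms over the salient regions collected for each keyword, then matching a query region against the models) can be sketched roughly as follows. The bin count, the L1 matching distance, and all keywords and pixel values here are assumptions for illustration, not the thesis's actual settings.

```python
# Object models as summed RGB histograms of salient regions (Chapter 4
# sketch). A "region" is simplified to a flat list of (r, g, b) pixels.

def rgb_histogram(pixels, bins=4):
    """Coarse RGB histogram of a list of (r, g, b) pixels (values 0-255)."""
    step = 256 // bins
    h = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        h[(r // step) * bins * bins + (g // step) * bins + (b // step)] += 1
    return h

def normalize(h):
    total = sum(h) or 1.0
    return [v / total for v in h]

def build_models(samples):
    """samples: (keyword, salient-region pixels) pairs; sum histograms per keyword."""
    models = {}
    for keyword, pixels in samples:
        h = rgb_histogram(pixels)
        models[keyword] = ([a + b for a, b in zip(models[keyword], h)]
                           if keyword in models else h)
    return {k: normalize(h) for k, h in models.items()}

def recognize(models, pixels):
    """Return the keyword whose model is nearest (L1 distance) to the region."""
    q = normalize(rgb_histogram(pixels))
    return min(models, key=lambda k: sum(abs(a - b) for a, b in zip(models[k], q)))

# Toy close-up regions: two reddish "tomato" samples, one greenish "cucumber".
samples = [("tomato", [(210, 40, 40)] * 40),
           ("tomato", [(200, 50, 30)] * 40),
           ("cucumber", [(60, 180, 70)] * 40)]
models = build_models(samples)
print(recognize(models, [(205, 45, 35)] * 10))  # → tomato
```

Summing histograms over many close-up/keyword pairs averages out lighting and pose variation per ingredient, which is why the model survives the noisy keyword extraction described in the abstract.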