Structural Understanding of Instruction Videos by Integrating Linguistic and Visual Information

https://doi.org/10.15083/00002379
Name / File shibata.pdf (15.2 MB)
Item type Thesis or Dissertation(1)
Publication date 2012-03-01
Title Structural Understanding of Instruction Videos by Integrating Linguistic and Visual Information
Language eng
Resource type thesis (http://purl.org/coar/resource_type/c_46ec)
ID registration 10.15083/00002379
ID registration type JaLC
Alternative title 言語情報と映像情報の統合による作業教示映像の構造的理解
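Author Shibata, Tomohide (WEKO 6612)
Author alternative name 柴田, 知秀 (identifier scheme WEKO, identifier 6613)
Author affiliation Department of Information and Communication Engineering, Graduate School of Information Science and Technology (大学院情報理工学系研究科電子情報学専攻)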
Abstract
Description type Abstract
Description To perform real-world information processing, such as intelligent robotics, multimodal dialogue systems, and video processing, it is essential to integrate several media processing techniques, such as natural language processing, speech recognition, and image analysis. From the viewpoint of natural language processing, since language in the real world strongly depends on the scene, it is important to understand utterances in accordance with the scene. This thesis focuses on handling video content. Among the several types of videos, instruction videos (how-to videos) about sports, cooking, D.I.Y., and other activities are the most valuable; of these, we focus on cooking TV programs. In realizing flexible utilization of and access to video content, the crucial point is the structural understanding of the content, which requires interpreting utterances based on wider contexts, including the scene.

Chapter 2 describes the basic linguistic analysis of cooking instruction utterances (closed caption texts). First, we perform anaphora resolution, which is indispensable for detecting the discourse structure and for aligning linguistic information with visual information. We build an anaphora resolution system based on large-scale case frames. Next, we detect the utterance type of each clause of an utterance. In cooking instruction utterances, explanations of actions are dominant, but there are several other types of utterances, such as declarations of the beginning of a series of actions, tips on actions, and notes. We classify cooking instruction utterances and recognize the utterance type by clause-end patterns. Then, we analyze the discourse structure of the instruction utterances. This analysis integrates the anaphora resolution result, the utterance type, and generic discourse structure rules, which consider cue phrases and word chaining.

Chapter 3 proposes an unsupervised topic identification method that integrates linguistic and visual information based on Hidden Markov Models (HMMs). Identified topics lead to video segmentation and summarization and are used for automatically acquiring the object models described in Chapter 4. We employ HMMs for topic identification, wherein a state corresponds to a topic and various features, including linguistic, visual, and audio information, are observed. This study considers a clause as the unit of analysis and the following eight topics as the set of states: preparation, sauteing, frying, baking, simmering, boiling, dishing up, and steaming. The basic linguistic feature is a case frame, which is a generalization of utterances referring to an action, such as "ireru (add)" and "kiru (cut)". Furthermore, we incorporate into the case frame domain-independent discourse features, such as cue phrases and noun/verb chaining, which indicate topic change or persistence. We utilize visual and audio information to achieve robust topic identification. As for visual information, we can utilize the background color distribution of the image. As for audio information, silence can be utilized as a clue to a topic shift.

Chapter 4 presents a method for automatically acquiring object models from large amounts of video in order to perform object recognition. We first collect pairs of a close-up image and a keyword. Close-up images are extracted with edge detection; in each close-up image, region segmentation is performed and the salient region is determined by considering the area, the center of gravity, and the variance of the pixels in a region. A keyword is extracted from the instructor's utterances made when the close-up image appears. In the case of cooking, objects (i.e., ingredients) change their shape and color as cooking progresses. Consequently, good examples for object acquisition cannot be collected from video segments whose topic is sauteing or dishing up. Therefore, a keyword is extracted only from segments whose topic, as identified by the proposed method, is preparation. The importance score of each word is calculated according to the linguistic analysis results, such as the discourse structure analysis and utterance-type detection, and the word with the maximum score is extracted as the keyword. After collecting pairs of a close-up image and a keyword, the object model for each keyword is acquired by summing the RGB histograms of the salient regions. Next, we perform object recognition based on the acquired object models and the discourse structure. We can acquire object models of around 100 foods with an accuracy of 0.778, and the accuracy of object recognition is 0.727.

Chapter 5 describes our video retrieval system. In this system, a user can pose a query in natural language and browse the search results, which are similar to the user's query. To provide an accessible means of viewing the video, we generate a summary of the video. This summary generation is based on topic segmentation, important utterance extraction, the topic identification result, and the object recognition result.
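The object-model step summarized for Chapter 4 (summing RGB histograms over the salient regions paired with each keyword) can be illustrated with a minimal sketch. This is not the thesis implementation: the bin count `BINS`, the function names, the representation of a salient region as a list of `(r, g, b)` tuples, and the use of histogram intersection for recognition are all assumptions made for illustration, and the discourse-structure cue used in the thesis is omitted.

```python
from collections import defaultdict

BINS = 4  # hypothetical: quantization levels per RGB channel


def rgb_histogram(pixels, bins=BINS):
    """Count the pixels of one salient region in a bins**3 quantized RGB histogram."""
    hist = [0] * (bins ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index in 0..bins-1.
        ri, gi, bi = r * bins // 256, g * bins // 256, b * bins // 256
        hist[ri * bins * bins + gi * bins + bi] += 1
    return hist


def acquire_object_models(pairs, bins=BINS):
    """Sum the histograms of all salient regions that share the same keyword."""
    models = defaultdict(lambda: [0] * (bins ** 3))
    for keyword, pixels in pairs:
        hist = rgb_histogram(pixels, bins)
        models[keyword] = [m + h for m, h in zip(models[keyword], hist)]
    return dict(models)


def normalize(hist):
    """Scale a histogram so that its entries sum to 1 (unchanged if empty)."""
    total = sum(hist)
    return [h / total for h in hist] if total else list(hist)


def recognize(pixels, models, bins=BINS):
    """Return the keyword whose model best matches the region.

    Histogram intersection is an assumed similarity measure; the abstract
    does not state which measure the thesis uses.
    """
    query = normalize(rgb_histogram(pixels, bins))

    def score(keyword):
        model = normalize(models[keyword])
        return sum(min(q, m) for q, m in zip(query, model))

    return max(models, key=score)


# Toy (keyword, salient-region pixels) pairs; the values are made up.
pairs = [
    ("tomato", [(220, 30, 30), (200, 40, 35)]),
    ("tomato", [(210, 20, 25)]),
    ("cucumber", [(40, 180, 50), (30, 170, 60)]),
]
models = acquire_object_models(pairs)
print(recognize([(215, 35, 30), (205, 25, 20)], models))  # tomato
```

Summing raw counts before normalizing means that larger salient regions contribute more to a keyword's model, which is one possible reading of "summing RGB histograms"; normalizing each region first would weight regions equally.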
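Bibliographic information Date of issue 2007-03-22
Nippon Decimal Classification
Subject scheme NDC
Subject 007
Degree name Doctor of Information Science and Technology (博士(情報理工学))
Degree level doctoral
Degree field Information Science and Technology (情報理工学)
Degree-granting institution University of Tokyo (東京大学)
Graduate school and department Department of Information and Communication Engineering, Graduate School of Information Science and Technology (情報理工学系研究科電子情報学専攻)
Date degree granted 2007-03-22
Degree grant number 甲第22808号
Diploma number 博情第138号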
Versions

Ver.1 2021-03-01 19:58:10.481981
Powered by WEKO3