{"created":"2021-03-01T06:22:07.033245+00:00","id":5421,"links":{},"metadata":{"_buckets":{"deposit":"13d3e6b4-79fa-46b8-8375-16b0ddd7bbb1"},"_deposit":{"id":"5421","owners":[],"pid":{"revision_id":0,"type":"depid","value":"5421"},"status":"published"},"_oai":{"id":"oai:repository.dl.itc.u-tokyo.ac.jp:00005421","sets":["34:105:330","9:233:280"]},"item_7_alternative_title_1":{"attribute_name":"その他のタイトル","attribute_value_mlt":[{"subitem_alternative_title":"メトリック空間における類似検索の高速化に関する研究"}]},"item_7_biblio_info_7":{"attribute_name":"書誌情報","attribute_value_mlt":[{"bibliographicIssueDates":{"bibliographicIssueDate":"2011-03-24","bibliographicIssueDateType":"Issued"},"bibliographic_titles":[{}]}]},"item_7_date_granted_25":{"attribute_name":"学位授与年月日","attribute_value_mlt":[{"subitem_dategranted":"2011-03-24"}]},"item_7_degree_grantor_23":{"attribute_name":"学位授与機関","attribute_value_mlt":[{"subitem_degreegrantor":[{"subitem_degreegrantor_name":"University of Tokyo (東京大学)"}]}]},"item_7_degree_name_20":{"attribute_name":"学位名","attribute_value_mlt":[{"subitem_degreename":"博士(情報理工学)"}]},"item_7_description_5":{"attribute_name":"抄録","attribute_value_mlt":[{"subitem_description":"The purpose of my research is to reduce the query execution cost of a similarity search in metric spaces. Although a large number of studies have been made on indexing techniques for similarity searches, only a few exploit the data's distribution in a metric space. I developed a new partitioning scheme based on the data distribution that can prune objects more effectively while searching. Finding similar objects in a large dataset is a fundamental process in various applications, such as record linkage, and the Geographic Information System (GIS). Lowering the query execution cost of a similarity search can speed up these applications. Because metric-space similarity searches can be applied to all types of data whose distance obey metric space postulates such as the triangle inequality, they are very useful for applications that deal with huge amounts of vectors, strings, graphs, sets, and so on. Similarity search indexes are used for pruning objects dissimilar to a query, and they reduce distance computations and disk accesses. Most indexing schemes use pivots, which are reference objects. They recursively divide a region into subregions by using pivots and construct a tree structure index. They prune subregions while searching by using the triangle inequality. That is, the way of selecting pivots and using them to divide up the space determines the index structure and pruning performance. The early studies selected pivots by using heuristics. They asserted that good pivots should be outliners of the space, because the distance from such a pivot to each object vary and the pivot easily classifies the objects. They used simple statistical features such as the mean and the variance for selecting pivots. However, their choices ended up being only a little better than random selection. Several selection mechanisms exploit data distributions by using clustering techniques. These methods set the cluster centers as the pivots and divide up the space on the basis of the distances from the pivots. Although they can effectively classify dense regions, they only work well on particular distribution patterns. A cluster may be separated into multiple regions by a pivot, because the partitioning boundaries are based on the cluster'centers rather than their shapes. My work focused on a problem with the existing indexes wherein they don't consider the partitioning boundary of the pivot and cannot select a good pivot for pruning. I devised a new pivot selection approach called the \"data distribution approach\", which is based on the data distribution. On the basis of the data distribution approach, I developed two novel methods for pivot partitioning, called Maximal Metric Margin Partitioning (MMMP) and Pivot Capacity Tree (PCTree). These indexing methods successfully reduce the similarity search cost. MMMP first finds the data distribution pattern, especially the boundaries of clusters. Then, it selects a pivot and its partitioning distance on the basis of the shapes of the data clusters. The partitioning boundary is at the maximum distance from the neighbor cluster edges. MMMP deals well with clustered data. On the other hand, PCTree considers the index tree's balance as well as the data distribution. PCTree chooses a pivot on the basis of the balance of the subregions and the estimated effectiveness of the search pruning. As a result, it automatically optimizes the index structure to the data distribution. It also reduces the imbalance of the tree in comparison with MMMP. The main contribution of this thesis is the data distribution approach for pivot selection. To my best knowledge, this is the first study to exploit the cluster's shapes and tree balancing when making pivot selections. I conducted several experiments that empirically compared these methods with metric space indexes and they showed the methods to be very efficient. In this thesis, I will introduce the basic techniques of similarity searches and then show my methods and experimental results.","subitem_description_type":"Abstract"},{"subitem_description":"本研究の目的は, メトリック空間における類似検索の問い合わせ処理コスト削減である. これまで数多くの類似検索のための索引付け手法が提案されてきたが, いずれの手法も検索対象のオブジェクトの特徴を索引に活かせていなかった. 筆者は, 検索処理過程における枝刈りを効果的にする, オブジェクトの分布の特徴にもとづいた索引付け処理を研究開発することを目指した. 類似検索は, 膨大なオブジェクトの中からクエリに似たものを探す技術である. 類似検索は, レコードリンケージや地理情報システムなど, 様々なアプリケーションの基盤技術として使われている. メトリック空間(距離空間)における類似検索技術は, 三角不等式のような距離の公理を満たすすべてのオブジェクトと距離関数を対象とする. それゆえ, メトリック空間の索引は, ベクトルや文字列, グラフ, 集合といったデータを大量に処理するアプリケーションに役立つ技術である. 類似検索索引は, クエリから距離の遠いオブジェクトを枝刈りし, 距離計算やディスクアクセスといった検索コストを削減する. ほぼすべての索引付け手法は, Pivotと呼ばれる参照オブジェクトを使っている. 索引付け処理では, Pivotで空間を部分空間へ再帰的に分割し, 木構造の索引を構築する. この索引は, 検索処理中に三角不等式で分割された部分空間を枝刈りするのに使われる. つまり, Pivotの選択手法とPivotを使った空間分割手法は, 検索索引の構造と枝刈りの性能を決定すると言える. 先行研究のうち初期の頃のPivot選択手法は経験則にもとづいていた. これらの手法では, Pivotから各オブジェクトまでの距離が分散していればオブジェクトの識別が容易になるとの理由から, オブジェクト空間の端にPivotが位置すると良いと主張していた. そして, 平均値や分散値といった単純な統計値をPivot選択に使っていた. しかしながら, このような手法で選ばれたPivotはランダム選択と比べて大きな改善は見られなかった. そこで, クラスタリング技術を使ってデータの分布をPivot選択に活用する手法が提案された. クラスタリング技術を使った手法では, クラスタの重心をPivotとして設定し, Pivotからの距離にもとづいて空間を分割している. これらの手法はオブジェクトの密集する空間を効果的に分類できるが, 特定のデータの分布にのみ有効である. クラスタの形状ではなくクラスタの重心をもとに分割面を決めているため, クラスタがPivotによって複数の部分空間に切り離されかねない. 筆者は, Pivotによって形成される分割面について考慮が足りないために枝刈りに効果的なPivotを選べないという, 既存手法の問題に注目した. 筆者は\"data distribution-based approach\"と呼ぶデータの分布にもとづいたPivot選択の方策を提案した. そしてこの方策にもとづく, 2つの新しいPivot分割手法, Maximal Metric Margin Partitioning (MMMP)とPivot Capacity Tree (PCTree)を開発した. MMMPはまず, データの分布傾向のうち特にクラスタの境界を抽出する. そして, クラスタ形状にもとづいてPivotとその分割距離を決める. MMMPの分割面は, 隣り合うクラスタの端からの距離を最大にするように置かれる. MMMPは偏った分布のデータに対して効果的な手法である. 一方, PCTreeはデータの分布だけでなく, 索引木のバランスも考慮した手法である. PCTreeでは, Pivotによって分割される部分空間のバランスと, Pivotによる検索時の枝刈りの効果の, 2つを考慮してPivotを選択する. その結果, PCTreeはデータの分布に合わせて索引構造を効果的に変化させている. PCTreeは, MMMPの索引木が不均衡になりうる欠点を改善した手法だと言える. これら2つの手法はともに, 類似検索の検索コストの削減に役立つ. 本研究の貢献は, Pivot選択の新しい方策data distribution-based approachを確立したことだ. 筆者の知る限りでは, クラスタ形状や索引木のバランスをPivot選択に考慮した研究はこれが初めてである. 筆者はこれら2つの提案手法の効果を, 既存研究と比較した実験結果を通じて明らかにした. 本学位論文では, 類似検索の基盤技術について紹介した後に, 提案手法とその実験結果について説明する.","subitem_description_type":"Abstract"}]},"item_7_dissertation_number_26":{"attribute_name":"学位授与番号","attribute_value_mlt":[{"subitem_dissertationnumber":"甲第27286号"}]},"item_7_full_name_3":{"attribute_name":"著者別名","attribute_value_mlt":[{"nameIdentifiers":[{"nameIdentifier":"11322","nameIdentifierScheme":"WEKO"}],"names":[{"name":"倉沢, 央"}]}]},"item_7_identifier_registration":{"attribute_name":"ID登録","attribute_value_mlt":[{"subitem_identifier_reg_text":"10.15083/00005412","subitem_identifier_reg_type":"JaLC"}]},"item_7_select_21":{"attribute_name":"学位","attribute_value_mlt":[{"subitem_select_item":"doctoral"}]},"item_7_subject_13":{"attribute_name":"日本十進分類法","attribute_value_mlt":[{"subitem_subject":"007","subitem_subject_scheme":"NDC"}]},"item_7_text_22":{"attribute_name":"学位分野","attribute_value_mlt":[{"subitem_text_value":"Information Science and Technology (情報理工学)"}]},"item_7_text_24":{"attribute_name":"研究科・専攻","attribute_value_mlt":[{"subitem_text_value":"Department of Information and Communication Engineering, Graduate School of Information Science and Technology (情報理工学系研究科電子情報学専攻)"}]},"item_7_text_27":{"attribute_name":"学位記番号","attribute_value_mlt":[{"subitem_text_value":"博情第324号"}]},"item_7_text_4":{"attribute_name":"著者所属","attribute_value_mlt":[{"subitem_text_value":"東京大学大学院情報理工学系研究科電子情報学専攻"}]},"item_creator":{"attribute_name":"著者","attribute_type":"creator","attribute_value_mlt":[{"creatorNames":[{"creatorName":"Kurasawa, Hisashi"}],"nameIdentifiers":[{"nameIdentifier":"11321","nameIdentifierScheme":"WEKO"}]}]},"item_files":{"attribute_name":"ファイル情報","attribute_type":"file","attribute_value_mlt":[{"accessrole":"open_date","date":[{"dateType":"Available","dateValue":"2017-06-01"}],"displaytype":"detail","filename":"48087404.pdf","filesize":[{"value":"5.3 MB"}],"format":"application/pdf","licensetype":"license_note","mimetype":"application/pdf","url":{"label":"48087404.pdf","url":"https://repository.dl.itc.u-tokyo.ac.jp/record/5421/files/48087404.pdf"},"version_id":"7111d5c6-7d9e-4d12-a1b4-3d0c5fd7123b"},{"accessrole":"open_date","date":[{"dateType":"Available","dateValue":"2017-06-01"}],"displaytype":"detail","filename":"127286b-sinsa.pdf","filesize":[{"value":"99.7 kB"}],"format":"application/pdf","licensetype":"license_note","mimetype":"application/pdf","url":{"label":"127286b-sinsa.pdf","url":"https://repository.dl.itc.u-tokyo.ac.jp/record/5421/files/127286b-sinsa.pdf"},"version_id":"39e5d7b0-f5c8-41fe-ae45-b432fb29e242"},{"accessrole":"open_date","date":[{"dateType":"Available","dateValue":"2017-06-01"}],"displaytype":"detail","filename":"127286a-abst.pdf","filesize":[{"value":"43.5 kB"}],"format":"application/pdf","licensetype":"license_note","mimetype":"application/pdf","url":{"label":"127286a-abst.pdf","url":"https://repository.dl.itc.u-tokyo.ac.jp/record/5421/files/127286a-abst.pdf"},"version_id":"908159f1-078e-44fe-94eb-1cff251d21fd"}]},"item_language":{"attribute_name":"言語","attribute_value_mlt":[{"subitem_language":"eng"}]},"item_resource_type":{"attribute_name":"資源タイプ","attribute_value_mlt":[{"resourcetype":"thesis","resourceuri":"http://purl.org/coar/resource_type/c_46ec"}]},"item_title":"A Study of Fast Similarity Search Techniques in Metric Spaces","item_titles":{"attribute_name":"タイトル","attribute_value_mlt":[{"subitem_title":"A Study of Fast Similarity Search Techniques in Metric Spaces"}]},"item_type_id":"7","owner":"1","path":["280","330"],"pubdate":{"attribute_name":"公開日","attribute_value":"2013-06-20"},"publish_date":"2013-06-20","publish_status":"0","recid":"5421","relation_version_is_last":true,"title":["A Study of Fast Similarity Search Techniques in Metric Spaces"],"weko_creator_id":"1","weko_shared_id":null},"updated":"2022-12-19T03:47:00.433917+00:00"}