UTokyo Repository, The University of Tokyo
 


To link to this page (thesis), please use the following URL: http://hdl.handle.net/2261/44119

Title: A Study on the Evolution and Emergence of Web Spam
Other title: ウェブスパムの進化と出現に関する研究
Author: CHUNG, YOUNGJOO
Author (other language): 鄭, 容朱
Issue date: March 24, 2011
Abstract: Web spamming emerged as a way to deceive search engines and obtain higher rankings in search result lists, which bring more traffic and profit to web sites. Link spamming is the major spamming technique: it manipulates the link structure of the Web to deceive link-based ranking algorithms, which regard incoming links to a page as endorsements of it. Spam pages that use link spamming techniques need to be eliminated, since they damage the quality of search results and contaminate web mining and analysis results with useless pages. They are, however, also interesting social activities in cyberspace. In this thesis, I study the evolution and emergence of web spam in large-scale Japanese Web archives spanning three years and containing four million hosts and 83 million links. As far as I know, the overall characteristics of web spam in a time series of web snapshots of this scale have never been explored. Understanding the evolution of web spam pages, such as their growth in number and their change in topics over time, helps in developing new spam detection techniques and in tracking spam sites to observe topic shifts. Understanding the emergence of web spam pages, such as the continuous creation of links to spam pages, helps in collecting new spam samples for updating spam classifiers.
To understand the evolution of web spam, I analyze temporal changes in the size and topics of web spam pages. To clarify global characteristics of web spam pages in a single snapshot, such as their distribution and topics, I first propose a method for extracting spam link structures, called link farms, from large-scale web graphs. I then investigate how the size and topic distributions of link farms evolve. It is found that link farms were isolated from each other and that most of them did not grow; the overall topic distribution in link farms did not change significantly, although new link farms appeared and the hosts in them changed dynamically.
To understand the emergence of web spam, I focus on pages that contain links to spam pages. I propose a method for detecting hijacked sites, which are legitimate sites containing links to spam sites, and evaluate its detection precision. It is confirmed that monitoring hijacked sites helps in discovering emerging spam sites. I also propose a method for identifying spam link generators, which are hosts that will generate links to spam hosts, and evaluate its identification accuracy. It is found that many links to spam hosts were created by spam link generators, and that some of the spam link generators detected in 2004 and 2005 were still generating links to spam hosts in 2010.
The main contributions of this thesis are as follows:
・I clarify the overall distribution and evolution of link farms in a large-scale Japanese Web graph covering three years and containing four million hosts and 83 million links. As far as I know, the overall characteristics of link farms in a time series of web snapshots of this scale have never been explored. I propose a method for efficiently extracting many link farms and investigate the distribution of the extracted link farms. It is found that link farms in the core recursively showed a similar distribution. I categorize spam hosts in link farms into seven topics and build topic classifiers. It is found that two dominant topics accounted for over 60% of all spam hosts in link farms.
・I analyze the evolution of link farms in terms of their size and topics. It is found that link farms were isolated from each other; that many link farms were maintained for years, but most of them did not grow; and that the distribution of topics in link farms did not change significantly, while new link farms appeared and the hosts and keywords related to each topic changed dynamically.
・I study link hijacking techniques and propose a method for detecting hijacked sites. I investigate the characteristics of hijacked sites and categorize them into several types. I detect hijacked sites with high precision and show that emerging spam sites can be discovered by monitoring hijacked sites.
・I study spam link generators, which generate links to spam hosts, and propose several features for identifying them. It is found that almost all new links pointing to spam hosts are created by spam link generators. I identify spam link generators with high accuracy and show that some spam link generators detected in 2004 and 2005 still generated links to spam pages in 2010.
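The definitions used in the abstract can be illustrated with a minimal sketch. It is not the detection method of the thesis, which the abstract does not describe; it only restates two definitions on toy data. A hijacked-site candidate is taken to be a non-spam host with at least one link to a spam host, and newly created spam links are taken to be links to spam hosts that appear in a later snapshot but not in an earlier one. The host names, the `spam_hosts` label set, and both function names are hypothetical and assume that spam labels are already available.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (source host, target host); hypothetical representation


def hijacked_candidates(edges: Set[Edge], spam_hosts: Set[str]) -> Set[str]:
    """Return non-spam hosts that have at least one out-link to a spam host."""
    return {src for src, dst in edges
            if dst in spam_hosts and src not in spam_hosts}


def new_spam_links(earlier: Set[Edge], later: Set[Edge],
                   spam_hosts: Set[str]) -> Set[Edge]:
    """Return links to spam hosts present in `later` but absent from `earlier`."""
    return {(src, dst) for src, dst in later - earlier if dst in spam_hosts}


# Toy snapshots (assumed data, not taken from the thesis).
snapshot_2004 = {
    ("blog.example", "news.example"),
    ("farm-a.example", "farm-b.example"),
}
snapshot_2005 = snapshot_2004 | {
    ("blog.example", "farm-a.example"),
    ("forum.example", "farm-b.example"),
}
spam_hosts = {"farm-a.example", "farm-b.example"}

print(sorted(hijacked_candidates(snapshot_2005, spam_hosts)))
print(sorted(new_spam_links(snapshot_2004, snapshot_2005, spam_hosts)))
```

On this toy data, `blog.example` and `forum.example` are the candidates, and the two links they added between the snapshots are the new spam links. A real detector would need further evidence to separate hijacked legitimate sites from unlabeled spam hosts.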
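Description: Report number: 甲27288 ; Date degree conferred: 2011-03-24 ; Degree category: Course doctorate (課程博士) ; Degree type: Doctor of Information Science and Technology (博士(情報理工学)) ; Diploma number: 博情第326号 ; Graduate school and department: Graduate School of Information Science and Technology, Department of Information and Communication Engineering (情報理工学系研究科電子情報学専攻)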
URI: http://hdl.handle.net/2261/44119
Appears in categories: 021 Doctoral theses
1244020 Doctoral theses (Department of Information and Communication Engineering)

Files for this thesis:

File          Description  Size     Format
48087407.pdf               9.28 MB  Adobe PDF

All items stored in this repository are protected by copyright.

 
