A Study on the Evolution and Emergence of Web Spam

https://doi.org/10.15083/00004101
File: 48087407.pdf (9.5 MB)
Item type: Thesis or Dissertation
Publication date: 2012-11-09
Title: A Study on the Evolution and Emergence of Web Spam
Language: eng
Resource type: thesis (http://purl.org/coar/resource_type/c_46ec)
DOI: 10.15083/00004101 (registered with JaLC)
Alternative title: ウェブスパムの進化と出現に関する研究
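Author: CHUNG, YOUNGJOO (WEKO author ID 9463)
Author alternative name: 鄭, 容朱 (WEKO identifier 9464)
Author affiliation: Department of Information and Communication Engineering, Graduate School of Information Science and Technology, University of Tokyo (東京大学大学院情報理工学系研究科電子情報学専攻)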
Abstract

Web spamming emerged as a way to deceive search engines and obtain a higher ranking in search result lists, which brings more traffic and profit to web sites. Link spamming is the major spamming technique: it manipulates the link structure of the Web to deceive link-based ranking algorithms, which regard incoming links to a page as endorsements of it. Spam pages that use link spamming techniques need to be eliminated, since they damage the quality of search results and contaminate web mining and analysis results with useless pages. They are, however, also interesting social activities in cyberspace.

In this thesis, I study the evolution and emergence of web spam in large-scale Japanese Web archives that span three years and contain four million hosts and 83 million links. As far as I know, the overall characteristics of web spam in a time series of web snapshots of this scale have never been explored. Understanding the evolution of web spam pages, such as their growth in number and their change in topics over time, helps in developing new spam detection techniques and in tracking spam sites to observe topic shifts. Understanding the emergence of web spam pages, such as the continuous creation of links to spam pages, helps in collecting new spam samples for updating spam classifiers.

To understand the evolution of web spam, I analyze temporal changes in the size and topics of web spam pages. To clarify global characteristics of web spam pages, such as their distribution and topics in a single snapshot, I first propose a method for extracting spam link structures, called link farms, from large-scale web graphs. I then investigate how the size and topic distributions of link farms evolve. I find that link farms were isolated from each other and that most of them did not grow; the overall topic distribution in link farms did not change significantly, although new link farms appeared and the hosts in them changed dynamically.

To understand the emergence of web spam, I focus on pages that contain links to spam pages. First, I propose a method for detecting hijacked sites, which are legitimate sites that contain links to spam sites, and evaluate its detection precision. The results confirm that monitoring hijacked sites helps in discovering emerging spam sites. Second, I propose a method for identifying spam link generators, which are hosts that will generate links to spam hosts, and evaluate its identification accuracy. I find that many links to spam hosts were created by spam link generators, and that some of the spam link generators detected in 2004 and 2005 were still generating links to spam hosts in 2010.

The main contributions of this thesis are as follows:

・I clarify the overall distribution and evolution of link farms in a large-scale Japanese Web graph covering three years and containing four million hosts and 83 million links. As far as I know, the overall characteristics of link farms in a time series of web snapshots of this scale have never been explored. I propose a method for efficiently extracting many link farms and investigate the distribution of the extracted link farms. I find that link farms in the core recursively showed a similar distribution. I categorize the spam hosts in link farms into seven topics and build topic classifiers. I find that two dominant topics accounted for over 60% of all spam hosts in link farms.
・I analyze the evolution of link farms in terms of their size and topics. I find that link farms were isolated from each other; that many link farms were maintained for years, but most of them did not grow; and that the distribution of topics in link farms did not change significantly, while new link farms appeared and the hosts and keywords related to each topic changed dynamically.
・I study link hijacking techniques and propose a method for detecting hijacked sites. I investigate the characteristics of hijacked sites and categorize them into several types. I detect hijacked sites with high precision and show that emerging spam sites can be discovered by monitoring hijacked sites.
・I study spam link generators, which generate links to spam hosts, and propose several features for identifying them. I find that almost all new links pointing to spam hosts are created by spam link generators. I identify spam link generators with high accuracy and show that some spam link generators detected in 2004 and 2005 were still generating links to spam pages in 2010.
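Bibliographic information: issue date 2011-03-24
Nippon Decimal Classification (NDC) subject: 007
Degree name: Doctorate in Information Science and Technology (博士(情報理工学))
Degree level: doctoral
Degree field: Information Science and Technology (情報理工学)
Degree-granting institution: University of Tokyo (東京大学)
Graduate school and department: Department of Information and Communication Engineering, Graduate School of Information Science and Technology (情報理工学系研究科電子情報学専攻)
Date of degree conferral: 2011-03-24
Degree conferral number: 甲第27288号
Diploma number: 博情第326号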

Versions

Ver.1 2021-03-01 19:50:01.848246