標題: | Churn: 真實 P2P 應用軟體之關鍵效能因子 Churn: a Key Effect on Real-world P2P Software |
作者: | 鍾明辰 Chung, Ming-Chen 曾建超 Tseng, Chien-Chao 資訊科學與工程研究所 |
關鍵字: | 同儕網路;網路擾動;真實環境;非架構化;檔案分享;Gnutella;效能;拓樸;快照;爬蟲;Peer-to-peer;Churn;Real-world;Unstructured;File-sharing;Gnutella;Performance;Topology;Snapshot;Crawler |
公開日期: | 2012 |
摘要: | 在這篇論文中,我們設計與實作壹套非結構式P2P (unstructured P2P)網路拓樸蒐集與效能實驗系統,透過實驗數據的分析,深入探討“Churn”對實際應用的非結構式P2P (unstructured P2P)軟體所帶來的影響。P2P (peer-to-peer)應用程式近十年來在網際網路上大幅地成長。於P2P網路中,由於許多節點 (peer,也稱作servent)不斷地加入和離開的動態行拓樸為,節點會不斷地更換其鄰居 (neighbors)。這種節點之間動態交互連結的現象稱為“Churn”,Churn會嚴重地影響P2P軟體的效能。Churn對結構式P2P (structured P2P)效能的影響已被詳盡地探究。然而,因為非結構式P2P複雜的運作環境,鮮有研究去探討Churn對非結構式P2P的影響。因此,這篇論文著重於Churn對非結構式P2P的影響的研究。
為了分析Churn和P2P網路效能的關係,我們必須蒐集兩組數據: P2P網路快照及其效能數值。數據的收集有三個條件: 節點數量要足夠、收集時間要短及對P2P網路造成的負擔要小,然而,要兼顧這三個條件是困難的。整體而言,在P2P網路的節點數超過百萬,每個快照必須包含足夠的代表性節點。但是,因為P2P網路拓樸會隨著時間不斷地改變,蒐集每個快照所需的時間要盡可能地縮短以維持快照的準確度。為了達成目的,我們建制了Third-party-to-servent Crawling with Servent-to-servent Sampling (TCSS)系統來蒐集相關數據,並分析Churn和P2P網路效能的關係。
TCSS利用第三方爬蟲系統收集P2P網路拓樸以降低數據收集時對P2P網路的干擾。TCSS更進一步地應用分散式架構與平行化技術來加速數據的收集。整個爬蟲系統包含一個中央資料庫、一個工作分派伺服器與多個(目前有十一個)爬蟲客戶端。中央資料庫會儲存所有被探訪過或新發現的節點紀錄。工作分派伺服器負責分配未被探訪的節點至各個爬蟲客戶端。爬蟲客戶端則負責收集拓樸數據,結合非同步I/O增加平行度。除此之外,每個爬蟲利用multiprocessing workers從已收集資料中篩選出未被探訪的P2P節點,並將這些節點紀錄於中央資料庫中,再配合快取技術增進其篩選效率。另外,TCSS所應用的Servent-to-servent取樣技術為透過數個modified servent於P2P網路中取得具代表性P2P節點的效能數據。Servent-to-servent取樣方法採取Metropolized Random Walk with Backtracking (MRWB)技術,大幅縮減樣本總數以增進數據收集的效率。
透過高度平行化的系統設計和取樣技術的實現,TCSS突破了現有非結構式P2P的拓樸爬蟲系統 (crawling system)中,無法蒐集效能數據的限制。我們的實驗結果說明TCSS可在短時間內蒐集到精準的P2P網路拓撲快照 (一個快照所需時間約為七分鐘)。
我們針對非結構式P2P網路Gnutella不間斷地蒐集了20個小時的數據資料,共163個拓樸快照,每個快照的平均節點個數超過一百萬。這些資料先經過我們設計的peer elimination policy淘汰不合格的節點資訊,以降低誤差,再對資料做廣泛地探討。從分析中,我們共有三個主要的發現。從拓樸面來看: 第一、Churn的變化除了受節點鄰居 (neighbors)異動影響外,還受節點本身的鄰居選擇演算法影響。我們的資料顯示被替換的鄰居有可能還存在於P2P網路中,而並非下線離開。第二、資料中,有一批極長壽的節點,即使拓樸規模增長,其個數幾乎為一常數,且整個P2P網路仍具有小世界網路 (small-world network)的特質 (任意兩點間的最短距離大多為4或5)。再以效能面,從節點初始化的時間與關鍵字搜尋所需的時間來看第三個發現: 隨著Churn的影響加劇,個別節點的效能並非絕對地降低,但是,效能下降的平均機率會相對地提高,且效能的變異程度也會增高。在關鍵字搜尋的分析結果中,我們可以看到搜尋“最受歡迎”關鍵字所需時間受Churn影響不大,但是,搜尋“較不受歡迎”關鍵字所需時間,其變動幅度卻隨Churn加劇而升高,且受影響的節點個數也隨之上升。
總結而言,本論文提出一套用於收集非結構式P2P網路數據的系統設計與實作技術。接著再進一步地蒐集與分析實驗數據,探討Churn對於非結構式P2P的影響,而這些影響尚未有人深入研究過,研究結果對於P2P網路的研究有極大的參考價值。 This thesis presents an empirical study of the effect of “churn” on the real-world unstructured Peer-to-Peer (P2P) applications. Over the past decade, P2P applications have grown radically on the Internet. In P2P networks, peers (also referred to as servents) may connect with different neighbor peers over time due to the dynamic joining and leaving of peers. Such dynamic interconnection phenomenon is referred to as “Churn” and may seriously affect the performance of P2P applications. The effect of Churn on the performance of structured P2P networks has already been well-studied. However, little research has been conducted on the effect of churn on unstructured P2P networks because of the operational complexity rooted from the unstructured topologies. Therefore, this thesis turns its focus onto the investigation of the effect of Churn on unstructured P2P networks. In order to analyze the correlation between Churn and the performance of a P2P network, we need to collect a dataset of the P2P network topology snapshots and the associated performance metrics. Dataset collection has three requirements, namely large number of representative peers, short collection time, and little interference to the P2P networks, keeping in mind that these three requirements may contravene with one another at times. In general, millions of peers exist on a P2P network and each snapshot must contain sufficient number of representative peers. However, because P2P topologies change over time, the collection time of each snapshot must be short enough for the snapshots to be accurate. Therefore, we propose a Third-party-to-servent Crawling with Servent-to-servent Sampling (TCSS) system to collect the datasets for analyzing the effects of Churn on the performance of a P2P network. TCSS utilizes a third-party crawling technique to collect a P2P network topology without disturbing the original P2P network topology under investigation. Furthermore, TCSS adopts a distributed and parallel crawling system to speed-up the crawling process. The crawling system consists of one central repository, one dispatching server and eleven crawling clients. The central repository stores all peers discovered by the crawling clients. The dispatching server is responsible for dispatching new peer crawling jobs to balance the loads of the crawling clients. The crawling clients perform the crawling work by using asynchronous I/O to increase parallelism. Besides, each crawling client forks five worker processes to determine and store the newly discovered peers in the central repository. The crawling clients also adopt cache technique for workers to speed-up the new peer discovery processes. In addition, TCSS also employs a Servent-to-servent Sampling technique, in which many customized lightweight servents contact with the representative servents selected from the target P2P network, to gather the corresponding performance metrics of the P2P network simultaneously. Servent-to-servent Sampling adopts Metropolized Random Walk with Backtracking (MRWB) to select the representative servents from the mass population of peers and significantly reduce the number of servents for the customized servent to contact with. With the highly parallel crawling system and Servent-to-servent Sampling technique, TCSS surpasses the limitation in current crawling systems for unstructured P2P networks, where no performance metrics are included in the corresponding topology snapshots. Empirical results show that TCSS can capture an accurate topology snapshot of the P2P network in a short period (around 7 minutes per snapshot). We have conducted a broad measurement on our target P2P network, Gnutella, continuously for 20 hours and collected a dataset of 163 snapshots. Each snapshot contains a topology of more than one million of peers and corresponding performance metrics. We used a peer elimination policy to eliminate peers with inaccurate information and then investigated thoroughly the collected dataset and discovered three significant findings. First, Churn is indeed the combined effects of peer arrivals/departures and neighbor selection algorithm. From the dataset, we can observe that peers may change their neighbors even if their previous neighbors still exist in the P2P network. Second, as the number of peers increases, the number of very long-life peers nearly remains constant and the P2P network possesses a small world property (the shortest paths of any two peers are mostly four or five hops). Third, as churn aggravates, the booting time of each peer does not certainly decline. However, the booting times of peers indeed decline on the average but the variation increases as Churn aggravates. As for the response time of keyword searches, the top rank keyword searches are nearly not affected by the degree of churn. However, the response times of lower rank keywords increases significantly as Churn becomes aggravated. In summary, this thesis presented the design and implementation of a framework that can be used to collect dataset of the unstructured P2P networks. Furthermore, it also provides thorough empirical results of the effects of Churn on unstructured P2P networks that are not disclosed previously. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT079855502 http://hdl.handle.net/11536/48237 |
顯示於類別: | 畢業論文 |