標題: 以訊息關係為基礎的檢查點及回復演算法
Checkpointing and Recovery Algorithms Based on Message Dependency
作者: 吳賦哲
Wu, Fu-Che
黃廷祿
Ting-Lu Huang
資訊科學與工程研究所
關鍵字: 檢查點;回復;一致;發生在前;鋸齒路徑;訊息相關;checkpointing;recovery;consistent;happen before;zigzag path;message dependency
公開日期: 1995
摘要: 摘要 在分散式系統中,要提供系統具有發生錯誤後,可回復執行的能力,一般 可藉訊息登錄讓 系統可重複原來的執行,再回到錯誤發生之前。或藉檢 查點直接回到一致的系統狀態重新 執行。在我們論文中,我們首先介紹 四種可決定全域狀態是否一致的的關係:即鋸齒路徑 (Z)、發生在前(H) 、檢查點基礎的發生在前(CH)、和訊息相關(M)。這些關係中具有下列 的性質: Z > H > CH > M。M的集合最小,表示與某點有訊息關係的點最 少。在實作上, 只須在訊息上增加O(1)的訊息量便足以決定兩檢查點間 是否有訊息相關。我們的演算法以 訊息相關為基礎,所以,負擔很小。 接著,我們介紹作檢查點的演算法,對一不常發生錯 誤的系統而言,應 考慮採用在沒有錯誤發生時是較少負擔的作法,我們的演算法,將所有 的控制訊息都附在應用訊息之上。此外,為了維持各檢查點間的因果關係 ,我們以訊息上 所帶檢查點的值,來判斷是否該作一檢查點。最後在回 復時,我們將由各檢查點的相關向 量,找出一組彼此間沒有訊息相關的 最大的一致全域檢查點。這樣的作法,不但不須要馬 上協調所須的額外 控制訊息,也不會有骨牌效應。缺點是作檢查點時須要較多的儲存空間, 及回復時須要O(N3)的時間去建構一致的檢查點。 abstract In distributed systems, failure recovery could be achieved either by way ofmessage logging, with which we can replay the original run to the state prior tofailure occurs, or by way of checkpoting, with which we can directly rollback toa consistent global checkpoints to start a new run. In our thesis, we firstintroduce four relations, zigzag path(Z), happen before( H),checkpoint-basedhappen before(CH) and message dependency(M), to determine if the globalcheckpoints are consistent or not. The relations have the following property:Z > H > CH > M. Set M is smallest one. It means that the number of checkpointswhich have message dependency with some checkpoint is least. As far asimplementation is concerned, it only requires appending O(1) additional messagesto each application message to determine if the checkpoints have messagedependency with each other or not. Our algorithm is based on message dependency;thus, the overhead is small. Secondly, we present the checkpointing algorithm.In a system, if a failure seldom occurs, we must choose an approach with lessoverheads while the failure does not occur. In our algorithm, all of the controlmessages are piggybacked on each application message. Besides, to maintaincausal relation among checkpoints, we use the value tagged on each message todecide if checkpointing is required or not. Finally, in our recovery algorithm,by comparing the dependency vector(DV), we can construct a maximum globalcheckpoints without any message dependency between them. Our algorithm does notneed extra control message required in eager checkpoint coordination. Inaddition, it is domino free. However this algorithm requires more storages forcheckpointing and the time complexity for recovery is O(N3).
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT840392033
http://hdl.handle.net/11536/60376
顯示於類別:畢業論文