Title: | Optimizing multi-resolution applications on distributed scratchpad memory multicore architecture (original Chinese title: 適用於分散式暫存記憶體多核心平台之多媒體多解析處理應用最佳化) |
Authors: | Kan, Li-Yuan (甘禮源); Liu, Chih-Wei (劉志尉); Institute of Electronics |
Keywords: | multi-resolution application; distributed scratchpad memory multicore; object detection |
Issue Date: | 2010 |
Abstract: | Compared with cache-based multicore architectures, distributed scratchpad memory multicore architectures are more power efficient, which makes them well suited to embedded systems such as portable electronics. Programming them to accelerate computation, however, is difficult, time-consuming, and error-prone: the programmer must handle inter-core synchronization, workload balancing, and even explicit data transfers. This thesis analyzes the parallelization of multi-resolution applications on distributed scratchpad memory multicore architectures, with the goal of using the hardware resources efficiently. Multi-resolution processing is common in multimedia domains such as image processing, video compression, signal processing, and intelligent multimedia processing. It performs repeated computations on data at different resolutions, exhibits rich multi-level data locality, and requires frequent data transfers to complete its operations. As a result, data transfer overhead usually prevents multi-resolution applications from achieving linear speedup on such platforms. We propose a data-oriented task partitioning scheme built around data locality: by exploiting inter-resolution locality and using scratchpad memory efficiently, it balances the workload, eliminates unnecessary data fetches, lightens the load on the inter-core data network, and avoids contention for memory resources. As a case study, object detection is parallelized with the proposed partitioning on the Cell processor of a Sony PlayStation 3, a typical distributed scratchpad memory multicore with six cores. Experimental results show that the proposed method reduces data transfer by 95% and improves performance by up to 35.1%; running in parallel on six cores, it achieves a 5.6× speedup over the CellCV version running on a single core. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT079611657 http://hdl.handle.net/11536/41782 |
Appears in Collections: | Thesis |
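Illustrative note on the abstract: the record does not include the thesis's actual partitioning algorithm, so the following is only a minimal Python sketch of the general idea of a data-oriented partition that exploits inter-resolution locality. It assumes a 2x2-average image pyramid and horizontal strips whose heights are multiples of 2^(levels-1); the function names (`downsample`, `build_pyramid`, `partition_rows`, `data_oriented_pyramid`) and the pyramid scheme are hypothetical and not taken from the thesis. Under those assumptions, each core can derive every coarser level of its strip from data it already holds locally, without fetching further data from main memory or from other cores.

```python
def downsample(img):
    """Halve both dimensions by averaging each 2x2 block (even sizes assumed)."""
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4
         for c in range(0, len(img[0]), 2)]
        for r in range(0, len(img), 2)
    ]


def build_pyramid(img, levels):
    """Return [level 0, level 1, ...], each level half the size of the previous."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def partition_rows(img, cores, levels):
    """Split the image into `cores` horizontal strips.

    Strip heights are multiples of 2**(levels - 1), so every 2x2 block at
    every level lies entirely inside one strip (assumption of this sketch;
    the image height must itself be a multiple of that alignment).
    """
    align = 2 ** (levels - 1)
    blocks = len(img) // align
    base, extra = divmod(blocks, cores)
    strips, start = [], 0
    for core in range(cores):
        rows = (base + (1 if core < extra else 0)) * align
        strips.append(img[start:start + rows])
        start += rows
    return strips


def data_oriented_pyramid(img, cores, levels):
    """Each 'core' builds all levels from its own strip; results are stitched."""
    per_core = [build_pyramid(strip, levels)
                for strip in partition_rows(img, cores, levels)]
    return [
        [row for core_pyramid in per_core for row in core_pyramid[level]]
        for level in range(levels)
    ]


if __name__ == "__main__":
    image = [[r * 8 + c for c in range(8)] for r in range(24)]
    # The stitched per-core pyramids match the pyramid built on the whole image.
    print(data_oriented_pyramid(image, cores=6, levels=3) == build_pyramid(image, 3))
```

In this toy model each core loads only its level-0 strip once. The sketch does not model DMA transfers, scratchpad capacity, or detection windows that straddle strip boundaries, and it makes no attempt to reproduce the figures reported in the abstract.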