Title: | Early Stage HW/SW Codesign of Multi-PE SIMD Engine (單指令多資料流多核心引擎設計初期軟硬體協同設計) |
Authors: | 吳聲昀 劉志尉 Institute of Electronics |
Keywords: | Multi-core architecture; Multi-PE; SIMD; HW/SW codesign; Object detection |
Issue Date: | 2010 |
Abstract: | A multi-PE SIMD engine provides strong parallel processing capability and is a well-suited solution for increasingly complex embedded multimedia and communication applications, because it balances flexibility against high computational demand. The arrangement of functional units and the communication architecture between PEs are critical design issues for such an engine. To take full advantage of the hardware resources invested, appropriate design parameters should be explored according to the parallelism characteristics of the target application. Early-stage hardware/software codesign explores suitable software algorithms and hardware configurations early in the design process, which can greatly shorten design time. This thesis proposes a multithreaded scalable-SIMD library for multi-PE SIMD accelerator design. Using the library, a designer can take an application originally written in a high-level language and, with minimal modification, mark where low-level, hardware-specific operations are used. Combined with a parameterized multi-PE simulation tool, the library allows the performance of the resulting software to be estimated quickly under different software algorithms and hardware configurations. As a case study, the library is applied to algorithm parallelization and instruction set architecture design for object detection. Integrating rectangle addressing into each PE speeds up the key detection kernel by a factor of 2. A hybrid vectorization algorithm designed for long SIMD units achieves a speedup of 4 to 5 times with a 512-bit SIMD unit. Finally, three task-partitioning algorithms are implemented to analyze their performance impact on the multi-PE SIMD engine and its shared memory architecture, and to observe the TLP/DLP trade-off under different memory architectures.
Besides early design space exploration, the tool set can also serve as a fast simulator for a specific multi-PE SIMD architecture. Once the parameters of the hardware model are set appropriately, estimated software performance figures are available before the actual hardware platform is ready, so software design can begin earlier and the overall design flow is accelerated. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT079711641 http://hdl.handle.net/11536/44342 |
Appears in Collections: | Thesis |