Title: Early Stage HW/SW Codesign of Multi-PE SIMD Engine
(Original title: 单指令多资料流多核心引擎设计初期软硬体协同设计)
Authors: 吴声昀
刘志尉
Institute of Electronics
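Keywords: multi-core architecture; single-instruction multiple-data (SIMD) architecture; hardware/software codesign; object detection; Multi-PE; SIMD; HW/SW codesign; Object-detection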
Issue date: 2010
Abstract: For increasingly complex multimedia and communication embedded applications, a multi-PE SIMD engine is an attractive solution because it balances flexibility of use against high computational demand. The arrangement of functional units and the communication architecture between PEs are the critical design issues in such an engine. To get the most out of the invested hardware resources, the design parameters must be explored against the parallelism characteristics of the target application. Early-stage hardware/software codesign explores suitable software algorithms and hardware configurations early in the design flow, which can greatly shorten design time. This thesis proposes a multithreaded scalable-SIMD library for multi-PE SIMD accelerator design. Using the library, a designer can take an application written in a high-level language and, with minimal rewriting, mark where low-level, hardware-specific operations are used. Combined with a parameterized multi-PE simulation tool, the performance of different software algorithms and hardware configurations can be estimated quickly. As a case study, the library is used to parallelize an object-detection algorithm and to design an instruction set architecture for it. Integrating rectangle addressing into each PE speeds up the key detection kernel by 2x. A hybrid vectorization algorithm designed for long SIMD units achieves a 4x to 5x speedup with a 512-bit SIMD unit. Finally, three task-partitioning algorithms are implemented to analyze how the trade-off between thread-level parallelism (TLP) and data-level parallelism (DLP) affects performance on a multi-PE SIMD engine with different shared-memory architectures. Besides early design space exploration, the tool set can also serve as a fast simulator for a specific multi-PE SIMD architecture: once the hardware model parameters are set, estimated software performance is available before the hardware platform is ready, so software design can start earlier and the design flow is accelerated.
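The abstract does not say what the rectangle-addressing unit computes. One plausible reading, assumed here and not taken from the thesis, is that the detection kernel sums pixels over rectangles of an integral image (summed-area table), as Haar-like feature detectors do. Under that assumption each rectangle sum needs four table lookups whose addresses all derive from one rectangle descriptor, which is the kind of address generation a dedicated unit could take over. The function names `integral_image` and `rect_sum` below are illustrative only.

```python
def integral_image(img):
    """Build a (h+1) x (w+1) summed-area table with a zero top row and left column.

    sat[y][x] holds the sum of img over rows 0..y-1 and columns 0..x-1.
    """
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0  # running sum of the current row
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def rect_sum(sat, x, y, w, h):
    """Sum of the w x h rectangle whose top-left pixel is (x, y).

    Four lookups; all four addresses follow from (x, y, w, h).
    This is an assumed model of what rectangle addressing would feed.
    """
    return sat[y + h][x + w] - sat[y][x + w] - sat[y + h][x] + sat[y][x]


if __name__ == "__main__":
    image = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    table = integral_image(image)
    print(rect_sum(table, 1, 1, 2, 2))  # bottom-right 2x2 block
```

In software, the four corner addresses cost several index computations per rectangle; a sketch like this only shows where that work sits, not how the thesis implements or measures it.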
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079711641
http://hdl.handle.net/11536/44342
Appears in collections: Thesis