Title: | Parallel Direct Simulation Monte Carlo (DSMC) Methods for Modeling Rarefied Gas Dynamics
Authors: | Su, Cheng-Chin; Wu, Jong-Shinn; Department of Mechanical Engineering
Keywords: | Rarefied Gas Dynamics; parallel direct simulation Monte Carlo (DSMC); graphics processing unit (GPU); MPI-CUDA; large-scale simulation; near-continuum flow
Issue Date: | 2012
Abstract: | Rarefied gas dynamics plays an important role in various research disciplines, including hypersonic fluid dynamics, vacuum pump technology, low-pressure semiconductor materials processing, and micro- and nano-scale gas dynamics. The Boltzmann equation that governs rarefied gas flows is generally very difficult to solve. The direct simulation Monte Carlo (DSMC) method, a particle-based method, is considered the most efficient and accurate numerical method for solving the Boltzmann equation statistically, provided the number of simulated particles is large enough. Its computational expense, however, is generally very high, especially in the transitional and near-continuum flow regimes. Parallel processing of the DSMC method to reduce the computational time is therefore essential for its efficient application to general rarefied gas flows.
This thesis presents two major categories of parallel processing for the DSMC method: a new parallel 2D/3D DSMC code on unstructured grids using the Message Passing Interface (MPI), and a parallel 2D DSMC code on structured grids using a hybrid MPI-CUDA (CUDA: Compute Unified Device Architecture) approach. Both are described briefly below.
In the first part of the thesis, a new general-purpose parallel 2D/3D DSMC code based on the C++ language and using a hybrid unstructured grid (named PDSC++ hereafter) is developed and validated. Several key features of PDSC++ are presented and discussed, including a variable time-step (VTS) scheme, a transient adaptive sub-cell (TAS) method, and parallel processing of the DSMC computation. In the VTS scheme, the local time step of each cell is set according to the local mean free path, and the particle weight in each cell is proportional to that time step; an efficient particle tracing algorithm on the unstructured grid enforces conservation of mass, momentum, and energy as particles cross cells of different weights. Compared with a constant time-step scheme, this greatly reduces the total number of simulated particles. In the TAS method, the number of sub-cells in each cell is adapted dynamically, based on the local mean free path or the number of simulated particles, so that the average collision separation remains smaller than the local mean free path. Results show that the TAS method coupled with the VTS scheme greatly reduces the computational time of DSMC simulations while maintaining very high collision quality (a minimal sketch of both per-cell updates is given below). For the parallel processing of the DSMC method, a simple and efficient domain re-decomposition (DRD) method is proposed to improve parallel performance without resorting to dynamic domain decomposition. Results indicate that speedups of 123 to 135 are reached with 192 cores for large-scale problems run on the ALPS cluster of the National Center for High-Performance Computing (NCHC), Taiwan. In addition, the capability of PDSC++ is demonstrated by simulating a three-dimensional problem with one billion simulated particles using 768 cores of ALPS.
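The abstract itself contains no source code; the following is a minimal, hypothetical C++ sketch of the per-cell VTS and TAS updates it describes. The struct layout, function names, safety factor, and sub-cell cap are illustrative assumptions, not the actual PDSC++ implementation.

```cpp
// Illustrative per-cell update of the VTS and TAS parameters described
// above. All names and constants are hypothetical placeholders.
#include <algorithm>
#include <cmath>

struct Cell {
    double volume;           // cell volume [m^3]
    double meanFreePath;     // sampled local mean free path [m]
    double meanThermalSpeed; // sampled mean thermal speed [m/s]
    int    numParticles;     // simulated particles currently in the cell
    double dt;               // local time step (VTS)
    double weight;           // particle weight, proportional to dt (VTS)
    int    numSubCells;      // sub-cell count (TAS)
};

// VTS: choose the local time step as a fraction of the local mean
// collision time (lambda / c_bar) and scale the particle weight
// proportionally, so that mass, momentum, and energy fluxes remain
// consistent when particles cross cells with different weights.
void updateVTS(Cell& c, double refDt, double refWeight, double safety = 0.25) {
    const double meanCollisionTime = c.meanFreePath / c.meanThermalSpeed;
    c.dt     = safety * meanCollisionTime;
    c.weight = refWeight * (c.dt / refDt);
}

// TAS: use enough sub-cells that the expected distance between collision
// partners stays below the local mean free path, capped by the current
// particle count so sub-cells are not left unpopulated.
void updateTAS(Cell& c) {
    const double cellSize = std::cbrt(c.volume);
    const int perDirection =
        std::max(1, static_cast<int>(std::ceil(cellSize / c.meanFreePath)));
    const int maxByCount = std::max(1, c.numParticles / 2);
    c.numSubCells = std::min(perDirection * perDirection * perDirection,
                             maxByCount);
}
```

The key design point is that the particle weight tracks the local time step: a cell with a small mean free path takes short steps with light particles, while a near-free-molecular cell can take long steps with heavy particles, which is what reduces the total particle count relative to a constant time-step scheme.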
In the second part of the thesis, a two-dimensional DSMC code using an MPI-CUDA parallelization paradigm on clusters of graphics processing units (GPUs), named PDSC-MG, is presented. An all-device (i.e., GPU) computational approach is adopted: the entire computation, including particle moving, indexing, collisions, and sampling, is performed on the GPU, leaving the CPU essentially idle. Communication between the GPU (device) and the CPU (host) occurs only to enable multi-GPU computation through the Message Passing Interface (MPI). In this approach, MPI is used to distribute and gather data among the memories of the different MPI processes and to handle all inter-process communication, while the GPUs accelerate the DSMC-related computations through the Compute Unified Device Architecture (CUDA), one approach to general-purpose computing on GPUs (GPGPU); a sketch of one such time step follows. Results show that the computational time is reduced by factors of 16 and 185 when using a single GPU and 16 GPUs (NVIDIA Tesla M2070), respectively, compared with a single core of a CPU (Intel Xeon X5670). A parallel efficiency of 75% is achieved with 16 GPUs, relative to a single GPU, for simulations with 30 million particles. Finally, several very large-scale simulations in the near-continuum regime demonstrate the capability of the parallel code: approximately 21.81 hours are required for 120,000 time steps with approximately 255 million particles and 6.4 million cells on 16 GPUs (NVIDIA Tesla M2070).
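The following CUDA C++ skeleton sketches one time step under the all-device MPI-CUDA paradigm described above, assuming one GPU per MPI rank. Kernel names, launch configurations, and buffers are hypothetical; the packing and MPI_Alltoallv exchange of migrating particles are indicated only in comments.

```cpp
// Hypothetical skeleton of one PDSC-MG time step. All DSMC phases run
// on the device; the host only orchestrates and relays migrating
// particles between ranks via MPI.
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

__global__ void movePar()    { /* particle moving (device) */ }
__global__ void indexPar()   { /* particle indexing (device) */ }
__global__ void collidePar() { /* particle collisions (device) */ }
__global__ void samplePar()  { /* state sampling (device) */ }

void dsmcStep(int nRanks) {
    // All four DSMC phases execute on the GPU; the CPU stays idle here.
    movePar<<<1024, 256>>>();
    indexPar<<<1024, 256>>>();
    collidePar<<<1024, 256>>>();
    samplePar<<<1024, 256>>>();
    cudaDeviceSynchronize();

    // Exchange only particles that left this rank's subdomain: agree on
    // counts first, then (not shown) pack particle records, exchange
    // them with MPI_Alltoallv, and copy arrivals back to the device.
    std::vector<int> sendCounts(nRanks, 0), recvCounts(nRanks, 0);
    // ... cudaMemcpy per-destination counts from device into sendCounts ...
    MPI_Alltoall(sendCounts.data(), 1, MPI_INT,
                 recvCounts.data(), 1, MPI_INT, MPI_COMM_WORLD);
    // ... MPI_Alltoallv(...); cudaMemcpy arrivals back host-to-device ...
}
```

Keeping every phase on the device and moving only boundary-crossing particles through the host is what limits host-device traffic and makes the reported multi-GPU scaling plausible.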
At the end of the thesis, the major findings are summarized and directions for future work are outlined.
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT079714820 http://hdl.handle.net/11536/72164 |
Appears in Collections: | Thesis