# 國 立 交 通 大 學 電機資訊學院 資訊學程 碩士論文

## **TI TMS320C55x**' **DSP** 架構下執行模乘法運算之效率研究

A study of the performance of operating modular multiplication based on TI TMS320C55x' DSP



- 研 究 生:薛明宏
- 指導教授:葉義雄 教授
- 中 華 民 國九 十 四 年 六 月

#### **TI TMS320C55x**' **DSP** 架構下執行模乘法運算之效率研究

A study of the performance of operating modular

## multiplication based on TI TMS320C55x' DSP



指導教授:葉義雄 Advisor Dr. Yi-Shiung Yeh

國 立 交 通 大 學

電機資訊學院 資訊學程



Submitted to Degree Program of Electrical Engineering and Computer Science College of Electrical Engineering and Computer Science National Chiao Tung University in Partial Fulfillment of the Requirements for the Degree of Master of Science In Computer Science June 2005 Hsinchu, Taiwan, Republic of China



#### **TI TMS320C55x**' **DSP** 架構下執行模乘法運算之效率研究

#### 學生:薛明宏 指導教授:葉義雄博士

國立交通大學電機資訊學院 資訊學程 研究所﹚碩士班

#### 摘 要

現階段由於無線通訊技術發達,許多無線移動運算平台也應運而 生,例如,無線電話、個人數位助理及具無線上網功能之筆記型電腦 等。由於移動計算平台與基地台溝通的媒介乃是空氣,因此加強資訊 安全保護便成了移動計算平台上一個重要的課題。採用加密系統時除 了安全性外,最重要的便是運算效能,由於模乘法乃現有密碼系統中 最主要的計算元件,加速模乘法的運算將可直接提昇整體密碼系統之 效能。本文針對適用於無線環境多功能的CPU TI TMS320C55'x DSP,進行其架構研究,發展出一套系統性的模乘法選用法則,並在 研究過程中發展出一套新的模乘法,此模乘法的運算效能較Morita & Yang演算法稍佳。

# A study of the performance of operating modular multiplication based on TI TMS320C55x' DSP

## **Student: Ming-Hung Hsueh Advisor: Dr. Yi-Shiung Yeh**

## Degree Program of Electrical Engineering Computer Science National Chiao Tung University

#### Abstract

Nowadays, because of the rapid progressive of wireless communication, a variety of mobile computational platforms are emerged, such as mobile phone, PDA, and notebook. The medium between mobile stations and base stations is air. As a result, to protect the information of mobile platform becomes an important issue. TI TMS320c55x' DSP is a high performance CPU, which can handle multimedia, voice compression, and voice communication, and also, has highly performance with low power consumption. Therefore, it is an ideal multipurpose CPU that can be introduce to mobile platform. In this thesis, we introduce a systemic model to estimate the performance of executing a modular multiplication on TI TMS320c55x' DSP. Moreover, we propose a new modular multiplication algorithm, which achieves higher performance than Morita & Yang's Algorithm.

### 致 謝

首先感謝指導教授葉義雄教授,在研究所求學階段授與 我專業知識及正確的研究態度。其次感謝公司長官們的包容 讓我在工作後還能有機會充實自我,更要感謝同事豐富、瑞 敏及嘉耀在我求學過程中所給予的協助。

最後感謝我的家人的支持,尤其是內人莉淳這段日子辛 苦負擔了大部分的家務與女兒們的教養,使我能無後顧之 憂,安心完成學業。





## **Contents**





## **List of Figures**





## **List of Tables**





#### Chapter 1 Introduction

#### **1.1 Modular Multiplication**

Diffie & Hellman<sup>[15]</sup> proposed the public key concepts in 1976. Since then, a lot of famous elegant public key crypto-systems had been proposed, such as ElGamal [14], in which the security is based on the difficulty of discrete logarithm problem, and RSA[9], in which the security is based on the large prime problem. Those cryptosystems open a new era of cryptology. Nowadays, crypto-system can provide not only confidential of information but also authenticity, integrity, and nonrepudiation. The progression of public key cryptosystems provides variety usages, for example, sharing key between strangers, making e-transaction reliably, and authenticating the issuer of information.

The core operation of some important public key cryptosystems, such as RSA, ELGamal, and DSA[4] is modular exponentiation. Executing a modular exponentiation takes the longest time running a public key encryption or decryption. As a result, improving the performance of executing modular exponentiation becomes a key point.

There are two main schemes to achieve higher performance of executing a modular exponentiation. First, reduce the amount of modular multiplications, such as Knuth's[3] M-ary approach and Chiou's Parallel scheme[1]. Second, improve the performance of executing a modular multiplication, such as [2][5][6].

1

#### **1.2 The Goal of This Thesis**

Nowadays, because of the rapid progressive of wireless communication, a variety of mobile computational platforms are emerged, such as mobile phone, PDA, and notebook. The medium between mobile stations and base stations is air. As a result, to protect information of mobile platform becomes an important issue.

TI TMS320c55x' DSP is a high performance CPU. It can handle multimedia, voice compression, voice communication, and highly performance with low power consumption. Therefore, it is a multipurpose CPU that can be introduced to mobile platform.

Our research is to introduce a systemic model to estimate the performance of executing the modular multiplications on TI TMS320c55x' DSP. Moreover, we propose a new modular multiplication algorithm which achieves higher performance than Morita & Yang's Algorithm.

#### Chapter 2 Implementation of the Basic Arithmetic of Modular Multiplication Algorithm on TI TMS320c55x' DSP

MS320C55x' DSP takes only one clock cycle to execute an addition, a subtraction, or a multiplication instruction. Therefore, we need a new method to evaluation a modular multiplication algorithm for the DSP.

In this chapter, we will briefly introduce the architecture of TI TMS320c55x' DSP in Section 2.1 and Section 2.2. Then, we discuss the implementation issues of some frequent used arithmetic of modular multiplication algorithms in Section 2.3.

#### **2.1 Architecture of TI TMS320c55x' DSP**

The architecture of TI TMS320c55x' DSP is shown as Fig. 2-1. The pipeline of TI TMS320c55x' DSP has four stages. There are,



Figure 2-1 The CPU Diagram[13]

- 1. Instruction buffer unit Machine codes of the instruction set of TI TMS320c55x' DSP are variety length. As a result, an instruction buffer is needed to smoothen the executing of programs.
- 2. Program flow unit This unit handles branch, conditional operation and pipeline protection. Pipeline protection is an automatic mechanism to avoid read-write hazard[7].
- 3. Address data flow unit This unit controls data addresses for data reading and writing. TI TMS320c55x' provides three data read buses and two data write buses.
- 4. Data computation unit This unit performs arithmetic and logic operations. There are two 17-bit by 17-bit MAC, a 40-bit ALU and a Shifter shown in Fig, 2-2.



Figure 2-2 The Data Computation Unit Diagram[13]

#### **2.2 Arithmetic Instructions of** TI TMS320c55x' DSP

#### 2.2.1 The Registers

1. There are four 40-bit Accumulators AC0, AC1, AC2, AC3. Accumulators are used to execute arithmetic and logic operations. The highest 8-bit of the accumulator is used for sign-extended purpose. As a result, the accumulators can perform 16-bit and 32-bit operations.

2. There are eight 24-bit auxiliary registers XAR0~XAR7. The auxiliary registers can used to execute arithmetic and logic operations. The other purpose of auxiliary registers is data addressing. Only the lower 16-bit of auxiliary registers, also called as AR0~AR7, can be used to perform arithmetic and logic operations.

3. There are four 16-bit temporary registers T0, T1, T2, and T3. Temporary registers are used for data addressing and to execute arithmetic and logic operations.

The symbols of the instruction set of TI TMS320c55x' DSP are denoted as:

dst, src represent ACn, ARn, and Tn

Smem Xmem Ymem means the value of an address

k4, k8, k16 means 4-bit, 8-bit, and 16-bit constants

2.2.2 Left Shift

The left-shift instructions of TI TMS320c55x'DSP with each left-shift instruction taking only one clock cycle are shown in Fig. 2-3. Note that only some of the left-shift instructions can be performed a 32-bit left-shit.



#### Figure 2-3 Left Shift Instructions[11]

#### 2.2.3 Addition

The addition instructions of TI TMS320c55x'DSP with each addition instruction taking only one clock cycle are shown in Fig. 2-4. Note that only some of the addition instructions can be performed a

32-bit addition.



| Syntax                                   | Parallel<br>Enable Bit | Size | Cycles | Pipeline |
|------------------------------------------|------------------------|------|--------|----------|
| ADD [src.] dst                           | Yes                    | 2    | 1      | х        |
| ADD k4, dst                              | Yes                    | 2    |        | х        |
| ADD K16, [src.] dst                      | No                     | 4    |        | х        |
| ADD Smem, [src.] dst                     | No                     | 3    | 1      | Х        |
| ADD ACx << Tx                            | Yes                    | 2    |        | Х        |
| ADD ACx << #SHIFTW, ACy                  | Yes                    | 3    |        | Х        |
| ADD K16 << #16, [ACx,] ACy               | No                     | 4    | 1      | Х        |
| ADD K16 << #SHFT, [ACx,] ACy             | No                     | 4    | 1      | х        |
| ADD Smem << Tx, [ACx,] ACy               | No                     | 3    | 1      | Х        |
| ADD Smem << #16, [ACx,] ACy              | No                     | 3    |        | х        |
| ADD [uns(]Smem[)], CARRY, [ACx,] ACy     | No                     | 3    |        | х        |
| ADD [uns(]Smem[)], [ACx,] ACy            | No                     | 3    |        | х        |
| ADD [uns(]Smem[)] << #SHIFTW, [ACx,] ACy | No                     | 4    |        | х        |
| ADD dbl(Lmem), [ACx,] ACy                | No                     | 3    |        | х        |
| ADD Xmem, Ymem, ACx                      | No                     | 3    |        | Х        |
| ADD K16, Smem                            | No                     | 4    |        | х        |
| ADDV [ACx.] ACy                          | Yes                    | 2    |        | х        |
|                                          |                        |      |        |          |

Figure 2-4 Addition Instructions[11]

#### 2.2.4 Subtraction

The subtraction instructions of TI TMS320c55x'DSP with each

subtraction instruction taking only one clock cycle are shown in Fig. 2-5. Note that only some of the subtraction instructions can be performed a 32-bit subtraction.



#### 2.2.5 Multiplication

The multiplication instructions of TI TMS320c55x'DSP with each multiplication instruction taking only one clock cycle are shown in Fig. 2-6 and Fig. 2-7. Note that only 16-bit by 16-bit multiplication can be performed.



## Figure 2-6 Multiplication Instructions (1) [11]





Figure 2-7 Multiplication Instructions (2) [11]

#### **2.3 The Implementation issues of Frequent Used Arithmetic of Modular Multiplication Algorithm on TI TMS320c55x' DSP**

The implementations in this chapter are also referred to section 5.1 and section 5.4 in *TMS320C55x DSP Programmer's Guide*[12]. The fixed-point arithmetic is in section 5.1 and the division is in section 5.4.

#### 2.3.1 Pipeline Protection Issue

TI TMS320c55x' DSP provides an automatic pipeline protection mechanism to prevent read-write hazard. However, it also causes the extra clock cycles. We use a 1024-Bit addition as an example shown in Fig. 2-8.

| 01              | AMOV #(VarA), XAR1           |
|-----------------|------------------------------|
| 02              | AMOV #(VarB), XAR2           |
| 03              | AMOV #(AResult), XAR3        |
| 04              | 896<br>MOV #62, T0           |
| 05              | MOV #64, T1                  |
| 06              | MOV40 dbl(*AR1(T0)), AC0     |
| 07              | ADD dbl(*AR2(T0)), AC0       |
| 08              | MOV AC0, dbl(*AR3(T1))       |
| 09              | <b>SFTS AC0, #-32</b>        |
| 10 <sup>°</sup> | MOV #30, BRC0                |
| 11              | RPTBLocal Adder_Loop         |
| 12 <sub>2</sub> | SUB #2, T0                   |
| 13              | MOV40 dbl(*AR1(T0)), AC1     |
| 14              | SUB #2, T1                   |
| 15 <sub>1</sub> | ADD #1, T0                   |
| 16              | ADD uns(*AR2(T0)), AC1       |
| 17              | ADD AC1, AC0                 |
| 18              | <b>SUB #1, T0</b>            |
| 19              | ADD uns(*AR2(T0))<< #16, AC0 |
| 20              | MOV AC0, dbl(*AR3(T1))       |
|                 | 21 Adder_Loop:               |
| 22              | SFTS AC0, #-32               |
| 23              | <b>SUB #1, T1</b>            |
| 24              | MOV AC0, *AR3(T1)            |
|                 |                              |

Figure 2-8 1024-Bit addition code with RAW hazard

The above program contains some RAW hazards. The Adder\_loop is from the line 12 through the line 22. The loop uses 10 instructions. As we learn from Section 2.2, there must be only 10 clock cycles needed for each loop. However, 20 clock cycles are consumed for each loop. We discover that the extra clock cycles are affected by the lines 12, 13, 15, 16, 18, and 19. Let's analyze the codes of the lines 12 to 13.

12 SUB #2, T0

#### 13 MOV40 dbl(\*AR1(T0)), AC1

The result of T0 at the line 12 will be used by the line 13 immediately. As a result, it causes a RAW hazard[7]. Therefore, the automatic pipeline protection mechanism of TMS320c55x' inserts the four NOPs, i.e. the four bubbles into the pipeline to assure the correct value of T0 will be used by the line 13. The extra NOPs cause the extra clock cycles. At this case, running a 1024-bit addition costs about 650 clock cycles. It's almost 6 times of executing a 1024-bit by 16-bit multiplication.

We rewrite the codes in Fig. 2-9. The adder loop is from the line 12 through the line 22. The new loop uses 8 instructions and costs 10 clock cycles. To finish, a 1024-bit addition it costs about 350 clock cycles.



| 10      | SFTS AC0, #-32                  |
|---------|---------------------------------|
| 11      | MOV #30, BRC0                   |
| $12 \,$ | <b>RPTBLocal Adder_Loop</b>     |
| 13      | SUB #2, T0                      |
| 14      | SUB #2, T1                      |
| 15      | MOV40 dbl(*AR1(T0)), AC1        |
| 16      | ADD uns(*AR2(T1)), AC1          |
| 17      | ADD AC1, AC0                    |
| 18      | ADD uns $(*AR2(T0))<<#16$ , AC0 |
| 19      | MOV AC0, $db(^*AR3(T0))$        |
|         | 20 Adder_Loop:                  |
| 21      | SFTS AC0, #-32                  |
| 22      | <b>SUB #1, T0</b>               |
| 23      | MOV AC0, *AR3(T0)               |

Figure 2-9 1024-Bit addition code with less RAW hazard

It spent 11 clock cycles running a 64-bit addition in *TMS320C55x DSP Programmer's Guide*[12] costs only 11 clock cycles. As a result, if the loop unrolling skill[7] is adopted, running a 1024-Bit addition may cost 190 clock cycles only. However, this approach will be much larger source code and make source code difficult to maintain.  $X$  1896

2.3.2 Arithmetic of Modular Multiplication Algorithm on TI TMS320c55x' DSP

According to our implementation and taking of pipeline protection, the Clock Cycles Consumption of Frequent Used Arithmetic of Modular Multiplication is shown in Table. 2-1.

Table 2-1 The clock cycles consumption of frequent used arithmetic of modular multiplication algorithm



We use the 1024-bit by 16-bit multiplication as the baseline and weight other arithmetic for evaluation. The weighted clock cycles consumption is shown in Table 2-2. Table 2-2 will be used to evaluate modular multiplication algorithms in chapter 4.

Table 2-2 The weighted clock cycles consumption of frequent used arithmetic of modular multiplication algorithm on TI TMS320c55'

| Operation                  | Real          | Unrolling |
|----------------------------|---------------|-----------|
| 1024 bits shift            | .75           |           |
| 1024 bits + 1024 bits      | 2.15          | 0.95      |
| 1024 bits - 1024 bits      | 2.15          | 0.95      |
| 1024 bits $\times$ 16 bits |               |           |
| 32 bits $\div$ 16 bits     | $0.5^{\circ}$ |           |
| 1024 bits comparison       | Ი~Ნ           |           |



#### Chapter 3 Background on Iterative Modular Multiplication

In this chapter, we will go through some well know iterative modular multiplication algorithm. Our goal is to re-evaluate those algorithms computational complexity in general purpose CPU and TI TMS320c55x' DSP. We try to make a guideline for choosing a properly modular-multiplication algorithm for TI TMS320c55x' DSP.

#### **3.1 Single Precision Left-to-Right Convention Algorithm**

Single precision means "perform a modular multiplication bit by bit". Left-to- right convention means "perform a modular multiplication from the most significant bit to the less significant bit". All the single precision left-to-right convention algorithms are suitable for hardware implementation.

- 3.1.1 Traditional Algorithm of Modular Multiplication Algorithm Modula Multiplication  $(X, Y, D)$  $M$ nput: X, Y and D are the n-bit positive integers //Output:  $P = X \times Y \mod D$ , P is an n-bit positive integer 1:  $P \leftarrow 0$ : 2: for  $i := n-1$  downto 0 do 3: begin 4:  $P \leftarrow 2 \times P$ ; 5: if  $(P \ge D)$  then 6:  $P \leftarrow P - D$ ; 7: if(  $y_i = 1$  ) then 8: begin 9:  $P \leftarrow P + X$ ; 10: if  $(P \ge D)$  then
	- 11:  $P \leftarrow P D$ ;

12: end; 13: end; 14:  $return(P);$ 

#### **Time Complexity**

This algorithm is suitable for hardware because left shift costs nothing in hardware. Assume that P has 50% possibility larger than D. As a result, this algorithm needs n times of n-bit shift, n times of n-bit addition, and n times of n-bit subtraction.

Table 3-1 The time complexity of Traditional Algorithm



#### 3.1.2 Blakley Algorithm[5]

Algorithm Blakley $(X, Y, D)$ //Input: X, Y and D are the n-bit positive integers //Output:  $P = X \times Y \mod D$ , P is an n+1-bit positive integer 1:  $P \leftarrow 0$ : 2:  $e \leftarrow D - X;$ 3: for  $j := n-1$  downto 0 do 4: begin 5: if ( $P \ge D$ ) then 6:  $P \leftarrow 2 \times (P - D)$ ; 7: else 8:  $P \leftarrow 2 \times P$ ; 9: if( $y_i = 1$ ) then 10: if (  $P \geq D$ ) then 11:  $P \leftarrow P - e$ 12: else 13:  $P \leftarrow P$ 14: end; 15: if ( $P \geq D$ ) then 16:  $P \leftarrow P - D$ ; 17: return(P);

Blakley Algorithm introduced a pre-computed integer e. As a result, it costs less than the traditional modular multiplication algorithm.

#### **Time Complexity**

Assume that P has 50% possibility larger than D. As a result, this algorithm needs n times of n-bit shift, 1/2n times of n-bit addition, and n times of n-bit subtraction.

Table 3-2 The time complexity of Blakely's Algorithm



### **Storage Complexity**

The lengths of P and e are n+1 bits and n bits, respectively. Note that in process are need n+1 bits for the length of P, but we only output n bits for the length of P.



#### 3.1.3 Chiou & Yang's Algorithm[2]

Algorithm Chiou  $Yang1(X, Y, D)$  $//Input: X, Y, and D are the n-bit positive integers$ //Output:  $P = X \times Y \mod D$ , P is an n+1-bit positive integer 1:  $P \leftarrow 0$ : 2: R $\leftarrow$  2<sup>n</sup> mod D; 3: for  $j := n-1$  downto 0 do 4: begin 5:  $P \leftarrow 2 \times P$ ; 6: if  $(carrow(P))$  then 7:  $P \leftarrow P + R$ ; 8: if( $y_i = 1$ ) then 9: begin 10:  $P \leftarrow P + X;$ 11: if  $(carrow(P))$  then 12:  $P \leftarrow P + R$ ; 13: end; 14: end; 15: if ( $P \ge D$ ) then 16:  $P \leftarrow P - D$ ; 17: return(P);

Algorithm Chiou Yang  $2(X, Y, D)$ 

//Input: X, Y and D are the n-bit positive integers

//Output:  $P = X \times Y \mod D$ , P is an n-bit positive integer

- 1:  $P \leftarrow 0$ :
- 2:  $C \leftarrow 0$ ;
- 3: R1  $\leftarrow$  2<sup>n</sup> mod D;
- 4: R2 $\leftarrow$  2 × 2<sup>n</sup> mod D;
- 5: R3 $\div$  3  $\times$  2<sup>n</sup> mod D;
- 6: T1  $\leftarrow$  (2<sup>n</sup> + X) mod D;
- 7: T2  $\leftarrow$  (2 $\times$  2<sup>n</sup> + X) mod D;

8: T3  $\leftarrow$  (3  $\times$  2<sup>n</sup> + X) mod D; 9: for  $i := n-1$  downto 0 do 10: begin 11:  $P \leftarrow 2 \times P$ ; 12: if  $(carry(P))$  then 13:  $c \leftarrow c + 1;$ 14: if( $y_i = 1$ ) then 15: case c of 16: 0:  $z \leftarrow z + X;$ 17:  $1: z \leftarrow z + T1;$ 18:  $2: z \leftarrow z + T2;$ 19:  $3: z \leftarrow z + T3;$ 20: end case; 21: else 22: case c of 23:  $1: z \leftarrow z + S1$ 24:  $2: z \leftarrow z + z$ 25:  $3: z \leftarrow z + S3$ ; 26: end case; 27: if  $(carry(P))$  then  $\mathcal{L}_{\text{max}}$ 28:  $c \leftarrow 2$ ; 29: else 30:  $c \leftarrow 0;$ 31: end; 32: if ( $P \geq D$ ) then 33:  $P \leftarrow P - D;$ 34: return(P);

Chiou & Yang's Algorithm 1 uses  $Carry(P)$  $2^n$  and pre-computed table to perform the modular-multiplication. The time complexity for the algorithm is n times "n-bit left shift" and  $2/3n$  times "n-bit addition". It also requires  $2n+1$  bits storage space for P and R.

Later Chiou & Yang proposed another algorithm, called Algorithm 2, to improve the performance. The major difference between them is that Algorithm 2 using six-element pre-computed table instead of using one-element in Algorithm 1. In algorithm 2 it requires 7n+2 bits storage space for P and pre-computed table.

#### **Time and Storage Complexity**

The comparison between two Algorithms are shown in table 3-3.







#### **3.2 Multi-Precision Left-to-Right Convention Algorithm**

Multi-precision means "perform modular multiplication block by block". Left-to-right convention means "perform modular multiplication from the most significant block to the less significant block". All the multi-precision algorithms are more suitable for software implementation.

3.2.1 Leong, Tan &Tan's Algorithm [8]

Algorithm Leong Tan  $Tan(X, Y, D)$  $\mathcal{U}(P) = X \times Y \mod D$ , where  $0 \leq X, Y \leq D$  $\frac{n}{2}$  //  $D = 2^{n-1} + \sum_{n=2}^{n-2}$ =  $=2^{n-1}+\sum_{n=2}^{n-2}$ 0  $2^{n-1} + \sum_{i=1}^{n-2} d_i 2$ *i i*  $D = 2^{n-1} + \sum_{i=1}^{n-2} d_i 2^i,$   $X = \sum_{i=1}^{n-1} d_i$ = = 1  $\bf{0}$ 2 *n i*  $X = \sum x_i 2^i$  $\frac{n-1}{2}$ =  $=\sum_{n=1}^{n-1}$ 0 2 *n i*  $Y = \sum y_i 2^i$ ,  $P = \sum$ − = = 1  $\bf{0}$ 2 *n i*  $P = \sum p_i 2^i$ , //w is the bit amount of a digit 1:  $e[0] \leftarrow 2^n - D;$ 2:  $for (i = 1, i \leq w, i \leftarrow i + 1)$ 3: begin 4:  $e[i] \leftarrow (e[i-1] \times 2);$ 5: *if*  $e[i] \ge D$  *then* 6:  $e[i] \leftarrow (e[i] + e[0]) \mod 2^n;$ 7: end; 8:  $P \leftarrow 0$ : 9:  $for (i = \left| \frac{n}{w} \right|, i > 0, i \leftarrow i-1)$  $=\frac{n}{i}, i > 0, i \leftarrow i$ *w*  $for(i = \boxed{\frac{n}{n}}$ 10: begin 11:  $P \leftarrow X \times y_{iw-1} y_{iw-2} ... y_{iw-w} + P \times 2^w$ 12: *for*( $j = 0, j \leq w, j \leftarrow j + 1$ ) 13: begin 14: *if*  $p_{j+n}$  then

15: begin 16:  $P \leftarrow P - 2^{j+n} + e[j];$ 17: *if*  $p_n = 1$  *then*  $P \leftarrow P - 2^n + e[0];$ 18: end; 19: end; 20: end; 21: *if*  $P \ge D$  *then*  $P \leftarrow (P + e[0])$  mod *D*; 22:  $return(P)$ ;

Leong, Tan &Tan's Algorithm is suitable for software implementing. Even though they claim that the algorithm is a multi-precision algorithm, but it looks like a single precision. The algorithm does not reduce the time complexity significantly.

#### **Time Complexity**





The time spent on operating a multiplication is much larger than operating an addition. Then measuring the time complexity, we always neglect the addition part. Then, Leong, Tan &Tan's Algorithm needs only  $2n^2$  multiplications.

For a realistic estimate, the operation of Leong, Tan &Tan's Algorithm list in Table 3-6.

Table 3-4 The time complexity of Leong, Tan &Tan's Algorithm



#### **Storage Complexity**

Leong, Tan & Tan's Algorithm needs  $(w^2+2w)n'+w$  bits.

#### 3.2.2 b-ary number system

Let b be a positive integer and  $B = \{0,1,...b-1\}$ 

We denote Z, Q, R be the set of integers, the set of rational numbers, and the set of real numbers, respectively.

Any number r in Q can be represent by  $r = \sum_{i=d}^{s}$ *i d*  $r = \sum r_i b^i$  for each  $r_i$  in B.

We have  $d \ge 0$  *if*  $r \in Z$ ;  $d < 0$  *if*  $r \in Q - Z$ . Usually, we take

 $-\infty < d, s < \infty$ .

Let 
$$
D = \sum_{i=0}^{n-1} d_i b^i
$$
 and  $\frac{b}{2} \le d_{n-1} \le b - 1$   
\nLet  $X = \sum_{i=0}^{n-1} x_i b^i$  and  $Y = \sum_{i=0}^{n-1} y_i b^i$   
\nLet  $P = \sum_{i=0}^{n+1} p_i b^i$  and  $P_L = \frac{p_{n+1} \times b + p_n}{p_n \times b + p_{n-1}}$   
\nLet  $P' = \sum_{i=0}^{n} p_i b^i$  and  $P'_L = \frac{p_n \times b + p_{n-1}}{1896}$ 

3.2.3 Morita & Yang's Algorithm[6]

Algorithm Morita\_Yang(X, Y, D) //Input: X, Y and D are n digits, where  $0 \le X, Y \le D$ //Output:  $P = X \times Y \mod D$ . P is n+2 digits 1:  $P\bigoplus$ : 2: for j:= *n* −1 downto 0 do

3: begin

4: 
$$
P \leftarrow b \times P + X \times y_i;
$$

$$
5: \qquad q \leftarrow \left\lfloor \frac{P_L}{d_{n-1}+1} \right\rfloor;
$$

6:  $P \leftarrow P - b \times q \times D;$ 

7: 
$$
q \leftarrow \left\lfloor \frac{P_L}{d_{n-1}+1} \right\rfloor;
$$

8: if  $(q=1)$  then 9:  $P \leftarrow P - b \times D$ ; 10:  $if(q=2)$  then 11:  $P \leftarrow P - b \times (2 \times D)$ ; 12: if( $P'_L \ge (b-1) \times (d_{n-1}+1)$ ) then 13:  $P \leftarrow P - (b - 1) \times D$ ; 14: end; 15: if ( $P \geq D$ ) then 16: begin  $\left| \frac{P_L}{d} \right|$  $\mathsf I$  $q \leftarrow \left| \frac{P_L}{I} \right|$ ; 17:  $q \leftarrow \left| \frac{I_L}{d} \right|$ *L* ← + *d*  $_{n-1}$  + 1 ⎣ ⎦ 18:  $P \leftarrow P - q \times D$ ; 19: While  $(P \ge D)$  do 20:  $P \leftarrow P - D$ ; **AMARIA** 21: end; 22:  $return(P);$ Morita & Yang's Algorithm is suitable for implementing in

software for a low storage and computational power CPU. The advantages of the algorithm are:

- i) Using an approximation value of q. It can lower the execution time.
- ii) Using lookahead determinate approach. It can restrict the value in the small range, then it can guess an estimation value shortly.

Actually, the above two condition can also lower the running time and shorter the storage space.

#### **Time Complexity**

We should notice that n presents the number of b-ary digit.

The time spent on operating a multiplication is much larger than operating an addition. Then measuring the time complexity, we

always neglects the addition part. Then, Morita & Yang's Algorithm needs  $2n^2 + 2n$  multiplications.

For a realistic estimate, the operations of Morita & Yang's Algorithm list in Table 3-5.

Table 3-5 The time complexity of Morita & Yang's Algorithm



**Storage Complexity** 

Morita & Yang's Algorithm needs n+2 digits.



#### Chapter 4 A New Iterative Modular Multiplication Algorithm

We combine both of the algorithms described in the previous chapter to derive a new iterative modular multiplication algorithm. Our algorithm need less pre-computed table space than Leong, Tan & Tan's Algorithm and achieve the same performance like Morita & Yang's Algorithm. Furthermore, our proposed algorithm is better than Morita & Yang's Algorithm.

#### **4.1 The New Modular Multiplication Algorithm**

The proposed algorithm is given in Section 4.1.1 and its correctness is shown in Section 4.1.2

4.1.1 Procedure

Algorithm New Modular Multiplication $(X, Y, D)$ //Input: X, Y and D are n digits, *where*  $0 \le X, Y \le D$ //Output:  $P = X \times Y \mod D$ . P is n digits.

- 1:  $e \leftarrow Pre-computation(D)$ ;
- $2:$  $\widehat{P}$   $\leftarrow$  Approximate(X, Y, D, e);
- 3: *while*  $(\hat{P} \ge D)$  *do* // The length of  $\hat{P}$  is n+2 digits.
- 4:  $\hat{P} \leftarrow \hat{P} D$ ;
- 5:  $P \leftarrow \hat{P}$ ;
- 6: return(P);

Algorithm Pre-computation(D)

//Input:D is n digits

// Output:  $e[i] = i \times b^{n+1} \mod D$ ,  $1 \le i \le 6$ 

1:  $e[1] \leftarrow b^{n+1} \mod D;$ 

2:  $for (i = 2, i \le 6, i \leftarrow i + 1)$ 3: begin 4:  $e[i] \leftarrow (e[i-1] + e[1]);$ 5: *if*  $e[i] \ge D$  *then*  $e[i] \leftarrow e[i] - D$ ; 6: end;  $7:$  return(e);

Algorithm Approximation $(X, Y, D, e)$ // Input: X, Y and D are n digits, *where*  $0 \le X, Y \le D$ // Output:  $P = X \times Y \mod D$ . P is n+2 digits 1:  $P \leftarrow 0$ ; 2:  $for (i = n-1, i > 0, i \leftarrow i-1)$ 3: begin 4:  $P \leftarrow X \times y_i + b \times P;$ 5:  $q \leftarrow \left| \frac{I_L}{I_L} \right|$ ⎦  $\left| \frac{P'_L}{I} \right|$ ⎣  $\mathsf I$ + ←  $-1$ <sup>+1</sup> '  $n-1$ *L d*  $q \leftarrow \left| \frac{P}{\cdot} \right|$ 6:  $P' \leftarrow P'-q \times D;$ 7: *if*  $p_{n+1} > 0$  then  $P \leftarrow P + e[p_{n+1}]$ ; 8: end; 9: return(P);

The Algorithm Pre-computation calculates six pre-computed elements, i.e.  $b^{n+1}$ ,  $2 \times b^{n+1}$ ,  $3 \times b^{n+1}$ ,  $4 \times b^{n+1}$ ,  $5 \times b^{n+1}$ , and  $6 \times b^{n+1}$ mod D. The pre-computed elements store in e.

The Algorithm Approximation uses approximation method and pre-computed table to calculate  $P= X \times Y$  mod D. The approximation method uses  $\frac{I_L}{I_L}$ ⎦  $\left| \frac{P_L'}{A} \right|$ ⎣  $\mathsf I$  $_{-1}$  +1 '  $n-1$ *L*  $\left| \frac{P_L}{d_{n-1}+1} \right|$  instead of  $\left| \frac{P}{D} \right|$  $\vert$ *D*  $\left| \frac{P}{P} \right|$  to calculate quotient q. The approximation method reduces great amount of

computational time. Certainly, the approximate quotient is not

equal to the real quotient. The difference in value between the approximate value and the real value of quotient causes that P may great than D. We use the pre-computed table to limit P to 6D.

However, the output of Algorithm Approximation, i.e.  $\hat{P}$ , may still larger than D. The line 3 and 4 of Algorithm New\_Modular\_Multiplication truncate the  $\hat{P}$  to  $\hat{P} < D$ .

4.1.2 Correctness

Let  $\varepsilon = \left| \frac{I_L}{d} \right|$  $\overline{\phantom{a}}$  $\left|\frac{P'_L}{I_{L+1}}\right|$ ⎣  $\mathsf{I}$  $=\left[\frac{L}{d_{n-1}+1}\right]$ '  $n-1$ *L d*  $\mathcal{E} = \frac{P}{P}$ Let  $t = \frac{i=0}{h^{n-1}}$ 2 0 − −  $\sum_{i=0}$  $=\frac{i=0}{\hbar}$ *n i i i b p b*  $t = \frac{i=0}{r-1}$ , we have  $0 \le t < 1$ . we have  $P' = (P'_L + t) \times b^{n-1}$ ,  $E[S]$ and  $(P'_L+1) \times b^{n-1} > P' > P'_L \times b^{n-1}$  *from*  $0 \le t < 1$ *n L* Let  $a = \frac{i=0}{h^{n-1}}$ 2 0 − −  $\sum_{i=0}$  $=\frac{i=0}{h^n}$ *n i i i b*  $d<sub>i</sub>b$  $a = \frac{i=0}{i}$ , we have  $0 \le a < 1$ .

It implies  $D = (d_{n-1} + a) \times b^{n-1}$  and  $(d_{n-1} + 1) \times b^{n-1} > D \ge d_{n-1} \times b^{n-1}$ 1 1  $(d_{n-1} + a) \times b^{n-1}$  and  $(d_{n-1} + 1) \times b^{n-1} > D \ge d_{n-1} \times b^{n-1}$ −  $=(d_{n-1}+a)\times b^{n-1}$  and  $(d_{n-1}+1)\times b^{n-1} > D \ge d_{n-1}\times b^{n-1}$ *n*  $D = (d_{n-1} + a) \times b^{n-1}$  and  $(d_{n-1} + 1) \times b^{n-1} > D \ge d_{n-1} \times b$ 

For  $b, c, M \in \mathbb{Z}$ 

*b* (mod *M*) +  $c \equiv (b+c)$  mod *M* 

We consider *P* mod  $D = (p_{n+1} \times b^{n+1} + P')$  mod *D* 

 $\equiv p_{n+1} \times b^{n+1} \mod D + P' \mod D$ 

We compute  $p_{n+1}$  mod *D* by using table look-up and

compute *P*' mod *D* by using approximation method.

Theorem 1: 
$$
0 \le P - \left[ \frac{P'_L}{d_{n-1} + 1} \right] \times D < 5D
$$

Proof:

From  $P' \ge P'_{L} \times b^{n-1}$  and  $(d_{n-1} + 1) \times b^{n-1} > D$  $-1$  and  $(d + 1) \times h^{n-1}$ 1  $l \geq P'_1 \times b^{n-1}$  and  $(d_{n-1} + 1)$ 

We have 
$$
\frac{P'}{D} \ge \frac{P'_{L} \times b^{n-1}}{(d_{n-1} + 1) \times b^{n-1}} \ge \left[ \frac{P'_{L} \times b^{n-1}}{(d_{n-1} + 1) \times b^{n-1}} \right] = \varepsilon
$$

It implies *P*'−<sup>ε</sup> × *D* ≥ 0 ···················································· (1)

We have  $(P'_L+1) \times b^{n-1} > D$ 

$$
\Rightarrow \frac{(P'_{L}+1)}{d_{n-1}+a} > \frac{P'}{D} \quad \text{from} \quad D = (d_{n-1}+a) \times b^{n-1}
$$
\n
$$
\Rightarrow \frac{(P'_{L}+1)}{d_{n-1}+a} - \frac{P'_{L}}{d_{n-1}+1} > \frac{P'}{D} - \frac{P'_{L}}{d_{n-1}+1}
$$
\n
$$
Max\{\frac{(P'_{L}+1)}{d_{n-1}+a} - \frac{P'_{L}}{d_{n-1}+1}\} = \frac{P'_{L}+1}{d_{n-1}+a} - \frac{P'_{L}}{d_{n-1}+a} + \frac{P'_{L}}{d_{n-1}+1}
$$
\n
$$
\frac{b}{2} \times d_{n-1} \times b - 1; 0 \le P_{L} \times b^{2} - 1 \quad d_{n-1} + 1 = 0 \times b^{2} + \frac{b}{2} + 2b
$$
\n
$$
\text{We know } 0 < \frac{6}{b+2} < 1, \quad \text{if} \quad b \ge 4
$$
\n
$$
\text{Then } 3 < 4 - \frac{6}{b+2} < 4
$$
\n
$$
\text{Thus } 4 > \frac{(P'_{L}+1)}{d_{n-1}+a} - \frac{P'_{L}}{d_{n-1}+1}
$$
\n
$$
\Rightarrow 5 > \frac{(P'_{L}+1)}{d_{n-1}+a} - \frac{P'_{L}}{d_{n-1}+1}
$$
\n
$$
\text{From (2) and (3)}
$$

We have  $\frac{r}{2} - \frac{r}{1}$   $< 5$ 1 '  $\mid P' \mid P'$ 1  $\vert$  < ⎦  $\left| \frac{P_L'}{I} \right|$ ⎣  $\mathsf{L}$  $-\left[\frac{I_L}{d_{n-1}}\right]$ *L d P D P*

Thus, 
$$
P - \left[ \frac{P'_L}{d_{n-1} + 1} \right] \times D < 5D
$$

From  $(1)$  and  $(4)$ 

We have 
$$
0 \le P' - \left[ \frac{P'_L}{d_{n-1} + 1} \right] \times D < 5D
$$

In Algorithm Approximation we need six elements for out pre-computed table.



#### **4.2 Implement Issues**

The proposed algorithm is easy to be implemented. However, there are two conditions must to be handle. First, the quotient q is some times greater than one digit in the Algorithm Approximation. Then, there are no other instructions to compute  $b^{n+1}$  mod  $D$ directly in the Algorithm Pre-computation. We discuss the two issues at Section 4.2.1 and Section 4.2.2, respectively.

#### 4.2.1 Calculate  $P' \leftarrow P' - q \times D$  with two-digit q

Lemma 1: Let 
$$
q_1 \times b + q_0 =
$$
  
\nProof:  
\n
$$
q_1 \times b + q_0 = \left\lfloor \frac{P_L}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_1 = \left\lfloor \frac{p_n}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_2 = \left\lfloor \frac{P_L}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_3 = \left\lfloor \frac{P_R}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_4 = \left\lfloor \frac{P_R}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_5 = \left\lfloor \frac{P_R}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_6 = \left\lfloor \frac{P_R}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_7 = \left\lfloor \frac{P_R}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_8 = \left\lfloor \frac{P_R}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_9 = \left\lfloor \frac{P_R}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_1 = \left\lfloor \frac{P_R}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_2 = \left\lfloor \frac{P_R}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_3 = \left\lfloor \frac{P_R}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_4 = \left\lfloor \frac{P_R}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_5 = \left\lfloor \frac{P_R}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_6 = \left\lfloor \frac{P_R}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_7 = \left\lfloor \frac{P_R}{(d_{n-1} + 1)} \right\rfloor
$$
\n
$$
\Rightarrow q_8 = \left\lf
$$

We knew  $P'_{L} = p_{n} \times b + p_{n-1}$ ,  $0 \le P'_{L} \le b^{2} - 1$  and  $\frac{b}{2} \le d_{n-1} \le b - 1$ 

It implies  $0 \le p_n \le b-1$ 

2 <sup>1</sup> } <sup>2</sup> <sup>1</sup> { <sup>1</sup> <sup>1</sup> <sup>2</sup> <sup>0</sup> 1; <sup>1</sup> + <sup>−</sup> <sup>=</sup> <sup>×</sup> <sup>≤</sup> <sup>≤</sup> <sup>−</sup> <sup>≤</sup> <sup>−</sup> <sup>≤</sup> <sup>−</sup> <sup>−</sup> + *b b d <sup>p</sup> Max n n <sup>d</sup> <sup>b</sup> <sup>b</sup> pn b <sup>n</sup>* ······································ (6) 1 2 + *b* <sup>1</sup> <sup>0</sup> <sup>&</sup>lt; <sup>−</sup> <sup>≤</sup> *<sup>b</sup>* , for all *<sup>b</sup>* <sup>≥</sup><sup>1</sup>

From (5), we have  $q_1$  is an integer.

From (6), we have  $2 \times \frac{b-1}{b+2} \ge q_1$ +  $\times \frac{b-}{b}$ 

Thus,  $q_1 \leq 1$ 

Base on the LEMMA, we modify the Algorithm Approximation to Algorithm Approximation\_M.

Algorithm Approximation $_M(X, Y, D, e)$ // Input: X, Y and D are n digits, *where*  $0 \le X, Y \le D$ // Output:  $P = X \times Y \mod D$ . P is n+2 digits 1:  $P \leftarrow 0$ ; 2:  $for (i = n-1, i > 0, i \leftarrow i-1)$ 3: begin 4:  $P \leftarrow X \times y_i + b \times P;$  $q_1 \parallel q_0 \leftarrow \left| \frac{P}{q_0} \right|$  $\left| \frac{P_L'}{I} \right|$  $\mathsf{I}$  $|| q_0 \leftarrow \left| \frac{P'}{P} \right|$ 5:  $q_1 \parallel q_0 \leftarrow \left| \frac{I_L}{I_L} \right|$ ← *L*  $_1$  ||  $q_0$ *d*  $_{-1}$  + 1 + ⎣ ⎦ **SALLES** *n* 1 6: *if*  $q_1 > 0$  *then*  $P' \leftarrow P' - b \times D$ ; 7:  $P' \leftarrow P'-q_0 \times D;$ 8: *if*  $p_{n+1} > 0$  then  $P \leftarrow P' + e[p]$ 9: end; 10: return(P);

4.2.2 Calculating  $b^{n+1}$  mod D

There are no other instruction to compute  $b^{n+1}$  mod  $D$ directly. We use the approximation method to calculate  $b^{n+1}$  mod  $D$ .

Algorithm  $e1(D)$ //Input:D is n digits // Output:  $e[1] = b^{n+1} \mod D$ 1:  $e[1] \leftarrow b^{n+1}$ ;

2: 
$$
q \leftarrow \left[ \frac{e[1]_{n+1} || e[1]_{n} || e[1]_{n-1}}{d_{n-1} + 1} \right];
$$
  
\n3:  $e[1] \leftarrow e[1] - q \times D;$   
\n4: *while*  $(e[1] \geq D)$  *do*  
\n5:  $e[1] \leftarrow e[1] - D;$   
\n6: return(e[1]);

The line 2 of Algorithm e1 executes a three-digit dividend divided by one-digit divisor. We know that the first digit of e[1] is  $\frac{b}{2} \le d_{n-1} \le b-1$ . Therefore, the expression e[1]-b × D 1 and  $\frac{b}{2} \leq d_{n-1} \leq b-1$ operation will make the first digit of e[1] equal to 0. It means that only two-digit dividend divided by one-digit divisor is needed. Then, referred to Lemma 1, we modify the Algorithm e1 to Algorithm e1\_M. **SALE IS A LE** 

Algorithm e1\_M(D)  
\n//Input: D is n digits  
\n// Output: 
$$
e[1] = b^{n+1} \mod D
$$
  
\n1:  $e[1] \leftarrow b^{n+1}$ ;  
\n2:  $e[1] \leftarrow e[1] - b \times D$ ;  
\n3:  $q_1 || q_0 \leftarrow \left[ \frac{e[1]_n || e[1]_{n-1}}{d_{n-1} + 1} \right]$ ;  
\n4: if  $q_1 > 0$  then  $e[1] \leftarrow e[1] - b \times D$ ;  
\n5:  $e[1] \leftarrow e[1] - q_0 \times D$ ;  
\n6: while  $(e[1] \ge D)$  do  
\n7:  $e[1] \leftarrow e[1] - D$ ;  
\n8: return(e[1]);  
\nUsing Algorithm e1\_M, we can modify Algorithm

Pre-computation to Algorithm Pre-computation\_M.

Algorithm Pre-computation\_M(D) //Input:D is n digits // Output:  $e[i] = i \times b^{n+1} \mod D$ ,  $1 \le i \le 6$ 1:  $e[1] \leftarrow e1$  M(D); 2:  $for (i = 2, i \le 6, i \leftarrow i + 1)$ 3: begin 4:  $e[i] \leftarrow (e[i-1] + e[1]);$ 5: *if*  $e[i] \ge D$  *then*  $e[i] \leftarrow e[i] - D;$ 6: end;  $7:$  return(e);

#### 4.2.3 Complete Algorithm

Consider implement issues, we modify the Algorithm New\_Modular\_Multiplication to New\_Modular\_Multiplication\_M. E ES)

Algorithm New\_Modular\_Multiplication\_M(X, Y, D) //Input: X, Y and D are n digits, *where*  $0 \le X, Y < D$ //Output:  $P = X \times Y \mod D$ . *P* is n digits

1:  $e \leftarrow Pre-computation_M(D);$ 

- 2: *P*  $\hat{P} \in$ Approximate\_M(X, Y, D, e);
- 3: *while*  $(\hat{P} \ge D)$  *do* // The length of  $\hat{P}$  is n+2 digits.
- 4:  $\hat{P} \leftarrow \hat{P} D$ ;
- 5:  $P \leftarrow \hat{P}$ ;
- 6:  $return(P);$

#### **4.3 Special Case**

If  $d_{n-1} = b-1$  is used,  $d_{n-1} + 1 = b$ . As a result, no more division operation is needed and no more  $q_1$  exist, nether. In this condition, the Algorithm Approximation\_M become Algorithm Approximation\_SPC.

Algorithm Approximation\_SPC(X, Y, D, e) // Input: X, Y and D are n digits, *where*  $0 \le X, Y \le D$ // Output: P=  $X \times Y$  mod D. P is n+2 digits 1:  $P \leftarrow 0$ ; 2:  $for (i = n - 1, i > 0, i \leftarrow i - 1)$ 3: begin 4:  $P \leftarrow X \times y_i + b \times P;$ 5:  $q \leftarrow p_n;$ 6:  $P' \leftarrow P'-q_0 \times D;$ 7: *if*  $p_{n+1} > 0$  then  $P \leftarrow P + e[p_{n+1}]$ ; 8: end; 9: return(P);

#### **4.4 Complexity**

We discuss our new algorithm in general purpose CPU and the TI MS3200 55x' DSP, respectively.

#### 4.4.1 General Purpose CPU

Usually executing a multiplication is much longer than executing a addition. Then, we only consider time complexity of running the multiplications. The other assumption is that a 2-digit dividend divided by 1-digit divisor need about the same clock cycles as a multiplication operation. The loop repeats n times and each loop performs  $2n+1$  times multiplication. Totally, the new algorithm will take  $2n^2+n$  times of multiplication. In the special case, our algorithm needs only  $2n^2$  times of multiplication.

# 4.4.2 TI MS3200C 55x' DSP<sup>2</sup>

Based on Fig. 4-9 the modified of Proposed Algorithm, the total computational effort is list in Table 4-1.

Table 4-1 The computational effort of Proposed Algorithm

1896



#### 4.4.3 Storage Complexity

The new proposed algorithm required  $7n+2$  digits of b-ary numbers.

#### **4.6 Compare with Other Algorithms**

Base on table 2-2, and the estimated clock cycles consumption results of chapter 3, we derive table 4-2. In this table, we assume n is 1024, w is 16.

Table 4-2 Comparing with other iterative algorithm by weighted clock cycles consumption

| <b>Modular Multiplication Algorithm</b> | ideal             | real               |
|-----------------------------------------|-------------------|--------------------|
| <b>Traditional Algorithm</b>            | $8.650 \text{ n}$ | 11.050n            |
| <b>Blakely's Algorithm</b>              | 8.175 n           | 9.975 <sub>n</sub> |
| Chiou & Yang's Algorithm 1              | $2.175 \text{ n}$ | 3.975 <sub>n</sub> |
| Chiou & Yang's Algorithm 2              | $1.700 \text{ n}$ | 2.900 <sub>n</sub> |
| Morita & Yang's Algorithm               | $0.366$ n         | 0.590n             |
| Leong, Tan & Tan's Algorithm            | $3.921 \text{ n}$ | 8.797 <sub>n</sub> |
| New Proposed Algorithm (best case)      | $0.273$ n         | 0.694n             |
| New Proposed Algorithm (average case)   | $0.334$ n         | 0.559n             |
| New Proposed Algorithm (worst case)     | $0.394$ n         | 0.461n             |

In average, our proposed algorithm will take the same time as Morita & Yang's Algorithm. In special case, new proposed algorithm will be better than Morita & Yang's Algorithm.

#### Chapter 5 Conclusion

In this thesis, we study the architecture of TI TMS320c55x' DSP and its instruction set. We show that only consider multiplication as the main computational complexity is not enough. And also, we derive a evaluation model to estimate the computational complexity of modular modulation algorithm for TI TMS320c55x' DSP.

Moreover, we proposed a new iterative modular multiplication algorithm. The new proposed algorithm can achieve the same performance as Morita & Yang's Algorithm in average, and have better performance in certain cases.

## **Future work**



Our modular multiplication evaluation model is based on TI TMS320c55x' DSP, now. In the feature we could extend the evaluation model to other new proposed TI DSP family.

## **REFERENCES**

- [1] C.W. Chiou, "Parallel Implementation of the RSA Public-Key Cryptosystem" Intern. J. Computer Math., vol. 48, pp135-155,1993.
- [2] C.W. Chiou & T.C. Yang, "Iterative modular multiplication algorithm without magnitude comparison" Electronics Letters, vol. 30, no.24, pp2017-2018, 1994.
- [3] D.E. Knuth, "The Art of Computer Programming" Addition-Welsey, 2nd, 1981.
- [4] "FIPS PUB 186-2: Digital Signature Standard (DSS)", NIST, 2000.
- [5] G.R Blakley, "A computer Algorithm for Calculating the Product AB Modulo M" IEEE Transaction On Computer, vol. c-32. no. 5, pp497-500, 1983.
- [6] H. Morita & C.H. Yang, "A Modular-Multiplication Algorithm Using Lookahead Determination" IEICE Trans. Fundamentals, vol.E76-A, no.1, 1993.
- [7] J.L. Hennessy & D.A. Patterson "Computer Architecture—A Quantitative Approach" Morgan-Kaufmann, 2nd, 1995.
- [8] P.C. Leong, E.C. Tan & P.C. Tan, "An Iterative Modular Multiplication Algorithm", Computers and Mathematics with Applications vol.44, pp175-180, 2002.
- [9] R. Rivest, A. Shamir, and L. Adleman, ″A method for obtaining digital signature and public-key cryptosystems″, Commun. of ACM, vol.21, no.2, pp.120-126, 1978.
- [10] S.Y. Yan, "Number Theory for Computing", Springer, 2000.
- [11] "SPRU374 TMS320C55x DSP Mnemonic Instruction Set Reference Guide", Texas Instruments, 2001.
- [12] "SPRU376 TMS320C55x DSP Programmer's Guide", Texas Instruments, 2000.
- [13] "SPRU393 TMS320C55x Technical Overview", Texas Instruments, 2000.
- [14] T. ElGamal, "A public Key Cryptosystem and a Signature Scheme Based on Discrete Logarithms" IEEE Trans. On Info. Theory, vol.31, pp469-472, 1985.
- [15] W. Diffie & M. Hellman, "New Directions in Crytography" IEEE Trans. On Info. Theory, vol.22, no.6, pp637-647,1976.

