## Computer Arithmetic:

Go forth and multiply

Hossam A. H. Fahmy
(C) Hossam A. H. Fahmy

## Loop on $Y$

while ( $\mathrm{Y}>0$ ) \{product $=$ product $+\mathrm{X} ; \mathrm{Y}=\mathrm{Y}-1 ;$ \}

- This is the simplest implementation.
- Feasible in both hardware and software.
- If $Y$ has $n$ bits this algorithm takes up to an $\mathcal{O}\left(2^{n}\right)$ steps.
$\Rightarrow$ Slow and with variable latency
- For integers

$$
\text { Product }=\underbrace{X+X+\cdots+X}_{Y \text { times }}
$$

- All the computer numbers are represented as integers.
- Depending on the time and resources allowed for this operation, several implementations are possible.


## Loop on the bits of $Y$

The add and shift method examines the bits of $Y$.

1. If the bit $Y[0]=1$, add $X$.
2. Shift both the product and $Y$ to the right one bit.
3. Repeat for the $n$ bits of $Y$.

Fixed latency of $\mathcal{O}(n)$.

Booth proposed an algorithm based on the fact that

$$
\begin{array}{ll}
\text { a string of ones } & \cdots 011 \cdots 110 \cdots \\
\text { is equal to } & \cdots 100 \cdots 010 \cdots
\end{array}
$$

Instead of adding repeatedly, add only twice. The recoding is simple:

1. On a transition from 0 to 1 , put $\overline{1}$ at the location of the 1 .
2. On a transition from 1 to 0 , put 1 instead of the 0 .
3. Put zeros at all the remaining locations. (i.e. skip groups of zeros and groups of ones.)

- It has a variable latency but, on average, the use of the Booth algorithm reduces the time delay.
- The worst case is $(01010101 \cdots=1 \overline{1} 1 \overline{1} 1 \overline{1} 1 \overline{1}) \Rightarrow \mathcal{O}(n)$ delay
- Sequential: using any of the methods mentioned so far.
- Parallel:

1. Simultaneous generation of all the partial products.
2. Parallel reduction of the partial products to two bit vectors.
3. A final Carry Propagate Adder (CPA).

- Iterative: not fully parallel and not fully sequential.

PP generation using ROMs


Implementation of $8 \times 8$ unsigned multiplication using four $256 \times 8$ ROMs, where each ROM performs $4 \times 4$ multiplication.

- For a parallel implementation, we want a fixed number of partial products. A smaller number is considered better.
- The original Booth algorithm recodes the number in a redundant format ( $\overline{1}, 0,1$ ) by scanning overlapped groups of 2 bits. The worst case is $n$ PPs.
- If we scan 2 new bits in each group and use the set $\{\overline{2}, \overline{1}, 0,1,2\}$ we get almost $\frac{n}{2}$ PPs.

| Original bits |  |  |  |  |  | Booth recoding |  |  | Value |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $y_{j+1}$ | $y_{j}$ | $y_{j-1}$ | $y_{j+1}$ | $y_{j}$ | $y_{j-1}$ |  |  |  |  |
| 0 | 0 | 0 | 0 | 0 | 0 | +0 |  |  |  |
| 0 | 0 | 1 | 0 | 1 | 0 | +1 |  |  |  |
| 0 | 1 | 0 | 1 | 1 | 0 | +1 |  |  |  |
| 0 | 1 | 1 | 1 | 0 | 0 | +2 |  |  |  |
| 1 | 0 | 0 | 1 | 0 | 0 | -2 |  |  |  |
| 1 | 0 | 1 | 1 | 1 | 0 | -1 |  |  |  |
| 1 | 1 | 0 | 0 | 1 | 0 | -1 |  |  |  |
| 1 | 1 | 1 | 0 | 0 | 0 | -0 |  |  |  |

$y_{j-1}=1$ indicates that it is a string of ones.

Let us think that we are converting from a non-redundant representation to a redundant one.

For each group of two (new) bits we generate a value and a possible carry into the next higher group.

| Original bits |  |  |  |  |
| :---: | :---: | :---: | :---: | ---: |
| $y_{j+1}$ | $y_{j}$ | $c_{\text {in }}$ | $c_{\text {out }}$ | value |
| 0 | 0 | 0 | 0 | +0 |
| 0 | 0 | 1 | 0 | +1 |
| 0 | 1 | 0 | 0 | +1 |
| 0 | 1 | 1 | 0 | +2 |
| 1 | 0 | 0 | 1 | -2 |
| 1 | 0 | 1 | 1 | -1 |
| 1 | 1 | 0 | 1 | -1 |
| 1 | 1 | 1 | 1 | -0 |

We choose $c_{\text {out }}$ and the value in this particular manner to make $c_{o u t}=y_{j+1}$ and hence reduce the logic gates needed to generate it.

The values chosen are also easy to generate by a simple shifting of $X$ or $-X$.

8/15

## The price of Booth 2

- In a direct multiplier, $P P_{j}=X y_{j}=\sum_{i=0}^{i=n-1} x_{i} y_{j} 2^{i}$. An array of AND gates is enough.
- In a Booth 2 multiplier, some time is used for recoding and for selecting the correct multiplicand multiple.

We have less PPs but we spend some time, area, and power:

1. to recode the bit string into the redundant form,
2. to select the correct multiple of the multiplicand using a multiplexer whose inputs are $0, X, 2 X$, and their negatives, and
3. to sign extend the PPs since some of them might be negative although we are sure that for unsigned numbers the final product is positive.

In the modified version, we produce $\left(\frac{n}{2}+1\right)$ PPs
Algorithm for unsigned numbers:

1. Pad the $\angle S B$ with one 0 .
2. Pad the MSB with two 0 if $n$ is even and one 0 if $n$ is odd.
3. Divide the multiplier into overlapping groups of 3-bits.
4. Determine the scale factor from the recoding table.
5. Select the multiplicand multiples.
6. Sum the partial products.

## Signed multiplication

If only the multiplicand is signed and represented in 2's complement the algorithm works fine. However, for a signed 2's complement multiplier we need yet another modification:

1. Pad the $L S B$ with one 0 .
2. If $n$ is even do not pad the MSB ( $\frac{n}{2}$ PPs) and if $n$ is odd pad the MSB with one $0\left(\frac{n+1}{2}\right.$ PPs).
3. Divide the multiplier into overlapping groups of 3-bits.
4. Determine the scale factor from the recoding table.
5. Select the multiplicand multiples.
6. Sum the partial products.

- To get $-X$ or $-2 X$ invert each bit and add one at the $L S B$.
- If we use Booth 2 as just described, we need this inversion only if $y_{j+1}=1$ (note that $-0=111 \cdots 111+1$ ).
- We are sure that the $L S B$ location is empty in the lower PPs.

| $a_{9}$ | $a_{9}$ | $a_{9}$ | $a_{9}$ | $a_{9}$ | $a_{9}$ | $a_{9}$ | $a_{8}$ | $a_{7}$ | $a_{6}$ | $a_{5}$ | $a_{4}$ | $a_{3}$ | $a_{2}$ | $a_{1}$ | $a_{0}$ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| $b_{9}$ | $b_{9}$ | $b_{9}$ | $b_{9}$ | $b_{9}$ | $b_{8}$ | $b_{7}$ | $b_{6}$ | $b_{5}$ | $b_{4}$ | $b_{3}$ | $b_{2}$ | $b_{1}$ | $b_{0}$ |  | $y_{1}$ |
| $c_{9}$ | $c_{9}$ | $c_{9}$ | $c_{8}$ | $c_{7}$ | $c_{6}$ | $c_{5}$ | $c_{4}$ | $c_{3}$ | $c_{2}$ | $c_{1}$ | $c_{0}$ |  | $y_{3}$ |  |  |
| $d_{9}$ | $d_{8}$ | $d_{7}$ | $d_{6}$ | $d_{5}$ | $d_{4}$ | $d_{3}$ | $d_{2}$ | $d_{1}$ | $d_{0}$ |  | $y_{5}$ |  |  |  |  |
| $e_{7}$ | $e_{6}$ | $e_{5}$ | $e_{4}$ | $e_{3}$ | $e_{2}$ | $e_{1}$ | $e_{0}$ |  | $y_{7}$ |  |  |  |  |  |  |
| $p_{15}$ | $p_{14}$ | $p_{13}$ | $p_{12}$ | $p_{11}$ | $p_{10}$ | $p_{9}$ | $p_{8}$ | $p_{7}$ | $p_{6}$ | $p_{5}$ | $p_{4}$ | $p_{3}$ | $p_{2}$ | $p_{1}$ | $p_{0}$ |

12/15
Since $(s s s \cdots s s s) \bmod _{2^{n}}=(111 \cdots 111+\bar{s}) \bmod _{2^{n}}$ then we can reduce the needed summation to

## PP generation conclusions

- The PPs are independent and it is possible to generate them al in parallel.
- Reducing the number of PPs decreases the cost of their summation but increases that of their generation.

