## Computer Arithmetic

Consider a computer that uses 20-bit floating point numbers of the form

$$(-1)^s \times 0.m \times 2^{\,n - b},$$

with a 1-bit sign indicator $s$, a 7-bit exponent $n$, and a 12-bit mantissa $m$, stored as binary numbers. The most significant bit of the mantissa must be 1. $b$ is a bias subtracted from $n$ to represent both positive and negative exponents.

Note that $s = 0$ for positive numbers and $s = 1$ for negative numbers, and the maximum value of the 7-bit exponent is $n = 1111111_2$, i.e. $2^7 - 1 = 127$.

The length of the exponent controls the range of numbers that can be represented. However, to ensure that numbers with small magnitude can be represented as accurately as numbers with large magnitude, we subtract the bias $b = 63$ from the exponent $n$. Thus, the effective range of the exponent is not $[0, 127]$ but $[-63, 64]$.

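The minimum value of the mantissa $0.m$ is $0.100000000000_2 = 0.5$ and its maximum value is $0.111111111111_2 = 1 - 2^{-12} \approx 0.9998$. Thus, $0.5 \le 0.m < 1$.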

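The absolute value of the largest floating point number that can be stored in the computer is $(1 - 2^{-12}) \times 2^{64} \approx 1.8 \times 10^{19}$. Computations involving larger numbers, e.g. the square of this value, produce an overflow error.

The smallest absolute number that can be stored is $0.5 \times 2^{-63} \approx 5.4 \times 10^{-20}$. Similarly, computations involving smaller numbers, e.g. the square of this value, produce an underflow error.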
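The two limits can be checked numerically. The following is a minimal sketch, assuming the bias of 63 and the $0.m$ mantissa convention implied by the worked example in this section; the variable names are illustrative only.

```python
# Range limits of the 20-bit format: 1 sign bit, 7 exponent bits, 12 mantissa bits.
# Assumption: the bias is 63 and the mantissa is read as the binary fraction 0.m.
EXP_BITS = 7
MANT_BITS = 12
BIAS = 63

n_max = 2**EXP_BITS - 1        # largest stored exponent, 1111111 in binary
m_min = 2**(MANT_BITS - 1)     # 100000000000: only the leading bit is set
m_max = 2**MANT_BITS - 1       # 111111111111: every bit is set

# Value = 0.m * 2**(n - BIAS), with 0.m = m / 2**MANT_BITS
largest = (m_max / 2**MANT_BITS) * 2.0 ** (n_max - BIAS)
smallest = (m_min / 2**MANT_BITS) * 2.0 ** (0 - BIAS)

print(f"largest  = {largest:.4e}")
print(f"smallest = {smallest:.4e}")
```

Both results fit comfortably in a standard double, so Python can represent them exactly even though the 20-bit machine could not go beyond them.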

Consider the number represented by

| Sign | Exponent | Mantissa |
|------|----------|----------|
| 0 | 1001001 | 110100010011 |

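that is, $(-1)^0 \times 0.110100010011_2 \times 2^{\,1001001_2 - 63}$.

The sign indicator is 0, i.e. the number is positive.

The exponent is $n = 1001001_2 = 2^6 + 2^3 + 2^0 = 73$, so the effective exponent is $n - 63 = 10$, i.e. the scale factor is $2^{10} = 1024$.

The mantissa gives $0.m = 2^{-1} + 2^{-2} + 2^{-4} + 2^{-8} + 2^{-11} + 2^{-12} = 3347/4096 \approx 0.8171$.

So, the machine number represents $\tfrac{3347}{4096} \times 2^{10} = 836.75$.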
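The decoding steps above can be sketched in a few lines of Python. The function name `decode` and its string-based interface are hypothetical, chosen only for illustration, and the default bias of 63 is the value inferred from this example.

```python
def decode(bits: str, bias: int = 63) -> float:
    """Decode a 20-character bit string: 1 sign bit, 7 exponent bits, 12 mantissa bits.

    Hypothetical helper; assumes value = (-1)**s * 0.m * 2**(n - bias).
    """
    if len(bits) != 20 or set(bits) - {"0", "1"}:
        raise ValueError("expected a string of exactly 20 binary digits")
    sign = -1.0 if bits[0] == "1" else 1.0
    n = int(bits[1:8], 2)    # 7-bit stored exponent
    m = int(bits[8:], 2)     # 12-bit mantissa read as an integer
    return sign * (m / 2**12) * 2.0 ** (n - bias)


value = decode("0" + "1001001" + "110100010011")
print(value)  # 836.75
```

Reading the mantissa as an integer and dividing by $2^{12}$ is equivalent to summing the negative powers of two by hand.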

The next floating point number that we can store in this machine is

| Sign | Exponent | Mantissa |
|------|----------|----------|
| 0 | 1001001 | 110100010100 |

The sign and the exponent remain unchanged and we simply add 1 to the least significant bit of the mantissa. The new number is 837, so our primitive computer would be unable to store exactly any number between 836.75 and 837, leading to a relative uncertainty equal to $0.25 / 836.75 \approx 3 \times 10^{-4}$.

At worst, the relative uncertainty in the value of floating point numbers that this primitive computer can store is equal to $2^{-12} / 2^{-1} = 2^{-11} \approx 5 \times 10^{-4}$, which occurs when the mantissa takes its smallest value, $0.100000000000_2$.
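Both relative gaps can be reproduced with a short calculation. This is a sketch under the same assumptions as the example (bias 63, mantissa read as $0.m$); the helper `fp_value` is hypothetical.

```python
BIAS = 63          # assumed bias, inferred from the worked example
MANT_BITS = 12

def fp_value(n: int, m: int) -> float:
    """Value of a positive machine number with stored exponent n and integer mantissa m."""
    return (m / 2**MANT_BITS) * 2.0 ** (n - BIAS)

# Gap between the example number and its successor (mantissa + 1)
x = fp_value(73, 0b110100010011)
x_next = fp_value(73, 0b110100010100)
rel_gap = (x_next - x) / x

# Worst case: the smallest admissible mantissa, 100000000000
lo = fp_value(73, 0b100000000000)
lo_next = fp_value(73, 0b100000000001)
worst_gap = (lo_next - lo) / lo

print(x, x_next, rel_gap)
print(worst_gap, 2.0**-11)
```

The exponent cancels in the ratio, so the worst-case gap is the same for every exponent, not only for $n = 73$.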

Suppose that we perform a calculation whose exact answer needs more than 12 mantissa bits and therefore cannot be stored exactly.

There are two ways to approximate this:

1. Rounding: the most accurate approach is to round to the nearest floating point number.

2. Chopping: many computers simply chop off the expression at the bit length of the mantissa and ignore the extra digits.
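The difference between the two approaches can be illustrated with a small sketch. It relies on `math.frexp`, which returns a mantissa in $[0.5, 1)$ and so matches the $0.m$ convention used here. The function names and the test value 836.9 are hypothetical, and the limited exponent range of the 20-bit machine is ignored.

```python
import math

MANT_BITS = 12

def chop(x: float) -> float:
    """Keep the first 12 mantissa bits and discard the rest (truncation)."""
    f, e = math.frexp(x)                  # x = f * 2**e with 0.5 <= |f| < 1
    kept = math.trunc(f * 2**MANT_BITS)   # drop the bits beyond the 12th
    return math.ldexp(kept, e - MANT_BITS)

def round_nearest(x: float) -> float:
    """Round to the nearest number with a 12-bit mantissa (ties go to even)."""
    f, e = math.frexp(x)
    kept = round(f * 2**MANT_BITS)        # Python's round() uses round-half-to-even
    return math.ldexp(kept, e - MANT_BITS)

x = 836.9   # hypothetical value lying between the machine numbers 836.75 and 837
print(chop(x))           # 836.75
print(round_nearest(x))  # 837.0
```

Chopping always moves toward zero, so its error can approach a full unit in the last mantissa bit, whereas rounding to nearest keeps the error within half a unit.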