
Fixed point math: The Official Theory Guide, IGAD 2008

By Jacco Bikker

A number can be represented in fixed point format as [X:Y], where X is the number of bits available for the whole number and Y is the number of fractional bits. The whole number may include a sign bit, in which case X - 1 bits are available for the (positive or negative) whole number. Using the sign bit does not affect the fractional part, since this cannot be negative.

The range of a fixed point value is the maximum (rounded) value that you can store in the bits reserved for the whole number. The precision of a fixed point value is the smallest number that you can store in the fractional bits. Examples:

[16:16] signed fixed point lets you store -32768 .. 32767. The number of (decimal) digits you can use is thus smaller than 5 (because 40000 doesn't fit) but larger than 4.

[22:10] unsigned fixed point lets you store a value of 1/1024, but not 1/2048. The precision (in decimal digits) is thus ~3.

Effect of common operations on range and precision:

[X:Y] + [X:Y] = [X + 1:Y], because 1023 takes 10 bits, but 1023 + 1023 requires 11 bits.

[X:Y] - [X:Y] = [X + 1:Y], because -1023 takes 10 bits, but -1023 - 1023 requires 11 bits. Note that this may differ if you know that the result cannot exceed a certain number of bits (e.g., one of the operands actually is [X - 1:Y]).

[X:Y] * [Z:W] = [X + Z:Y + W]. Example: multiplying a [16:16] fixed point number by another [16:16] fixed point number yields a [32:32] result (if you don't do anything to prevent this).

[X:Y] / [Z:W] = [X - Z:Y - W]. Note that if X = Z, the division will basically give you a [0:0] number, which is probably not the desired result, unless you are only interested in the sign bit (for signed operations).

Implementing multiplication using fixed point: In general, neither the source operands, nor any intermediate result, nor the final result may exceed the available number of bits. If the source operands are [16:16] and the available number of bits is 32, this requires manual intervention. The most straightforward way is to use a 64-bit intermediate result: [16:16] * [16:16] = [32:32]. Note however that the stored value is now 2^16 times too large, since it carries 32 fractional bits instead of 16. Bit-shifting this value to the right by 16 gives the desired result, [16:16], which fits in 32 bits (provided the actual range fits in 16 bits) and has maximal precision. The use of an intermediate 64-bit value is often a CPU-intensive option; often, the calculation may be performed in 32 bits instead.
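To make the representation concrete, here is a minimal C sketch of a [16:16] type with conversions and the 64-bit-intermediate multiply. The helper names (fixed, fx_mul, and so on) are illustrative, not part of this guide:

    #include <stdint.h>

    typedef int32_t fixed;            /* signed [16:16] fixed point value */

    #define FX_SHIFT 16               /* number of fractional bits */
    #define FX_ONE   (1 << FX_SHIFT)  /* 1.0 in [16:16] */

    /* conversions between plain numbers and [16:16] */
    static fixed fx_from_int(int i)     { return (fixed)(i * FX_ONE); }
    static int   fx_to_int(fixed f)     { return (int)(f >> FX_SHIFT); }
    static fixed fx_from_float(float v) { return (fixed)(v * FX_ONE); }

    /* [16:16] * [16:16] = [32:32] in a 64-bit intermediate; shifting
       right by FX_SHIFT brings the stored value back to [16:16]. */
    static fixed fx_mul(fixed a, fixed b)
    {
        return (fixed)(((int64_t)a * (int64_t)b) >> FX_SHIFT);
    }

For example, fx_mul(fx_from_float(1.5f), fx_from_int(2)) yields the raw value 3 << 16, i.e. 3.0 in [16:16] form.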

Multiplication without the use of a 64-bit intermediate value: If the operands (stored as [16:16]) are actually [8:16] values (i.e., the range is not fully used), a decent approximation can be obtained by pre-shifting the operands: [8:16] >> 8 = [8:8]; [8:8] * [8:8] = [16:16]. Similarly, actual operands using [10:16] may be multiplied as follows: [10:16] >> 10 = [10:6]; [10:6] * [10:6] = [20:12], which fits in 32 bits and can be shifted left by 4 to obtain a [16:16] result (provided the actual range fits in 16 bits). Note that the precision of the input parameters is significantly reduced. Also note that the two parameters often do not have the same range and precision requirements, in which case careful tuning is required.

Division using fixed point: Similar to multiplication, division can be performed using a 64-bit intermediate value: [16:16] << 16 = [16:32] (or, using 64 bits, [32:32], where the range uses only 16 bits); then [32:32] / [16:16] = [16:16].

Division without the use of a 64-bit intermediate value: By dividing by a smaller number, the problem of a [0:0] result can be prevented. Example: [16:16] / [8:8] = [8:8]. The final result can then be shifted to the left to get an [8:16] answer. In general, divisions require very careful analysis of the range and precision requirements of both operands (and the result!). A sketch of these multiplication and division variants follows below.

Square root, powers, trigonometry: One should in general not attempt to implement these functions in fixed point. For square root, excellent (efficient) approximations can be found on the internet, often more efficient than the corresponding floating point version. For other complex operations, you can revert temporarily to floating point (and then convert the result back to fixed point), or you can use look-up tables (extended with interpolation, if needed).
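As referenced above, here is a minimal C sketch of these variants, continuing the hypothetical helpers from the previous listing (fx_mul32 assumes its operands actually fit [8:16]; fx_div and fx_div32 assume the divisor is large enough not to become zero after pre-shifting):

    /* 32-bit-only multiply for [16:16] operands whose actual range fits
       [8:16]: pre-shift both to [8:8], then [8:8] * [8:8] = [16:16].
       This sacrifices 8 bits of input precision. */
    static fixed fx_mul32(fixed a, fixed b)
    {
        return (a >> 8) * (b >> 8);
    }

    /* Division with a 64-bit intermediate: pre-shift the numerator so
       that [32:32] / [16:16] = [16:16]. */
    static fixed fx_div(fixed a, fixed b)
    {
        return (fixed)(((int64_t)a << FX_SHIFT) / b);
    }

    /* Division without a 64-bit intermediate: reduce the divisor to
       [8:8], so [16:16] / [8:8] = [8:8], then shift the quotient left
       to obtain an [8:16] answer. */
    static fixed fx_div32(fixed a, fixed b)
    {
        return (a / (b >> 8)) << 8;
    }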

A note on precision: Pre-shifting for calculations involves rounding: [16:16] >> 8 = [16:8], but the lowest 8 bits are simply dropped. This is similar to decimal rounding by simply scrapping digits; a better approximation is obtained by doing proper rounding. In decimal, we want 10.4 to round to 10 and 10.6 to round to 11. Truncation gives (int)10.6 = 10, but (int)(10.6 + 0.5) = (int)11.1 = 11, so adding 0.5 prior to a hard truncation yields the correctly rounded result. In binary: [110011:0111] >> 4 = 110011; [110011:1111] >> 4 = 110011, but 110100 would have been more accurate. Just like for the decimal numbers, we therefore add 0.5 prior to rounding:

[110011:1111] + [000000:1000] = [110100:0111] (use the Windows calculator to verify); shifting this to the right by 4 now yields 110100.

The 0.5 value in fixed point arithmetic depends on the number of fractional bits. In the above example, 4 bits were used for the fractional part, so a maximum value of 2^4 - 1 = 15 can be stored; 0.5 is then 2^3, which is 8 in this case. Adjusting values before rounding effectively doubles precision (it halves the maximum rounding error), which can be significant in some situations. Note however that extra operations are required; in practice, this is done only in situations where every last bit counts (a code sketch of this rounded shift follows after the exercises below). And that's probably the essence of fixed point arithmetic: bits count.

----------------------------------------------------------------------

Practice: Write down the best precision and range trade-off for the operands, intermediate values and final result in the following calculations. Assume that the operands are real numbers in the specified ranges.

A + B, where A = { 0 .. 513 } and B = { -100 .. 100 }
A * B, where A = { 0 .. 0.5 } and B = { 0 .. 5.5 }
1 / B, where B = { 0 .. infinity }
A / B, where A = { -0.5 .. 0.5 } and B = { 0 .. infinity }
(A * A) + (A * B), where A = { -1000 .. 1000 } and B = { 0 .. 3 }
(A + B) / (A - B), where A, B = { -80000 .. 80000 }
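As referenced above, a minimal sketch of a right shift with rounding, in the same hypothetical C helpers (assumes n >= 1 and an arithmetic right shift for signed values):

    /* Shift right by n fractional bits with rounding: add 0.5 in the
       source format, i.e. 1 << (n - 1), before truncating. */
    static fixed fx_round_shift(fixed x, int n)
    {
        return (x + (1 << (n - 1))) >> n;
    }

Applied to the binary example above (n = 4), this adds [000000:1000] to [110011:1111] and truncates, yielding 110100 rather than 110011.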
