Thursday, September 27, 2007

Introduction To Fixed Point Math

= Introduction to Fixed Point Math =
9/14/2004

Back in the olden days of dragons and sorcerers, before floating point coprocessors, computers could only directly manipulate whole numbers (integers). Things like decimal points were newfangled technologies only used by crazy mathematicians with slide rules and pen and paper. To overcome this deficiency, crufty old-school programmers, many of whom today are over 30 years old, had to resort to various hacks and tricks in order to represent wacky numerical concepts like "0.5". The most common technique was fixed point math, which is the topic of this article.

== Preamble ==

I'm eventually, maybe, going to write a bunch of stuff here about real numbers and integers and number theory and domains and what not, but right now that doesn't interest me so I'm going to dodge the issue and do a quick summary for those late to class:

* computers represent integers and real numbers differently
* not all computers can perform math on real numbers quickly
* computers that do perform math on real numbers typically use a floating point representation which dynamically adjusts range and precision depending on the inputs to a particular operation.
* converting from a floating point value to an integer value is often very slow and always inexact

Hope you're caught up. Next.

== Fixed Point Representation ==

Given that there are processors out there that don't support floating point numbers, how do we represent something like "0.5" on such beasts? This is where fixed point math comes into play. As the name implies, a fixed point number places the "decimal" point between the whole and fractional parts of a number at a fixed location, providing f bits of fractional precision. For example, an 8.24 fixed point number has an eight bit integer portion and twenty four bits of fractional precision. Since the split between the whole and fractional portion is fixed, we know exactly what our range and precision will be.

So let's think about this for a second. Using a 16.16 fixed point format (which, handily enough, fits in a typical 32-bit integer in C), the high 16-bits represent the whole part, and the low 16-bits represent the fractional portion:

{{{
0xWWWWFFFF
}}}

With 16-bits in the whole part, we can represent 2^16^ (65536) discrete values (0..65535 unsigned or -32768 to +32767 signed). The fractional part gives us, again, 2^16^ steps to represent the values from 0 to almost 1 -- specifically, 0 to 65535/65536 or approximately 0.99999. Our fractional resolution is 1/65536, or about 0.000015 -- that's pretty good resolution as well.

Let's look at a concrete example: the 16.16 fixed point value {{{0x00104ACF}}} approximately equals the decimal value 16.29222. The high sixteen bits are 0x10 (16 decimal) and the low 16-bits are 0x4ACF, so 0x4ACF/65536.0 ~= 0.29222. The simple method to convert from a fixed point value of ''f'' fractional bits is to divide by 2^f^ so:

http://www.bookofhook.com/Images/Fixed_Point_Images/fixrep_1.gif

However, we have to be careful since these integers are stored in 2's complement form. The value -16.29222 is 0xFFEFB531 -- notice that the fractional bits (0xB531) are not the same as for the positive value. So you can't just grab the fractional part by masking off the fractional bits -- sign matters.
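
If you want to sanity check those two values yourself, dividing the raw integer by 2^16^ does the trick. Here's a minimal sketch (the {{{int32_t}}} type and the constants are just the examples above; the cast of the negative constant relies on the usual 2's complement wrap):

{{{
#include <stdint.h>
#include <stdio.h>

int main( void )
{
    int32_t pos = 0x00104ACF;                /* ~ +16.29222 in 16.16                  */
    int32_t neg = ( int32_t ) 0xFFEFB531;    /* ~ -16.29222, the 2's complement of pos */

    /* Recover the decimal values by dividing by 2^16. */
    printf( "%f %f\n", pos / 65536.0, neg / 65536.0 );
    return 0;
}
}}}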

== 2's Complement Summary ==

If you're reading this document there's a chance that you're not familiar with the concept of 2's complement arithmetic, which is fairly vital to understanding fixed point math. Fundamentally 2's complement solves the problem of representing negative numbers in binary form. Given a 32-bit integer, how can we represent negative numbers?

The obvious method is to incorporate a sign bit, which is precisely what floating point numbers in IEEE-754 format do. This signed magnitude form reserves one bit to represent positive/negative, and the rest of the bits represent the number's magnitude. Unfortunately there are a couple problems with this.

1. Zero has two representations, positive (0x00000000) and negative (0x80000000), making comparisons cumbersome. It's also wasteful since two bit configurations represent the same value.
2. Adding a positive integer to a negative integer requires extra logic -- a straight binary add gives the wrong answer. For example, -1 (0x80000001 in 32-bit sign-magnitude form) summed with +1 (0x00000001) yields 0x80000002, which is -2 in sign-magnitude form instead of the correct answer of 0.

Some really smart guys came up with a much better system a long time ago called 2's complement arithmetic. In this system positive numbers are represented as you'd expect; negative numbers are represented by the complement of the corresponding positive value, plus one. That is, take all the bits of the positive representation, invert them, then add one.

So to encode -1 we take {{{0x00000001}}}, invert (complement) the bits to get {{{0xFFFFFFFE}}}, and then add 1, giving us {{{0xFFFFFFFF}}} -- which you may recognize as the hex form for -1. This gives us a range of +/- 2^(n-1)^ where n is the number of storage bits.

2's complement solves a host of problems.

1. Zero has a single representation, simplifying comparisons and removing the wasted value.
2. Addition and subtraction just work. A binary addition of 2's complement values will give the correct result regardless of sign. -1 +1 = 0 ({{{0xFFFFFFFF + 0x00000001 = 0x00000000}}}); -1 + -1 gives us -2 ({{{0xFFFFFFFF + 0xFFFFFFFF = 0xFFFFFFFE}}}); and 1 + 1 = 2 ({{{0x00000001 + 0x00000001 = 0x00000002}}}).

But...it introduces a couple unintuitive bits. Decoding a negative number takes a little bit of thought (you have to do the complement and add one), and the magnitude of a negative number is inversely proportional to the magnitude of its binary representation: -1 = {{{0xFFFFFFFF}}} but -2147483648 = {{{0x80000000}}}. This has at least one important ramification when dealing with fixed point -- the fractional bits of a fixed point number are not interpreted the same for positive and negative values.

Anyway, that's a quick refresher course. If you really need to know more about 2's complement, Google is your friend or, if you're feeling gung-ho, find a good book on discrete math or numerical analysis.

== Range and Precision Trade Offs ==

The whole trick with fixed point is that you're sacrificing some range in exchange for precision. In this case, instead of a 32-bit integer you have a 16-bit integer space with 16-bits of fractional precision.

Different operations often require different amounts of range and precision. A format like 16.16 is good for many different things, but depending on the specific problem you may need highly precise formats such as 2.30, or longer range formats such as 28.4. For example, we know that the sine function only returns values in the range -1.0 to +1.0, so representing sine values with a 16.16 is wasteful since 14-bits of the integer component will always be zero. Not only that, but the difference between sine(90) and sine(89.9) is a very small number, ~.0000015230867, which our 16-bits of fractional precision cannot represent accurately. In this particular case we'd be much better off using a 2.30 format.
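
As a rough sketch of what switching formats looks like in practice, repacking a value between two fixed point formats is just a shift by the difference in fractional bits (the type and function names below are made up for illustration, and sign-extending right shifts on signed integers are assumed):

{{{
#include <stdint.h>

typedef int32_t fixed16_16_t;   /* 16 fractional bits                        */
typedef int32_t fixed2_30_t;    /* 30 fractional bits, good for sine/cosine  */

/* 16.16 -> 2.30: gains precision, but overflows if the magnitude is >= 2. */
fixed2_30_t fixed16_16_to_2_30( fixed16_16_t a ) { return a << 14; }

/* 2.30 -> 16.16: the bottom 14 bits of precision are simply discarded.    */
fixed16_16_t fixed2_30_to_16_16( fixed2_30_t a ) { return a >> 14; }
}}}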

Conversely, some calculations will generate very large numbers, sometimes unexpectedly. The area of a screen sized triangle can easily exceed 2^16^ (a triangle half the size of a 640x480 pixel screen covers 153600 pixels), so even though the individual coordinates of such a triangle might be well within what we'd intuitively believe to be acceptable range (e.g. (0,0),(639,0),(639,479)), the area formed can vastly outstrip our available integer bits.

This brings to light a serious problem with fixed point -- you can silently overflow or underflow without expecting it, especially when dealing with sums-of-products or power series. This is one of the most common sources of bugs when working on a fixed point code base: some set of values that you expect to stay within range rapidly overflow when you least expect it.

A very common case is vector normalization, which has a sum-of-products during the vector length calculation. Normalizing a vector consists of dividing each of the vector's elements by the length of the vector. The length of a vector is:

http://bookofhook.com/Images/Fixed_Point_Images/vector_length.gif

One might believe that the vector (125,125,125) can be easily normalized using a 16.16 signed fixed point format (-32K to +32K range)...right? Unfortunately, that's not the case. The problem is that the sum of those squares is 46875 -- greater than our maximum value of +2^15^-1 (32767). After the square root it would drop down to a very manageable value (~216.51), but by then the damage is done. The length calculation will fail, and thus the normalization will fail. The scary thing is that the vector normalization will work correctly with a lot of values but not with others: the vector (100,100,100) will normalize just fine, but the vector (40,40,180) will not, and you can't tell at a glance which is fine and which isn't. Fixed point math is highly input sensitive.
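
One way to see the failure point is to do the sum-of-squares with a 64-bit accumulator and check whether the result still fits before moving on to the square root. This is only a sketch under that assumption -- the square root itself is omitted and the names are made up:

{{{
#include <stdint.h>

typedef int32_t fixed16_16_t;

/* Returns 1 and writes the squared length (16.16) if it fits, 0 on overflow. */
int
length_squared_16_16( fixed16_16_t x, fixed16_16_t y, fixed16_16_t z, fixed16_16_t *out )
{
    /* Each product is scaled by 2^32; the 64-bit sum cannot overflow here. */
    long long sum = ( long long ) x * x + ( long long ) y * y + ( long long ) z * z;

    sum >>= 16;                                  /* back to 2^16 scale                 */
    if ( sum > INT32_MAX || sum < INT32_MIN )
        return 0;                                /* (125,125,125) lands here: 46875.0  */
    *out = ( fixed16_16_t ) sum;
    return 1;
}
}}}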

This is why choosing your range and precision is very important, and putting in a lot of range and overflow sanity checks is extremely important if you want robust software.

== Conversion ==

Converting to and from fixed point is a simple affair if you're not too worried about rounding and details like that. A fixed point value is derived from a floating point value by multiplying by 2^f^. A fixed point value is generated from an integer value by shifting left ''f'' bits:

http://bookofhook.com/Images/Fixed_Point_Images/fixops_1.gif

{{{
F = fixed point value
i = integer value
r = floating point value
f = number of fractional bits for F
}}}

Converting back to floating point is trivial -- just perform the reverse of the float-to-fixed conversion:

http://bookofhook.com/Images/Fixed_Point_Images/fixops_2.gif
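
In C these conversions are one-liners. A rough sketch for the 16.16 case (the typedef just mirrors the convention used in the multiplication example later; the float-to-fixed version simply truncates, per the "not too worried about rounding" caveat above):

{{{
#include <stdint.h>

typedef int32_t fixed16_16_t;

fixed16_16_t float_to_fixed( double r )       { return ( fixed16_16_t ) ( r * 65536.0 ); }
fixed16_16_t int_to_fixed( int i )            { return ( fixed16_16_t ) ( i << 16 ); }
double       fixed_to_float( fixed16_16_t F ) { return F / 65536.0; }
}}}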

However, converting to an integer is a bit trickier due to rounding rules and the 2's complement encoding. In the world of floating point there are four common methods for converting a floating point value to an integer:

1. round-towards-zero
2. round-towards-positive-infinity
3. round-towards-negative-infinity
4. round-to-nearest

(Note: there are other rounding methods as well, such as round-to-even, round-to-odd, round-away-from-zero, etc. but those are rarer and I don't feel like devoting space to them). These are conceptually simpler in floating point than in fixed point -- for example, truncating a floating point value always rounds towards zero, but truncating the fractional bits of a fixed point number actually rounds towards negative infinity (due to the miracle of 2's complement, where the fractional bits move you away from 0 for positive numbers but towards 0 for negative numbers).

=== Round-Towards-Zero ===

Rounding towards zero "shrinks" a number so that it always gets closer to zero during the rounding operation. For a positive fixed point value a simple truncation works; a negative value, however, needs to have (1<<''f'')-1 added to it before the truncation. This is just "round towards negative-infinity if positive; round towards positive-infinity if negative".

=== Round-Towards-Negative-Infinity ===

Rounding towards negative-infinity is the simplest operation -- it's the truncation we talked about earlier. A positive number that loses its fractional part goes from A.B to A.0, and will always get closer to zero. Negative numbers are different: under 2's complement encoding the fractional bits raise a negative value towards zero, so removing them actually pushes the negative number away from zero (and thus towards negative infinity).

This can be somewhat confusing, but an example illustrates it. The fixed point value 0xFFFF9999 represents roughly -0.4. Truncating the fractional bits leaves 0xFFFF0000, or -1.0 -- we've rounded towards negative infinity. The fixed point value 0xFFFF0003 represents a value just above -1, so truncating the fractional bits will, again, give us a value of -1.

Note that we need to perform an arithmetic shift, also known as a sign-extended shift, to ensure that the sign bit is propagated correctly for negative numbers. Consider the 16.16 fixed point value for -1 ({{{0xFFFF0000}}}). If we do an unsigned shift to the right by 16 places we have 0x0000FFFF which would be 65535, which is incorrect. By sign extending the shift the result will be {{{0xFFFFFFFF}}}, which is -1 in 2's complement integer representation.

http://bookofhook.com/Images/Fixed_Point_Images/fixops_3.gif

=== Round-Towards-Positive-Infinity ===

Rounding towards positive infinity requires adding the largest value under one ((1<<''f'')-1) to the fixed point value before truncating (shifting right arithmetically) by ''f'' bits:

http://bookofhook.com/Images/Fixed_Point_Images/fixops_4.gif

=== Rounding-to-Nearest ===

Rounding to the nearest value, which is the behavior most of us associate with "rounding", is easy primarily because of the 2's complement encoding. By adding "one-half" (1<<(''f''-1)) before the shift we will always round to the nearest value. When a fixed point value is positive, this adds one-half away from 0; when a fixed point value is negative, this adds one-half towards 0 -- either way it adds +0.5 before the truncation, which is exactly what round-to-nearest needs.

http://bookofhook.com/Images/Fixed_Point_Images/fixops_5.gif

=== Rounding Summary ===

The following table lists some example values and how they are rounded with the different rounding modes:
{{{
Input Value   Round to             Output
   -1.2       zero                   -1
   -1.2       nearest                -1
   -1.2       positive infinity      -1
   -1.2       negative infinity      -2
   -1.7       zero                   -1
   -1.7       nearest                -2
   -1.7       positive infinity      -1
   -1.7       negative infinity      -2
    1.2       zero                    1
    1.2       nearest                 1
    1.2       positive infinity       2
    1.2       negative infinity       1
    1.7       zero                    1
    1.7       nearest                 2
    1.7       positive infinity       2
    1.7       negative infinity       1
}}}
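
Putting the four modes together for a signed 16.16 value, here's a rough sketch. It assumes your compiler performs arithmetic (sign-extending) right shifts on signed integers, and that the pre-adds don't push a value past the top of the 16.16 range:

{{{
#include <stdint.h>

typedef int32_t fixed16_16_t;

int round_to_neg_inf( fixed16_16_t a ) { return a >> 16; }                /* truncate          */
int round_to_pos_inf( fixed16_16_t a ) { return ( a + 0xFFFF ) >> 16; }   /* add (1<<16)-1     */
int round_to_nearest( fixed16_16_t a ) { return ( a + 0x8000 ) >> 16; }   /* add one-half      */

int round_to_zero( fixed16_16_t a )
{
    /* Negative values get the round-towards-positive-infinity treatment. */
    return ( a >= 0 ) ? ( a >> 16 ) : ( ( a + 0xFFFF ) >> 16 );
}
}}}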

Which rounding mode you use is entirely up to you and your application, just make sure you understand the ramifications of each mode.

== Numerical Operations ==

=== Addition and Subtraction ===

Adding and subtracting two fixed point numbers of the same scale is simplicity itself -- just perform a standard integer addition or subtraction. It just works -- mathematically this is obvious since:

http://bookofhook.com/Images/Fixed_Point_Images/fixops_6.gif

If the two numbers are operating in different scales, then you will have to shift one or the other (or both) so that their "decimal points" line up. How you do this depends entirely on the final output desired. For example, adding a 16.16 to a 28.4 could be accomplished by shifting the 16.16 left by twelve; or by shifting the 28.4 right by 12; or converting both to 22.10; or whatever. It's up to you. The important thing is that during the addition, they have to be in the same format and that the intermediate calculations (and final output form) are of the appropriate range and precision for your needs.
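
As a sketch, the "16.16 plus 28.4, answer in 16.16" case might look like this (made-up type names; note that shifting the 28.4 value up by twelve can itself overflow if that value is large):

{{{
#include <stdint.h>

typedef int32_t fixed16_16_t;   /* 16 fractional bits */
typedef int32_t fixed28_4_t;    /*  4 fractional bits */

fixed16_16_t
add_16_16_and_28_4( fixed16_16_t a, fixed28_4_t b )
{
    return a + ( b << 12 );     /* align the binary points: 4 + 12 = 16 fractional bits */
}
}}}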

=== Multiplication ===

Fixed point multiplication is a bit uglier, but not too bad -- multiply the two numbers, then shift down by some number of bits to get back into the output format you want. This can get confusing if you don't understand the basis for the multiplication operation, so I'm going to show that to you soon. But first, let's look at an intuitive example to get a feel for how fixed point multiplication works.

Let's assume we're multiplying two 16.16 values against each other, 0x00010000 and 0x00020000 (1.0 * 2.0). A straight integer multiplication will result in a 64-bit product:

http://bookofhook.com/Images/Fixed_Point_Images/fixops_7.gif

Doh! The product won't fit in a single 32-bit register, so we've already encountered our first problem: overflow. We can handle this by doing this piecewise (just like grade school long multiplication), or we can hope/pray for an architecture that supports 64-bit intermediate results from a multiplication. Thankfully, both the 32-bit x86 and newer ARM series of processors support 64-bit intermediates and most modern compilers have 64-bit types ({{{long long}}} in GCC and {{{__int64}}} in many PC-based compilers such as Microsoft Visual C++).

So we're okay -- our initial 16.16 * 16.16 product is > 32-bits, but we'll assume that our compiler can handle that for us so we can safely get back down to 32-bit land for our final result.

Besides the overflow we can see another problem -- the output isn't shifted "correctly". We know that 1.0 * 2.0 = 2.0, and thus we'd expect our final result to be {{{0x00020000}}} (in 16.16 fixed point). Simple examination shows us that shifting down by 16-bits will give us the result we anticipate. So a 16.16 * 16.16 operation is performed as:

{{{
fixed16_16_t
mul_16_16( fixed16_16_t a, fixed16_16_t b )
{
    long long r = ( long long ) a * ( long long ) b;

    return ( fixed16_16_t ) ( r >> 16 );
}
}}}

But why do we shift to the right by 16-bits? And what if we're doing something like multiplying a 24.8 by a 10.22?

Now we get into a bit more of the nitty gritty -- it's just basic algebra, so this shouldn't be too frightening, even if you suck at math like I do.

Multiplying two fixed point numbers of ''m'' and ''n'' fractional bits, respectively, gives us:

http://bookofhook.com/Images/Fixed_Point_Images/multiply_1.gif

{{{
A = first fixed point number
m = number of fractional bits for A
B = second fixed point number
n = number of fractional bits for B
}}}
Looking at the above we find that the product is scaled by the sum of ''m'' and ''n'', so the product is "overscaled". We can fix this by multiplying through by some value that will bring it back down into the range we desire. Given that we eventually want, say, ''f'' bits of precision (i.e. the result is scaled by ''f'' bits) in our result, we need to convert the 2^(m+n)^ term to 2^f^.

Multiplying through by:

http://bookofhook.com/Images/Fixed_Point_Images/multiply_4.gif

will scale things so that we have f fractional bits in our result since:

http://bookofhook.com/Images/Fixed_Point_Images/multiply_2.gif

So our final equation is:

http://bookofhook.com/Images/Fixed_Point_Images/multiply_3.gif

In plain English, that's just saying that if we want the product of two fixed point numbers, of ''m'' and ''n'' fractional bits respectively, to have ''f'' bits of precision, we multiply the two fixed point values then scale the result by 2^(f-(m+n))^.

Let's use our previous example of {{{0x10000 * 0x20000}}} (1.0 * 2.0 in 16.16 format) but this time with 24-bits of fractional precision in the result. We have ''m'' = 16, ''n'' = 16, and ''f'' = 24. The initial multiplication results in {{{0x200000000}}}, which we then multiply by 2^(24-(16+16))^ = 2^-8^ or 1/256. Of course, we'd just shift right by 8 places, which in the end gives us 0x02000000 which is 2.0 in 8.24 format.
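
As a sketch, generalizing the earlier multiplication routine to produce an 8.24 result (''m'' = ''n'' = 16, ''f'' = 24) only changes the final shift; a 64-bit intermediate is still assumed, and the type names are made up:

{{{
#include <stdint.h>

typedef int32_t fixed16_16_t;
typedef int32_t fixed8_24_t;

fixed8_24_t
mul_16_16_to_8_24( fixed16_16_t a, fixed16_16_t b )
{
    long long r = ( long long ) a * ( long long ) b;   /* scaled by 2^(16+16) = 2^32   */

    return ( fixed8_24_t ) ( r >> 8 );                 /* rescale by 2^(24-32) = 2^-8  */
}
}}}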

One thing to keep in mind is that along with loss of precision (any bits shifted out of the result are lost), we still have to contend with overflow in some situations. For example, if we had multiplied 0xFF0000 and 0x20000 (255.0 x 2.0) the result should be 510.0 in 8.24 format, or 0x1FE000000...which is larger than 32-bits, so it'll be truncated to 0xFE000000 (254.0) on systems that don't support 64-bit intermediate products. This is a silent error, and overflow bugs like this can be excruciating to hunt down.

One way to get around this is to prescale the two multiplicands down, avoiding the final shift, but in the process we lose precision in exchange for avoiding overflow. This is not the ideal way of dealing with insufficient intermediate bits -- instead a piecewise multiplication (which I might explain one day when I'm real bored) would be used to preserve accuracy at the cost of some performance.

The moral being: never assume you have enough bits, always make sure that your inputs and outputs have the precision and range you need and expect!

=== Division ===

Division is similar to multiplication, with a couple caveats. Let's look at an example to get an intuitive feel for the process. Assuming we have 2.0 and 1.0 again in 16.16 fixed point form: 0x20000 and 0x10000. If we divide the former by the latter, we get the value 2. Not 0x20000, but just 2. Which is correct, granted, but it's not in a 16.16 output form. We could take the result and shift it back by the number of bits we need, but...that's just gross. And wrong, since shifting after the integer divide means we lose precision before the shift. In other words, if A=0x10000 and B=0x20000 our final result would be 0, instead of 0x8000 (0x8000 = 0.5 in 16.16 fixed point, since 0x8000/65536.0 = 0.5). But if we do the shift first, we manage to retain our precision and also have a properly scaled result.

Well, that all makes sense, but we're assuming that A and B are of the same fractional precision, and what if they're not? Now it's time to look at the "real" math of a fixed point division. Again, this is grade school stuff:

http://bookofhook.com/Images/Fixed_Point_Images/div_1.gif

What this says is that if you take two fixed point numbers of m and n fractional bits, and divide them, the quotient will be a value with m-n fractional bits. With our previous example of a 16.16 divided by a 16.16, that means 0 fractional bits, which is precisely what we got. If we want to have f fractional bits, we must scale the numerator by 2^(f-(m-n))^. Since order of operations matters when performing integer operations, we want to scale the numerator before we do the divide, so our final equation looks like:

http://bookofhook.com/Images/Fixed_Point_Images/div_2.gif

Order of operations matters, so actual evaluation would use the middle form, where the numerator is fully expanded before division. The far right form is just to show what the final result will look like.

Overflow is still a concern, so we have to make sure that when we do the adjustment before the divide we don't run out of bits -- this is all too easy to do, unfortunately. For example, a 16.16 divided by a 16.16 needs its numerator prescaled by 2^16^, which overflows a 32-bit register for all but the smallest values. Again, modern CPUs can come to the rescue, this time by providing 64-bit / 32-bit division operations.
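
Here's a rough sketch of the 16.16 / 16.16 case (''m'' = ''n'' = ''f'' = 16), again leaning on a 64-bit intermediate so the prescale by 2^16^ doesn't overflow. There is no check for division by zero or for a quotient too large to fit back into 16.16:

{{{
#include <stdint.h>

typedef int32_t fixed16_16_t;

fixed16_16_t
div_16_16( fixed16_16_t a, fixed16_16_t b )
{
    long long n = ( ( long long ) a ) << 16;    /* prescale numerator by 2^(f-(m-n)) = 2^16 */

    return ( fixed16_16_t ) ( n / b );
}
}}}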

== Summary ==

Fixed point is a wonderful way to get around the limitations of systems with slow (or non-existent) floating point units. It also allows you to write very fast inner loops that can generate integer values without a costly float-to-int conversion. However, fixed point math can be very tricky due to range and precision issues. This article attempts to cover the basics of fixed point math so that at least a theoretical grounding is established before tackling the tricky implementation problems (which I might write an article about later).

Thanks to Tom Hubina and Jonathan Blow for commenting on and proof reading this article.


-------------------
http://trac.bookofhook.com/bookofhook/trac.cgi/wiki/IntroductionToFixedPointMath

Fixed-point math in C

Floating-point arithmetic can be expensive if you're using an integer-only processor. But floating-point values can be manipulated as integers, as a less expensive alternative.

One advantage of using a high-level language is the native support of floating-point math. This simplifies the task of programming and increases the accuracy of calculations. When used in a system that includes a hardware floating-point math unit, this can be a very powerful feature. However, if floating-point math is used on a microprocessor that supports only integer math, significant overhead can be incurred to emulate floating point, both in ROM size and in execution time. The alternative, used by assembly language programmers for years, is to use fixed-point math that is executed using integer functions. This article will discuss a method to implement fixed-point math in C.

Fixed-point math typically takes the form of a larger integer number, for instance 16 bits, where the most significant eight bits are the integer part and the least significant eight bits are the fractional part. Through the simple use of integer operations, the math can be efficiently performed with very little loss of accuracy. Unfortunately, C does not provide native support for fixed-point math. The method presented in this article is essentially the same as in assembly language and is described in the following steps.

Initialization
The first step in the fixed-point math algorithm is to initialize the system. This step defines the parameters used later in the development of the system. To illustrate each step, assume that your application requires a variable that ranges from 0 to 1 with a granularity of 0.01, another variable family that ranges from 0 to 100 with a granularity of 0.1, and a third variable family that ranges from -1,000 to 1,000 with a granularity of 0.01. The steps are performed as follows.

First, determine the maximum absolute value M that you wish to calculate for each class of fixed-point variable. The value of M for each example requirement is 1, 100, and 1,000, respectively.

Second, calculate the number of bits x required to store this number, such that 2^x >= M >= 2^(x-1).

If the number is to be signed, add 1 to x. For our example requirements, x is 1, 7, and 11, respectively.

Then, determine the granularity G that is required. The example requirements define the granularity and no further calculation is needed.

Next, calculate the number of bits y required to provide this granularity, such that 1/2^y <= G <= 1/2^(y-1). For our example requirements, this is 7 (1/128 = .0078125), 4 (1/16 = .0625), and 7.

Finally, the minimum number of bits required is x+y. Round this sum up to either 8, 16, or 32. This sum will be referred to later as the FULLSIZEINT. If this sum is greater than the maximum integer size in your system, then floating-point numbers are required and will probably be more efficient. The results for our example requirements for all the initial parameters are shown in Table 1. The resultant columns show the actual ranges and granularity available after the bit requirements are complete.

Table 1: Example initialization results

No.  Range           M     x   Resultant range  Granularity  y  Resultant granularity  Bits required
1    0 to 1          1     1   0 to 1           0.01         7  0.0078125               8
2    0 to 100        100   7   0 to 127         0.1          4  0.0625                 16
3    -1000 to 1000   1000  11  -1024 to 1023    0.01         7  0.0078125              32
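
The bit counts from these steps can also be computed programmatically. The following helper is only a sketch of the procedure described above (the function name is arbitrary, and values of M that are exact powers of two may need one extra integer bit):

#include <math.h>

void fixed_params(double M, double G, int is_signed, int *x, int *y)
{
    *x = (int)ceil(log2(M));          /* smallest x such that 2^x >= M        */
    if (*x < 1)
        *x = 1;                       /* always keep at least one integer bit */
    if (is_signed)
        *x += 1;                      /* add one bit for the sign             */
    *y = (int)ceil(log2(1.0 / G));    /* smallest y such that 1/2^y <= G      */
}

For the three example variable classes this returns (x, y) = (1, 7), (7, 4), and (11, 7), matching Table 1.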

Definition
The second step in the algorithm is to create the definitions for C that are required to implement the variables. This consists of typedefs and macros, with the possibility that a function may need to be developed.

Typedef
Definition of the fixed-point algorithm requires a typedef for each fixed-point variable type that will be used by the system. This typedef is a union with an embedded structure as follows. The structure assumes that the compiler assigns the bits in an integer from most significant to least significant. If this is not the case, reverse the declaration of the structure members integer and fraction in the structure. This typedef is not portable, and is best put in a unique header file:

typedef union FIXEDX_Ytag {
    FULLSIZEINT full;
    struct partX_Ytag {
        FULLSIZEINT integer  : x;
        FULLSIZEINT fraction : FULLSIZEINT-x;
    } part;
} FIXEDX_Y;

FULLSIZEINT is either long, int, short, or char and either signed or unsigned and cannot be longer than the maximum length integer available, or the structure will not work. If FULLSIZEINT is longer than this maximum, then floating point needs to be used.

Notice that the fractional portion of the structure is calculated as FULLSIZEINT-x, instead of using y as the size of the member. Since a complete integer is available, I have chosen to increase the granularity of the system in order to decrease the error of the calculations. In your application, you may wish to increase the integer member in order to provide a method to check for math overflow.

Macros
After the typedefs have been declared, macros need to be defined. Substitute the actual value for the equations using X and Y in each of the following macros.

Define a macro to convert a value into the fixed-point variable. A is the integer part and B is the decimal part, expressed as normal decimal. These values must be hard-coded constants so the preprocessor and compiler can completely resolve the macro into an integer. If either A or B are variables, then the macro will actually generate floating-point code and eliminate the savings of this algorithm. Check the results of the compile in the listing to make sure that the compiler functioned properly. (To do this, you must have the compiler interlace the assembly code with the C statements.)

#define FIXEDX_YCONST(A,B) (FULLSIZEINT)((A << Y) + ((B + 2^-(Y+1)) * 2^Y))
Define macros to perform multiplication and division of the fixed-point variables:

#define MULTX_Y(A,B) (FULLSIZEINT+1)(A.full * B.full + 2^(Y-1)) >> Y
#define DIVX_Y(A,B)  (FULLSIZEINT+1)((((A.full << (Y+1)) / B.full) + 1) / 2)
where FULLSIZEINT+1 is the next largest integer over X+Y. If FULLSIZEINT is the largest available integer, then either floating point must be used, or a subroutine is required for multiply and divide.

Listing 1 shows the definitions for one of our example requirements sets. In this file, I first define the integer typedefs so the code that is written is portable. Next, the typedefs and macros are defined.

Listing 1: Example typedef and macros

/* Range 0 - 1.9921875     */
/* Granularity 0.0078125   */

typedef union FIXED1_7tag {
    unsigned char full;
    struct part1_7tag {
        unsigned char fraction : 7;
        unsigned char integer  : 1;
    } part;
} FIXED1_7;

#define FIXED1_7CONST(A,B) (unsigned char)((A << 7) + ((B + 0.00390625) * 128))
#define MULT1_7(A,B)       ((unsigned short)(A.full * B.full + 64) >> 7)
#define DIV1_7(A,B)        ((unsigned short)(((A.full << 8) / B.full) + 1) / 2)

Listing 2 shows the multiply and divide routines used for the 32-bit example, assuming that a 64-bit integer is not available in the system, since that cannot be done in a macro. Notice that in the divide routine, I am using a floating-point number for the calculation. If your system requires division, the amount of memory used cannot be reduced this way, but the add, subtract, compare, and multiply routines will still execute faster than the comparable floating-point routines.

Listing 2: Multiply and divide for 32-bit variables

int MULT11_21(FIXED11_21 a, FIXED11_21 b)
{
    int temp, result;
    char sign = 0;

    if (a.full < 0)
    {
        sign = 1;
        a.full = -a.full;
    }
    if (b.full < 0)
    {
        sign ^= 1;
        b.full = -b.full;
    }

    /* Piecewise 16x16-bit partial products, each shifted back to 21 fractional bits with rounding. */
    result = (((a.full & 0x0000FFFF) * (b.full & 0x0000FFFF)) + 1048576) >> 21;
    result = result + ((((a.full >> 16) * (b.full & 0x0000FFFF)) + 16) >> 5);
    result = result + ((((b.full >> 16) * (a.full & 0x0000FFFF)) + 16) >> 5);
    temp   = (a.full >> 16) * (b.full >> 16);
    result = result + (temp << 11);
    return (sign ? -result : result);
}

int DIV11_21(FIXED11_21 a, FIXED11_21 b)
{
    double temp;
    FIXED11_21 result;
    unsigned char sign = 0;

    temp = (double)a.full / (double)b.full;
    if (temp < 0)
    {
        temp = -temp;
        sign = 1;
    }
    result.part.integer  = temp;
    result.part.fraction = ((temp - result.part.integer) * 4194304 + 1) / 2;
    if (sign)
        result.full = -result.full;
    return (result.full);
}

Usage
When using the fixed-point variables in an application, first declare variables using the new typedef. Next, use the new variables in the application as defined in the following paragraphs.

When adding, subtracting, or performing logical comparisons on two variables with the same-resolution fixed-point numbers, use the .full portion of the union and perform straight integer arithmetic. Listing 3 shows a small routine that performs each operation and prints the result for the number of type FIXED1_7. In this routine, the numbers are added and the result is printed to the display. A set of comparisons are then done to determine if a subtraction would work, and if so, the subtraction is performed and the results written to the display. If a subtraction would result in an unsigned underflow, an error is displayed.

Listing 3: Simple math on same granularity variables

void Test1_7(FIXED1_7 a, FIXED1_7 b)
{
    FIXED1_7 temp;

    printf("Results of operations on 1_7 variables\n");
    temp.full = a.full + b.full;
    printf("Addition result is %d.%2.2d\n", temp.part.integer,
           (temp.part.fraction * 100 + 64) / 128);
    if (a.full < b.full)
    {
        printf("a is less than b. Subtraction overflows.\n");
    }
    if (a.full == b.full)
    {
        printf("a is the same as b. Result = 0.\n");
    }
    if (a.full > b.full)
    {
        temp.full = a.full - b.full;
        printf("Subtraction result is %d.%2.2d\n", temp.part.integer,
               (temp.part.fraction * 100 + 64) / 128);
    }
}

When adding, subtracting, or performing logical comparisons on two variables with different-resolution fixed-point numbers, first scale the different resolution number to the same resolution as the result. Listing 4 shows a small routine that performs some mixed operations and prints the results. This routine performs the same function as Listing 3.

Listing 4: Simple math on different granularity variables

void Test7_9(FIXED7_9 a, FIXED1_7 b)
{
    FIXED7_9 temp;

    printf("\nResults of operations on 7_9 and 1_7 variables\n");
    temp.full = a.full + (b.full << 2);
    printf("Addition result is %d.%1.1d\n", temp.part.integer,
           (temp.part.fraction * 10 + 256) / 512);
    if (a.full < (b.full << 2))
    {
        printf("a is less than b. Subtraction overflows.\n");
    }
    if (a.full == (b.full << 2))
    {
        printf("a is the same as b. Result = 0.\n");
    }
    if (a.full > (b.full << 2))
    {
        temp.full = a.full - (b.full << 2);
        printf("Subtraction result is %d.%1.1d\n", temp.part.integer,
               (temp.part.fraction * 10 + 256) / 512);
    }
}

When multiplying or dividing variables of the same-resolution fixed-point numbers, use the macros previously defined. Listing 5 shows a small routine that performs each operation and prints the result. This routine simply performs the multiply and divide without checking, then displays the results. With multiply and divide, you cannot pass the full value, so you must pass the structure name.

Listing 5: Multiplication and division on same granularity variables

void Test7_9_X(FIXED7_9 a, FIXED7_9 b)
{
    FIXED7_9 temp;

    printf("\nResults of multiply and divide on 7_9 variables.\n");
    temp.full = MULT7_9(a,b);
    printf("Multiply result is %d.%1.1d\n", temp.part.integer,
           (temp.part.fraction * 10 + 256) / 512);
    temp.full = DIV7_9(a,b);
    printf("Divide result is %d.%1.1d\n", temp.part.integer,
           (temp.part.fraction * 10 + 256) / 512);
}

When multiplying or dividing variables of different-resolution fixed-point numbers, use the same macro as the result and scale the operands as with addition and subtraction. Listing 6 shows a small routine that performs an operation like this and prints the results.

Listing 6: Multiplication and division on different granularity

void Test11_21(FIXED11_21 a, FIXED7_9 b)
{
    FIXED11_21 temp;

    printf("\nResults of multiply and divide on 11_21 and 7_9 variables.\n");
    temp.full = b.full << 12;
    temp.full = MULT11_21(a,temp);
    printf("Multiply result is %d.%2.2d\n", temp.part.integer,
           (temp.part.fraction * 100 + 1048576) / 2097152);
    temp.full = b.full << 12;
    temp.full = DIV11_21(a,temp);
    printf("Divide result is %d.%2.2d\n", temp.part.integer,
           (temp.part.fraction * 100 + 1048576) / 2097152);
}

The value can be displayed using the integer and fractional portions of the structure.

Words of caution
When these algorithms are implemented, a few areas of caution need to be addressed. If they aren't, erroneous results may be obtained.

The C language does not check integer arithmetic for overflows or underflows. The programmer must take care to either prevent or check for these conditions. This is usually accomplished by using a subroutine, possibly in assembly language, for each math operation and checking the limits. However, it can also be accomplished by adding bits to the integer portion, and limiting to a smaller number after each math operation.

Depending on the target, the compiler may implement signed integer arithmetic with subroutine calls, in which case fixed-point math may not provide a tremendous benefit over floating point.

A rounding error may be injected in the result due to the value of the least significant bit. Since the expectation is an LSB of 0.01, and in fact the value is something less, an error of 1 LSB is injected that can compound during mathematical operations. For example, using our FIXED1_7 example, 0.02 is represented as a full number of 3, and 0.06 is represented as a full number of 8. The sum should be 0.08, or a full number of 10. The actual result is 11, which is closer to 0.09. To limit this error, use a larger number such as 16 bits, and increase the fractional precision from 7 bits to 15 bits.
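
To see the effect with the FIXED1_7 type and macros from Listing 1 (assuming those definitions are in scope), this fragment reproduces the numbers in the example above:

FIXED1_7 a, b, sum;

a.full   = FIXED1_7CONST(0, 0.02);    /* stored as  3/128, about  0.0234               */
b.full   = FIXED1_7CONST(0, 0.06);    /* stored as  8/128, exactly 0.0625              */
sum.full = a.full + b.full;           /* 11/128, about 0.0859 -- not the expected 0.08 */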

Fixed-point math provides a small, fast alternative to floating-point numbers in situations where small rounding errors are acceptable. After implementing the algorithms described in this article, your application will be able to harness the power of C and still retain the efficiency of assembly.

All examples in this article were tested using Microsoft Visual C++ v. 4.0 Standard edition. The source code for the test files can be found on the web at ftp://ftp.embedded.com/pub/2001/lemieux.

Joseph Lemieux is a senior applied specialist with EDS Embedded Solutions in Troy, MI. He holds an MS in electrical engineering from the University of Michigan and has been writing software for embedded systems in the automotive industry for over 17 years. He can be reached by e-mail at joe.lemieux@eds.com.

-----------------------
http://www.embedded.com/showArticle.jhtml?articleID=15201575