MMX Instructions

In 1997 Intel added a set of instructions to their Pentium processor that were expected to be useful for audio and image processing. This extension was called MMX. Its major feature was the ability to add four 16-bit values in a single instruction (or alternatively eight 8-bit values or two 32-bit values).

It also features a mode where the result of the addition ‘saturates’ rather than wraps. For instance, 100+200 = 255 using unsigned 8-bit saturation arithmetic. This is particularly useful in image and audio processing as wrapping often does not make sense in these cases. For instance, suppose we have a bitmap image where, after a header, every pixel is stored as 3 bytes in a row. Suppose we wish to make the image a bit lighter. It is simple to achieve this, as shown below.

// Assume 'data' is an unsigned char array holding the bitmap data,
// excluding the header. len holds the length of this array.
for ( int pos = 0 ; pos != len ; ++pos )
{
   int newValue = data[pos] + 50;
   if ( newValue < 0 )
      newValue = 0;
   if ( newValue > 255 )
      newValue = 255;
   data[pos] = (unsigned char)newValue;
}

If I run this on my test image (which contains 844,048 pixels) it takes about 3.1ms, excluding the time to read from and write to file.
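To make the difference between wrapping and saturating concrete, here is a minimal sketch of both for a single byte. The function names add_wrap, add_saturate and wrap_vs_saturate_example are my own for illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstdint>

// Wrapping add: the sum is reduced modulo 256, which is what plain
// unsigned 8-bit arithmetic does.
std::uint8_t add_wrap(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a + b);
}

// Saturating add: the sum is clamped to the largest 8-bit value, 255.
std::uint8_t add_saturate(std::uint8_t a, std::uint8_t b)
{
    unsigned sum = static_cast<unsigned>(a) + b;
    return static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
}

// The 100+200 case from the text.
void wrap_vs_saturate_example()
{
    assert(add_wrap(100, 200) == 44);       // 300 - 256
    assert(add_saturate(100, 200) == 255);  // clamped
}
```

For a pixel, the wrapped result would turn a bright value into a dark one, which is why the clamped result is the one we want here.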

I wondered how much faster it would be using the MMX byte-wise saturated add instruction. I shall be using the MSVC compiler and so MASM style inline assembly will be used. The instruction required is paddusb. The first parameter is an MMX register and the second parameter is either an MMX register or a memory location. The result is written to the first parameter.
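As a rough mental model of what paddusb computes, the sketch below does the same thing in plain C++, one byte lane at a time. It is only a model of the result, not of the speed: the real instruction handles all eight lanes at once. The names paddusb_model and paddusb_model_example are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Plain C++ model of "paddusb dst, src": eight independent unsigned
// byte additions, each clamped to 255. The result is written to dst,
// mirroring how the instruction writes to its first parameter.
void paddusb_model(std::uint8_t dst[8], const std::uint8_t src[8])
{
    for (std::size_t lane = 0; lane < 8; ++lane)
    {
        unsigned sum = static_cast<unsigned>(dst[lane]) + src[lane];
        dst[lane] = static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
    }
}

void paddusb_model_example()
{
    std::uint8_t pixels[8] = {0, 10, 100, 200, 205, 206, 250, 255};
    const std::uint8_t toAdd[8] = {50, 50, 50, 50, 50, 50, 50, 50};
    paddusb_model(pixels, toAdd);
    // pixels is now {50, 60, 150, 250, 255, 255, 255, 255}
    assert(pixels[3] == 250);
    assert(pixels[5] == 255);
}
```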

So the basic idea is that we loop over ‘data’, load the current 8-byte chunk into an MMX register, perform the saturated add and then store the result back into the memory location we found it in. This gives us something like this:

unsigned char* end = data + len;
unsigned char toAdd[] = {50,50,50,50,50,50,50,50};
while ( data < end )
{
   __asm mov eax,data
   __asm movq MM0,[eax]
   __asm paddusb MM0,toAdd
   __asm movq [eax],MM0
   data += 8;
}
__asm emms // Restore state
// This assumes len is a multiple of 8. If it is not, the last
// iteration touches bytes past the end of the array, so the final
// few bytes should be adjusted without the use of these special
// instructions. I won't do this for demo purposes

Note that the ‘emms’ instruction is required after we have finished with MMX, as otherwise floating point calculations won’t work properly (MMX and the floating point instructions use the same registers). ‘movq’ accepts a memory address as the second parameter, so we use the location that ‘data’ is pointing to. (We have to dereference the pointer, so its value must first be in a register. An instruction like ‘movq MM0,[data]’ would not do this.)

The same example now runs in 0.4ms – nearly 8 times faster.

Finally, we can optimise this a little. We are using ‘data’ as a counter and then keep transferring it to ‘eax’. Why not just use ‘eax’ directly? Also, since we are not short on registers, we can keep ‘toAdd’ in a register for the whole loop. Let’s leave it in MM1. This gives us the following final code:

__asm
{
   // eax will contain the current address
   mov eax, data
   // ebx will contain the end address
   mov ebx, data
   add ebx, len
   // MM1 contains the constant to add
   movq MM1,toAdd
TOP:
   // load the next 8 bytes into MM0
   movq MM0,[eax]
   // saturated add of the constant held in MM1
   paddusb MM0,MM1
   movq [eax],MM0
   add eax,8
   cmp eax,ebx
   // jb is the unsigned comparison, as these are addresses
   jb TOP
   // Again, deal with remaining bits here
   // for real production code
   emms
}

It now runs in 0.3ms – over 10 times faster than the original version.
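For completeness, here is one way the leftover bytes could be handled when the length is not a multiple of 8. This is only a sketch under my own assumptions: brighten and add_sat_u8 are hypothetical names, and the 8-byte chunk loop uses a scalar stand-in where the MMX sequence would go, so that the example stays portable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar saturating add for a single byte.
static std::uint8_t add_sat_u8(std::uint8_t value, std::uint8_t amount)
{
    unsigned sum = static_cast<unsigned>(value) + amount;
    return static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
}

// Hypothetical helper: brightens len bytes in place.
void brighten(std::uint8_t* data, std::size_t len, std::uint8_t amount)
{
    // Number of bytes covered by whole 8-byte chunks.
    const std::size_t chunked = len - (len % 8);
    std::size_t pos = 0;

    // Main part: this is where the movq / paddusb / movq sequence
    // would run, one 8-byte chunk per iteration. A scalar stand-in
    // is used here instead.
    for (; pos < chunked; pos += 8)
        for (std::size_t lane = 0; lane < 8; ++lane)
            data[pos + lane] = add_sat_u8(data[pos + lane], amount);

    // Tail: the remaining 0 to 7 bytes, one at a time, so nothing
    // past the end of the array is touched.
    for (; pos < len; ++pos)
        data[pos] = add_sat_u8(data[pos], amount);
}
```

Since the tail is at most 7 bytes, handling it with ordinary code should make no measurable difference to the timings on an image of this size.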

Nothing clever was done here – just some (now) standard instructions were used in the most basic and standard way. However, the increased speed was incredible and so I thought I would share it.
