An introductory guide to optimizing your program, Part III


by: Robert J Ohannessian

Introduction:

Welcome to the third installment of the code optimization guide. In this article, I will present some speed ups for searching, and a small intro to SIMD. Note that I am assuming that you have previously read Part II of the article.

Searching:

Searching is a very strange algorithm. It involves looking into a data set for the presence of an element, and returning information about this element. This information can be either value, attached object, or even simply a boolean indicating if what is searched for exists in the data set. Searching is also very much used in game development, although it may not be obvious at first. Let's say you just built a tile engine for your platformer or RPG. Your game will most probably contain sprites (little images that are not part of the usually static tile map), such as the protagonist. Part of your engine will try and determine which sprites are visible on screen, and which aren't. We don't want to draw the ones that aren't on screen!* Another example would be if you wanted to know which bullets are close enough to your ship for the (expensive) per-pixel collision check to be done. "But, I can just use a bounding box for that", you'd say. Well, you'd be right, except if you had thousands of bullets, and you needed to make the checks 50 or 100 times per second - now you'd be in trouble. If every bullet needs to be checked to every other sprite, then we're running an O(n^2) algorithm! Another example used in game development is if you're writing a text parser for a script interpreter. You'd want searching for textual matches to be done quickly.

I'll be introducing several algorithms, none of which are optimal, but each one has strengths and weaknesses. It's up to you to decide which one to use and where.

* Yes, I'm aware that Allegro has automatic clipping, but the point I'm trying to make applies to broader range of problems. In 3D, for example, clipping of a mesh against the view frustum (or a random plane) is a very expensive operation, so you'd want to minimize those.

Brute Force:

Brute force is possibly the simplest approach to the problem. The algorithm is trivial: scan the data set linearly until a match is found, or until the end is reached in which case there is no match. Crude, but effective :)

Let's say you have the array:

    4 7 67 19

You want to know if '67' is in there. You'd first check if '67' was in position 0, then you'd check at position 1, and so on. At position 2, you notice '67' is there, so you return true - the number was found. Let's say however that you are looking for the number '100'. As done previously, you'd look first at position 0, then position 1, etc. At position 4, you realize that you are past the end of the array; you have no choice but to stop here and return no match (false). The algorithm for such a search is trivial:

    /* Search in 'array', of size 'n' for value 'x' */

    for (i = 0; i < n; i++) {
        if (array[i] == x)
            return true;
    }

    return false;

Brute force is also effective when you have linked lists, as they can be traversed linearly:

    /* Search in 'list', of size 'n' for value 'x' */

    current = list;

    while (current) {
        if (current->value == x)
            return true;

        current = current->next;
    }

    return false;

Let's do a bit of timing analysis now. The worst case happens when the element you're looking for is not in the data set. In that case, you need to go through the entire list to confirm this. Thus Brute Force is O(n).

This seems like a very good start, especially compared to those nasty O(n*ln(n)) we had when sorting, doesn't it? Well not really; as always, we can do better!

Binary Search:

Every problem can be resolved with sufficient amount of brute force (Slashdot sig)

Although the previous quote is quite right, it is not really optimal to do so. So let's explore alternatives! We saw that linear search works for any kind of data in any order, since it will check all of the data set anyway. What if we can impose constraints on the data set so that it only behaves the way we want it to; we'd then be able to exploit these constraints to our advantage. Ok, this sounds nicely theoretical and clear as mud. I'll try and explain what I mean via an example. Imagine we are searching for the number 100 in the following array:

  1 6 19 23 24 45 47 54 90 95 100 104 120 170

Notice how it's sorted; that would be our constraint. So, how can we take advantage of that fact? Well, we know for sure that as we go right, the numbers increase, and as we go left, they decrease. We want to try to minimize movements (and tests, by extension), so we'd like to pick a starting point where we'd move the least on average. Let's take the center of the array.

  1 6 19 23 24 45 47 54 90 95 100 104 120 170
                  ^^

The center value is 47. We are looking for 100, so we know it can only be on the right of 47 and never on the left (remember, our array is sorted!). Well guess what? We just cut our time in half! That's right, we're now at o(n/2) instead of o(n).

But wait, we can still do better! Let's lay out the right part of the array:

47 54 90 95 100 104 120 170

We know that either 100 is in this right part, or it's not at all in the array. This piece of the array is still sorted though, so we can apply the same technique as we did before, that is, pick the center of it:

47 54 90 95 100 104 120 170
         ^^

The number 95 is at the center this time. We know for sure that 100 will have to be at the right of 95, or not at all. If we'd found a number larger than 100, then we'd need to move left instead of right. Once again, we've divided the worst case time by two. In either case, we can continue repeating the algorithm until we either find our value, or we divided the array into a segment of less than one element. In that final case, the value we are looking for isn't present in the array.

As a small speed up, we can remove the end-point when we split up, since we know that the value we're looking for isn't there. For example, we could omit '47' from the array above, since we already checked if it was equal to 100 a few lines before.

The code for the search is a bit more complex than that of the previous method, but it's still easy to follow:

    /* Search 'array', of length 'size' for element 'value' */
    int binary_search(int *array, int size, int value) {

        int x1 = 0, x2 = size;
        int center;

        if (size <= 0)
            return FALSE;

        /* Loop while the segment size (x2 - x1) is larger than 1 */
        while (x1 < x2) {

            /* Pick the middle of the segment */
            int mid = (x1 + x2) / 2;
            center = array[mid];

            /* Did we find it? */
            if (center == value)
                return TRUE;

            /* Shift the search area if the value wasn't found */
            else if (center < value)
                x1 = mid + 1;
            else
                x2 = mid - 1;
        }

        return FALSE;
    }

Once again, let's analyze the running time of our routine. At every iteration, we're dividing the search area by two (roughly), so that after x passes, the area is 2^x times smaller. We fall in logarithmic time, and our algorithm is actually O(log2(n)) == O(ln(n)). That's pretty good!

So, we now have an algorithm that is very quick at finding things, but relies on the fact that the data set must be sorted. If elements are added or removed very rarely compared to the number of times the data set is searched, then it would be a good idea to sort the set every time we make a change to it. This will definitely speed up searching, which may be a bottleneck.

On the other hand, if searching isn't the bottleneck, or isn't done very often, then it may not be worth sorting the array every time there is a change. Remember: sorting is much more expensive than even brute force searching!

We're not done yet though - look at the scroll bar on the right, we're about half way there :D

Hashing:

Ah, one of my favorite topics - Hashing. Hashing is the act of creating a hash. So what is a hash? A hash is a transformation of our data into a form which may be more practical to use. Ok, let's take an example. Wouldn't it be better if value 100 was always at position 100 in the array, and that no other value could occupy that spot? It would then be trivial to know if 100 is in the array or not, we can simply check the location 100 to see if there's anything there or not. That's O(1)!! This method is very practical in that sense. However, the algorithm we have outlined above is rather limited. What if we wanted to store a very large range of numbers, such as -1 billion to +1 billion? We'd need 8 GB of RAM just for the array! That's no good. We need to make some changes; Enter the hash function.

The hash function's purpose is to map each data element to a position in the array, optimally in a way so that the elements are evenly distributed. An example hash function would be (ABS(value) % size), where 'value' is clamped to the range [0..size[. This might seem good enough, but poses a rather large problem. What happens if two numbers end up with the same hash value? For example, let's say the array is 100 elements large, and we want to insert element 1 and 101. Both will have '1' as a hash value, so which one of them is placed in position '1' in the array? It's not so simple anymore. We'll talk about this a little further down, but let's work on hash functions first.

Before we can think up hash functions, we need to know a little bit about our data. Is it a bunch of strings? If so, are they the same length? Are they composed of ASCII characters? Do they have digits? Is the data purely numerical? Is it integers? What is the range of those numbers? And so on. Once we know a little about our data, we need to think of a transformation function. The resulting numbers should be as evenly dispersed as possible to minimize collisions, that is, we don't want two pieces of data to end up with the same hash value if we can help it. Let's take an example application. We are building a spell checker that uses a dictionary to find whether a word is correct or not. We could do a brute search through all the dictionary to find matches, but that would be too slow, especially if we need to correct a 3000 word essay and have a 100,000 word dictionary. So we decide that we should use hashing to help it go faster. First, we need to write down what we know about the data set. The data will be English words: strings of length from 1 to 25 or so letters, containing only ASCII characters.

If any of you have used a real dictionary before (yes I know some of you have, you don't need to hide behind your keyboard :), well you might have noticed that the dictionary is ordered alphabetically, in 26 sections, one for each letter. When you search for a word in the dictionary, you start by finding the section with the same first letter as the word, and you'd look only in that section, and no other. Well we can do the same in our program! We just need to tell our program to only look in the words that start with the same letter as the word we want to correct, and no other section of the dictionary. For example, we can have a structure with 26 arrays, one for each letter. When we need to spell check a word, we just look up the first letter of the word, then compare that word with the ones in the proper array only. This greatly reduces the number of iterations of our spell check function, thus making the task really fast. On average, we just do 4000 compares/word, right? Well not quite, in English, there a lot more words that start with 'e' than 'z', so our hash isn't perfect. It does reduce the workload though, and that's what's important. We could also use another 26 arrays (in each of the previous 26 ones) for the second letter, giving an even smaller data set to compare to. Alternatively, we could compute some 16-bit CRC on the strings and use that as an index. This gives slightly better results in terms of repartition, but isn't so trivial to code anymore. You get the idea.

There is no One True Way (tm) to do hashing, it all depends on the application and your imagination :)

Let's take the example of Set Up Us the Bomb!!! once again. Here, we needed to sort polygons so we can draw them back to front. We can't use BSP trees since the world is dynamic. Octree or KD-Trees would also be an overkill. So what we decided to do is use a hash on each polygon and get a value out of its position. We can then sort by the hash and expect proper sorting, provided the hash was well picked.

For reference, the hash function we used was:

  poly->hash = total_z / 8 + max_z + SQDIST2(i-xc + .5, j-yc + .5) * 2;

where total_z is the sum of all z values of the polygon (after transformation), max_z is the maximum z value, (i, j) is the coordinate of the current tile, and (xc, yc) is the coordinate of the camera. Although this function works very well for SUUTB, it may not work well for a game like iD's Quake for example. If you must a use a hash function, be prepared to endlessly tweak it to get the best possible results.

Let's see the code for the dictionary alphabetical hash:

    /* Array of strings, each containing one word */
    char *dictionary[NUM_WORDS];

    /* This array contains the offset in the dictionary array for
     * the start of a certain letter
     */
    int  dict_hash_start[27];


    /* This function will check the dictionary to see if the word exists
     * in it. How it works: We check the first letter of the word we're
     * looking for, and look it up in dict_hash_start/end. This will give
     * us a very small range of words in the dictionary to compare to. The
     * hash start/end arrays will need to be properly filled in before
     * calling this function.
     */
    int is_in_dictionary(char *word) {

        char first_letter;
        int start, end;
        int i;

        /* The word can't be in the dictionary if there is no word... */
        if (!word || !strlen(word))
            return FALSE;

        /* Get the first letter of the word, capitalize it, then convert it
         * to a 0-25 offset
         */
        first_letter = toupper(word[0]) - 'A';

        /* Get search bounds */
        start = dict_hash_start[first_letter];
        end = dict_hash_start[first_letter + 1];

        for (i = start; i < end; i++) {
            /* Did we find the word? */
            if (!strcmp(word, dictionary[i]))
                return TRUE;
        }

        /* The word isn't in the dictionary */
        return FALSE;
    }

Ok, so now that we can compute hashes, we need to deal with collisions. As we mentioned above, collisions occur when the hash function returns the same hash value for different data. Since we can't store two things at the same place in an array, we need to deal with the previous fact. Let's have a look some the array used for hashing:

  [ X......X.XX.X..X...X ]

  X: Data
  .: No Data

As we can see, there are a lot of free spots, which is what usually happens when we use hashing. So what can we do when we need to put two Xs in one location? Well we can simply try the next location and see if that one is free. If it is, then all is well; we can place the data there. Otherwise, we can keep repeating this until a free slot was found. When we need to check if the data is in the table, we first compute the hash value and look at that position. If it's not there, we look at the next position, and so on, until we either find it, or loop back to the hash position, which will mean that the element is not in the table after all (remember to check all elements after and before the hash!). This is a rather inefficient method for searching, since it's O(n) (like the brute force!). However, it averages 1 on sparse data, so it is still better. We want to get rid of that O(n) factor though, because we don't want our nice hashing data structure to be just as efficient as brute-force. Another solution would be to compute a second (different) hash if ever the first hash collides, and if that one collides too, then we resort to linear insertion as above. This helps when the table isn't very populated, or when there are large clusters. However it's still O(n) because the second hash can fail too.

Another solution that can be proposed is to keep a binary tree at every table location. In that case, we can insert in O(n) (using insertion sort) if there is a collision, but can look it up in O(ln(n)) using binary search. Combining algorithms makes it go really fast now! We've achieved searching in O(ln(n)), but a complexity of 1 on average!

So you see, there's nothing stopping you from mix-matching techniques to get better results.

Intro to SIMD:

Here we will be getting into quite an advanced topic, so I'll only cover the basics. The rest can be found in any good (and recent!) assembly book, or from other people's source code. SIMD techniques should only be used once you've reached the minimum Big-Oh for some algorithm, and only for algorithms that 1) need them, and 2) can benefit from them. Let me repeat: Do not use this until you are sure that this code will not change, and that you cannot reduce the algorithm to a smaller Big-Oh! Otherwise, you'll just be wasting your time because you'll need to restart from scratch when the algorithm changes.

Let's start with defining what SIMD is. SIMD stands for Single Instruction Multiple Data. In short, it's a special type of CPU instruction that performs a simple task on multiple pieces of data at the same time. These techniques are very powerful and can make a certain routine two, five or sometimes ten times faster than the original one! For example, if you can add two numbers at the same time, then you've effectively doubled the speed of addition.

Consider the following loop:

    int sum = 0;

    for (i = 0; i < num; i++)
        sum = sum + array[i];

If we can do two additions at the same time (and only if!), then the loop would look like this:

    int sum;
    int sum1 = 0,
        sum2 = 0;

    for (i = 0; i < num; i += 2) {
        sum1 = sum1 + array[i];
        sum2 = sum2 + array[i+1];
    }

    if (num & 1)
        sum1 += array[num-1];

    sum = sum1 + sum2;

Those two loops return the same results at the end, except the second one has the potential to be done much faster. Notice that we have to explicitly take care of the case when there is an odd number of elements to add together.

Your C compiler however will not do this for you, you must code this yourself using assembly. Why is that? Well that's because SIMD is a relatively new thing, and not all CPUs support them (it's not so simple in real-life - there are many technical reasons for not doing SIMD at the C compiler level, but I won't venture into that mine-field). So the compiler takes the easy way out and doesn't generally use SIMD if you don't.

However, you can always do SIMD, even in plain C. Let's take the above loop as an example. Let's assume that the sum will never exceed 32,000. In that case, we can use the lower 16 bits of an int to store one running sum, and the upper 16 bits to store the other running sum. Notice how we're making big assumptions here: that the running sum will never exceed the value of 32,000, and that ints are exactly 32 bits. It is important to at least document these, as if ever the sum does not respect this assumption, then the code will break down and the result will be incorrect. The program won't lock up, shut down, or anything else; it'll just return an incorrect result.

Here's the code for the "improved" summing algorithm:

    int sum = 0;
    short *array;

    for (i = 0; i < num / 2; i++) {
        sum = sum + ((int*)array)[i];
    }

    if (num & 1)
        sum += array[num-1];

    sum = (sum >> 16) + (sum & 65535);

So you see, there's only one addition where there was previously two! I'll need to explain a few things though. Firstly, I'm assuming that array is of type (short), and that two shorts make an int. It doesn't matter what the actual size of shorts and ints are as long as there are twice as many bits in the int than there are in the short. This allows me to use half of the bits to store one running sum, and the full adder of the CPU to perform both summing operation at the same time.

Here's a small diagram of what we're doing:

  31              15               0
  ----------------------------------
  |    sum1       |     sum2       |
  ----------------------------------

sum1 will contain the sum of all odd elements, while sum2 will contain the sum of all even elements. Once the run through the loop is complete, we need to add these two running sums to get the total. We do this by shifting the top part to the right, and adding it to the bottom part.

Still following? Good.

When using SIMD, you'll always run into these trade-off problems: What is the range of my data? How does my data behave? Am I guaranteed that the assumptions will always hold? If not, then I should probably check that they do, if that's at all possible. Those are the same kind of questions you'd ask when doing hashing, and those questions are the ones you should -always- be asking yourself, if you want your program to work correctly.

SIMD doesn't only let you do running sums; you can also do multiplications, binary operations (and, or, xor) on packed values as well, etc. We'll get to some assembly coding a little later, but first, let's get the principles right.

Here's another (more complex) example of SIMD using C. If you remember Pixelate #1, I posted some code to do fast fading in 16 bpp. The algorithm for that operation is to take the red component of the source color, multiply it by some fraction lesser than one (to darken it), then place it correctly in the destination pixel, and repeat for the green and blue components. If you don't understand how colors are coded in 16 bit depth, then have a look at Draw- && Pixel Techniques ( Pixelate #3), which explains how the computer understands color values. That algorithm is actually a very tedious thing to do, and the routine would be very slow if I used the method I just described. Instead I'm going to do the same operations, but on all three components at the same time, thus saving me 2/3 of the work!

Here's the main loop, cut down slightly to make it easier to understand:

    for (y = 0; y < src->h; y++) {

        for (x = 0; x < src->w; x++) {
            unsigned long color;

            /* 1. Read the source pixel */
            color = source[x];

            /* 2. Split out the components */
            color = (color | (color << 16)) & 0x07E0F81F;

            /* 3. Multiply the components by the fraction */
            color = ((color * factor) >> 5) & 0x07E0F81F;

            /* 4. recombine components to a packed pixel format */
            color = color | (color >> 16);

            /* 5. Write pixel to the bitmap */
            bmp_write16(d, color);
            d += sizeof(short);
        }
    }

1. The first step involves reading the pixel from the source bitmap. You can use getpixel() if you like, but in the interest of speed, I'm using the line[] pointers (not shown).

2. Then we need a way to split up the components, spacing them enough so that if one of them overflows, then it won't overflow over the other components, which would ruin our calculations. Well, notice that in 16 bit color, the top 16 bit of our 32 bit variable 'color' is unused. Well, we can just shift the green component there! Using a bit of binary operators and bit shifting, we end up with the following:

  31              15                0
  -----------------------------------
  |................|RRRRRGGGGGGBBBBB|
  -----------------------------------

which is turned into:

  31              15                0
  -----------------------------------
  |.....GGGGGG.....|RRRRR......BBBBB|
  -----------------------------------

Now each component has adequate spacing around it, and everything is stored in one variable, making SIMD simple to use.

3. In this step, we do the multiplication by our factor. The factor must be in the range [0..32] so we multiply the components by it, then divide by 32. This works because if factor is less than 32, then the end component gets faded out just like we want (new_color = old_color * factor / 32). We need the final bit mask to get rid of possible garbage caused by the multiply-shift operations. In effect, we used a single multiply to do three!

  color:
  31              15                0
  -----------------------------------
  |.....GGGGGG.....|RRRRR......BBBBB|
  -----------------------------------

  color * fact:
  31              15                0
  -----------------------------------
  |GGGGGGGGGGGRRRRR|RRRRR.BBBBBBBBBB|
  -----------------------------------

  (color * fact) >> 5:
  31              15                0
  -----------------------------------
  |.....GGGGGGGGGGG|RRRRRRRRRR.BBBBB|
  -----------------------------------

  And the clean up:
  31              15                0
  -----------------------------------
  |.....GGGGGG.....|RRRRR......BBBBB|
  -----------------------------------

4. Ok, now that we have the new component values, we need to re-pack our pixel in 16 bit format so that we can place it back in the bitmap. For that, we can simply reverse the operation done in step 2: We shift the upper part down by 16, and OR it with the lower bits:

  31              15                0
  -----------------------------------
  |................|RRRRRGGGGGGBBBBB|
  -----------------------------------

5. Finally, we just write the new value in the destination, and we're done with fading!

That was easy wasn't it? If you understand what is going on in that code up there, then you understand how SIMD works. The problem comes from evaluating which algorithm can benefit most from SIMD; this problem we will discuss next time. I'll also give a more complex example (in assembly) in the next article. In the mean time, see if you can apply these short lessons to your projects.

Happy coding!

Robert J Ohannessian