Monday, July 16, 2012

Dissection of a sub-optimal function

Untitled Page

I came across the following function the other day and it cried out to me, “please refactor me, I’m so lonely.”  Alas the code isn’t mine to fix however a dissection of the cadaver should be interesting.

Here’s the code unmodified:

CString ConvertToBase38(CString csDec, int iLen)
{
    csDec.TrimLeft(); csDec.TrimRight();
    const int iMaxDigitsIn = 16;
       const int iMaxDigitsOut = 11;

       CString csResult=_T("");
    int i;
    int iLoops;
    char caBase38Chars[38] = {'0','1','2','3','4','5','6','7','8','9',
       'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O',
       'P','Q','R','S','T','U','V','W','X','Y','Z','-','.'};

    iLoops = csDec.GetLength();
    if(iMaxDigitsIn < iLoops)
    {
              csResult=_T("");
    } else
       {
              __int64 iValue = atoi64(csDec);

              BOOL bConvStarted=FALSE;
              for(i=iMaxDigitsOut-1; i>=0; i--)
              {
                     __int64 iPlaces = 1;
                     for(int j=0;j<i;j++)
                           iPlaces*=((__int64)(38));
                     if(iValue>=iPlaces)
                     {
                           bConvStarted=TRUE;
                           int iChar= (int)(iValue / iPlaces);
                           if(iChar>=0&&iChar<38)
                                  csResult += caBase38Chars[iChar];
                           iValue -= iChar * iPlaces;
                     } else if(bConvStarted)
                     {
                           csResult+=caBase38Chars[0];
                     }
              }
       }

       if(iLen!=0)
       {
              if(csResult.GetLength()>iLen)
                     csResult=_T("");
              while(csResult.GetLength()<iLen)
                     csResult=_T("0")+csResult;
       }
    return csResult;
}

If you're new to C++ (or lucky enough to work exclusively in U/Linux) you might not notice that the code above uses MFC CStrings which limits its portability, but I'll ignore portability since that was exactly why I encountered the code. 
  1. On the first line CString ConvertToBase38(CString csDec, int iLen) we see that a CString object has been passed to the function by value.  This will require a call to the CString copy constructor. Even though performance suffers due to the call to the constructor, MFC CStrings will do a lazy copy and the CString isn’t actually duplicated unless it is modified.  Unfortunately that modification will happen later on.  To avoid copying the CString with its call to the copy constructor we should use a constant reference. CString ConvertToBase38(CString const &csDec, int iLen)   Using a constant reference gives us the benefits of passing a value with a pointer or reference (fast without duplication) but keeps us from inadvertently modifying the variable.
  2. Next, we have csDec.TrimLeft(); csDec.TrimRight(); This modifies the CString forcing the string to be duplicated before it can be modified, thereby losing the benefit of the lazy copy.  Additionally, newer versions of MFC have a CString::Trim function that will remove leading and trailing white space.  But wait, why remove the whitespace at all? The subsequent call to atoi ignores leading whitespace we don’t need to modify the string, which allows us to use the more efficient constant reference.
  3. From there on to char caBase38Chars[38] = {'0','1','2'... Unless the compiler is really smart this array will be created on the stack and initialized every time the function is called, which is unnecessary since the array of base characters never changes.
  4. Before I forget the “iLen” parameter isn’t the length of the incoming string it’s the padding length, the name was confusing, but it became clear during the dissection, it is used in the code to ensure the result is left padded with zeros.
  5. Next the code checks for an input that is too long, this is an extra check that could be skipped, atoi will return 0 if there is a problem, once again like the whitespace, letting atoi handle the unexpected seems a preferable.
  6. After this the author goes to great pains to calculate the base conversion from left to right. Including calculating the value of 38 raised to power “i” with a for loop, the math function “pow” would be a faster since it would use the floating point processor. And faster still, since the values for 38 raised to x don’t change, the values could be stored / retrieved from a constant array. But…I think there is an even better way.

The last time I did an arbitrary base conversion function was in school, it was an assembly language class so I had the benefit of a stack and the full return from a divide operation (result and modulo).  I remember calculating from right to left using the stack to reorder the digits. Needless to say the assembly version was really fast and efficient. 
I think the same can be accomplished with some well written C++/C code.  Below is what I came up with.

#define MAX38DIGITS 11
const char BASE38CHARS[38] = {'0','1','2','3','4','5','6','7','8','9',
       'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O',
       'P','Q','R','S','T','U','V','W','X','Y','Z','-','.'};


char* ConvertBase10to38(const char *inVal, char *outVal, int maxLen, int padLen)
{
       long long value = atoi64(inVal);
       char *ptr = outVal;

       // build the base 38 number (reversed)
       while (value != 0 && (ptr - outVal) < maxLen)
       {
              int mod = value % 38;
              *ptr = BASE38CHARS[mod];
              value /= 38;
              ptr++;
       }
       // pad the string if needed
       while (ptr-outVal < padLen)
       {
              *ptr = BASE38CHARS[0]; ptr++;
       }

       // null terminate
       *ptr = 0;

       // reverse the string
       reverse(outVal, ptr);

       return outVal;
}

The profiler reports that my code as 5.7x faster with the slowest part being the “std::reverse” which improved to 6.7x faster with a custom reverse function.  Which is cool.

Wednesday, June 13, 2012

Exorcising the Bad Programmer in all of us

Dr. Dobb's most recent editorial What Makes Bad Programmers Different? was just a bit too binary for me. In my opinion every software developer occasionally slips and commits one of the sins he lists:
  • Their code is large, messy, and bug laden.
  • They have very superficial knowledge of their problem domain and their tools.
  • Their code has a lot of copy/paste and they have very little interest in techniques that reduce it.
  • The fail to account for edge cases, while inefficiently dealing with the general case.
  • They never have time to comment their code or break it into smaller pieces.
  • Empirical evidence plays no role in their decisions.

  • Which of us hasn't at one time or another rushed by a schedule or deadline let slip a programming peccadillo.  I know I have, the boss needs this right now, or I have to get this code checked in so John can use it for his part.

    I think the key is to find a way to exorcise the "Bad Programmer Demon" in all of us. Recently I've found that a daily "refactor" works for me. In my current position I have 2 projects that while I am not the only developer I have the majority ownership/responsibility for, each day I look for one thing that could be improved, commented or removed.  So far the results have been encouraging. Fewer bugs are slipping though to testing and overall the products "feel" more robust. Additionally I've noticed that the prospect of having to return to the code for a future refactor motivates me to do it right the first time.

    Just my $0.02

    chris' shared items

    Twitter Updates

    Official blog of Chris Lee Runyan

    Fastest C++ in the west.