Paul Biggar's blog archive

Introducing Malicious Code Reviews

I’m inventing a new sport today, which I call “malicious code reviews”. I spent a few hours reading some really very bad code, and in retaliation against its author(s), I’m going to code review it [1]. The code comes from PHP version 5.2.8, the latest stable release. This particular file is Zend/zend_operators.h [2]. You might want to open it in a new window, or in a popup, so that you can follow along.


I’ll start [3] at the top:

#if 0&&HAVE_BCMATH #include "ext/bcmath/libbcmath/src/bcmath.h" #endif

I’ve skipped a few minor problems to go straight to the laughably poor, the #if 0. Funny story though: I saw a “code review” in PHP recently which chastised the addition of a #if 0. I initially thought that, finally, someone is actually stepping up to stop the rot within the PHP engine. Sadly, they instead complained that according to the rules of the PHP project, an #if 0 must also have the author’s name added to it. The mind boggles.

There is a limited amount of reasonable code, which I’ll skip, followed by the is_numeric_string() function [4], only one of the finest examples of poor code I’ve ever seen. I linked to it above, so I recommend that you actually read along as I go. This will be no fun unless you can actually see it. It starts off, surprisingly, with the most thorough comment I have yet to see in PHP. It is merely average in terms of what you might read in a gcc source file, but here it is a shiny gold nugget floating in a murky brown sea. However, it degrades fairly rapidly. You might notice this giant function is in a header, and that it is declared to be static inline. This is a prelude for what’s to come. The function starts with some whitespace skipping [5]:

while (*str == ' ' || *str == '\t' || *str == '\n'
      || *str == '\r' || *str == '\v' || *str == '\f') {
  str++;
  length--;
}

Just two lines above the function they have two macros, called ZEND_IS_DIGIT and ZEND_IS_XDIGIT. Could they not have added ZEND_IS_WHITESPACE? A pity, but a tiny flaw compared to the gaping maw of despair that follows a little bit later. The code continues fine: check for a digit, check if its hex (with comments, very good) until we come to this line:

for (type = IS_LONG;
    !(digits >= MAX_LENGTH_OF_LONG
       && (dval || allow_errors == 1));
     digits++, ptr++)

I’d like somebody to come forward and explain why

type = IS_LONG

is in the loop initialization statement. And why the loop condition is so unreadable. And why the elements of the loop header are not related in any way at all!!! But this is just the start. The next line is a doozy:

check_digits:

Do you feel the fear? I feel the fear. Its a label. That means that somewhere in this function, there is a goto. Not that there’s anything wrong with gotos. Sure, if misused they can lead to unreadable, spaghetti co– OH MY GOOD GOD. I’ve found some gotos, but they go to a different label. Two labels. And the first one is in a for-loop! Don’t panic, maybe its readable. Maybe the second one is also in the for-loop. Please? Pretty please?

Fuck. Fuck fuck fuck. I’d like to suggest you try to work it out yourself, but you’d probably prefer not to. If you haven’t looked at the code yet, now is the time. Don’t worry if you don’t know C – with code like this, knowing the language is not the advantage you’d expect.

The first label, check_digits, is in a for-loop. That for-loop has two gotos (and one continue and one break, just to make things more readable), which both go to the other label process_double. process_double is outside the loop, deeply nested in a completely separate series of if-else statements. They have also given up on comments by now). After checking a few more conditions, only then do you jump back into the previous for-loop!!!! Oh wait. No, that’s not right. They’re in different paths. There actually added control-flow edges from an else-body to within a for-loop in the if-body. Just wow.

I’d like to say that this horrendous function is over. After all, this is just the first function I’ve come across. But while there is only simple (but uncommented) code remaining in the function, I have a final nit to pick.

if (ptr != str + length) {
    if (!allow_errors) {
        return 0;
    }
    if (allow_errors == -1) {
        zend_error(E_NOTICE,
            "A non well formed numeric value encountered");
    }
}

There is a check to see if allow_errors is -1, even though the comment only mentioned two possible values for allow_errors. So what does it mean for it to be -1? We’re saved from figuring it out because the check can’t even trigger. If allow_errors was non-zero, the function would have returned already.

Now, you might consider this a minor nit, and its easy to see that it wasn’t fixed when you consider how deeply it was nested [6]. But this sort of thing is the rule, not the exception in PHP sources. Broken windows built on top of other broken windows.


The rest of the header is OK, considering. There is a macro that’s not appropriately guarded [7], and a few with no guards and no comments. A few macros use their parameters more than once (bane of post-incrementers everywhere), and some duplicate some code between them, or write the same code in multiple ways [8]. There are then a ton of macros which can be used as lvalues:

#define Z_DVAL(zval)            (zval).value.dval
#define Z_STRVAL(zval)          (zval).value.str.val
#define Z_STRLEN(zval)          (zval).value.str.len

except one

#define Z_BVAL(zval)            ((zend_bool)(zval).value.lval)

where an enterprising soul mustn’t have thought hard before committing. PHP has long been lambasted for its inconsistency; it turns out that this inconsistency is not limited to just its libraries, syntax or semantics.


To finish off this file is a frankly baffling piece of code. Let this be a lesson about commenting. Sometime the ‘why’ of the comment is not sufficient – occasionally, you will need to explain what the code is doing.

#if HAVE_SETLOCALE && defined(ZEND_WIN32) && !defined(ZTS) \\
    && defined(_MSC_VER) && (_MSC_VER >= 1400)
/* This is performance improvement of tolower() on Windows
 * and VC2005
 * GIves 10-18% on bench.php
 */

And we’re done

Unfortunately, this file is not an isolated incident. The entire Zend/ directory – the core of the whole PHP implementation – is a filthy mess. While the is_numeric_string() function might be the worst code I have ever seen, most of the Zend/ files contain a lot of hideous code: poor organized, badly written, badly documented, unreadable messes.


Notes:

[1] I’m aware that as a “code review”, this is actually pretty poor, primarily due to its lack of constructive criticism. So this is more of a detailed, flame-bait-y rant about the quality of code in this file, which I will use to assure people that the rest of the code I’ve seen in the PHP project is of similar caliber. And none of that is really very constructive.

As it happens, I am preparing a much more constructive piece about why the code in PHP is so bad, how it got this way, and how to fix it. But first I’d like to demonstrate how poor the code currently is, and it’s difficult to do this without some bile and vitriol.

[2] I was originally planning to do zend_operators.c, and thought I’d quickly do the header first. But this was so poor that I ended up writing about it.

[3] I didn’t want to bog down the intro with my method, but it should probably be here anyway.

[4] Unfortunately, is_numeric_string() was cleaned up at some point (though the version I review here is what’s in all the 5.2.x releases, and is scheduled to be in 5.3), so this post loses a smidgen of its sting.

[5] I should point out that I’ve tidied up the code to fit in the blog. So if you’re thinking “at least the lines aren’t too long”, well, they are.

[6] When they fixed this particular piece of code, the allow_errors checks became much less obfuscated, but still was not removed, sadly.

[7] A macro guard (if that is indeed the right name, and not one I just made up), is when you wrap a macro in a do-while(0) loop.

[8] If you know Zend internals, spot the difference:

SEPARATE_ZVAL_IF_NOT_REF(ppzv);

and

if (!(*ppzv)->is_ref) SEPARATE_ZVAL(ppzv);