Write PHP Code in English only: Unicode/UTF-8 and the BOM

First, a general introduction to the choices of human natural language within computer programming languages.

Since the beginning of time, programming languages have supported English and only English: the English version of the alphabet, encoded as ASCII characters [OK, not since the beginning of time, but it was the standard by the time consumer Internet support had solidified]. By ASCII I am referring to the lower 7-bit American character set. Even when other countries were using computers and began programming them, they still used English-based programming languages, even the ones designed by non-native English speakers and even languages with names like Pascal (which is still being taught in Vietnamese high schools, I hear).

Just as English had become the lingua franca of global commerce, it was also the language of computing. It was necessary to pick up English in order to read Kernighan & Ritchie's "The C Programming Language". But over time, such books and tutorials were translated into many languages, and programmers could program whilst knowing only a handful of English words, namely the keywords of the programming language: if, then, else, while, goto, etc.

Programmers started mixing their native tongue into function and variable names, transliterating their language into the standard set of English letters. And then some languages started supporting non-ASCII identifiers, so you could use Greek letters in function names, or Chinese characters for variables. And now programmers can write not only code that's hard to understand, but also code which needs to be translated, and code which most of the rest of the world can't edit because they can't type characters in all those languages. So just because a language supports exotic character sets doesn't mean we should use them in practice.

But even if we stick to English for functions, variables, and comments, there's one place where foreign language and Unicode can still appear: literal strings. This is "data" which shows up in your code / logic. In most languages, you write some letters between a pair of quotes (either single or double quotes) and you have a literal string. You can set variables to these strings or pass them into functions. And while your code should always be in English, your data may be in any language.

One solution is to separate your code and data. Keep your data in data files or in databases, where it's also easy for non-developers to manage and edit (without accidentally breaking the code).

But when it comes to code I see in the wild, I see this: function names, variables, and comments in Vietnamese. Vietnamese is a language that uses a Latin-based script, with additional characters, or characters combined with diacritic marks (like accents). ASCII doesn't support it, so there have been a number of encodings over time (VISCII, an 8-bit extension of 7-bit ASCII, and others, and then Unicode). But unless all Vietnamese characters are transliterated down to plain English letters, the file is no longer an ASCII document. It might be UTF-8 encoded Unicode that is mostly ASCII.

The BOM. Now that we're dealing with Unicode, there may be a hidden BOM you must worry about. The BOM, which stands for Byte Order Mark, is there for a reason, but it can break your website. A BOM at the start of a PHP file sits before the opening php tag, so PHP treats it as output, and the HTTP headers get sent prematurely. Then later your normal code tries to set headers and you get the warning "Cannot modify header information - headers already sent". The same message usually continues with "output started at" followed by a file and line number, which points at the culprit. The same error can be caused by unintentional whitespace in the PHP file before the opening php tag, or by whitespace after a closing php tag (which is why it's recommended to omit that optional closing tag at the end of PHP files), but the BOM is invisible in most text editors when you open the file, which makes it especially frustrating.
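A rough way to catch all of these "something before the opening tag" cases at once is to check that a file really begins with the five bytes of <?php. This is only a sketch: the function name starts_with_php_tag is made up for illustration, it assumes head -c and od are available, and it assumes your files are pure PHP (a template that legitimately starts with HTML, or a script with a shebang line, would fail the check).

```shell
# Succeed (exit status 0) only if the file begins with the bytes "<?php".
# A BOM, a blank line, or a stray space before the tag all make this fail.
starts_with_php_tag() {
    # 3c 3f 70 68 70 is "<?php" in hex.
    [ "$(head -c 5 "$1" | od -An -tx1 | tr -d ' \n')" = "3c3f706870" ]
}
```

For example, starts_with_php_tag index.php || echo "index.php: output before the opening tag" would flag a suspect file (index.php is just a placeholder name).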

The BOM is added to files by certain text editors, again invisibly to the user. It also ends up in the output sent to the browser, where it can affect how browsers like Chrome parse the HTML.

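What does a BOM look like? Using a hex dump tool (od, hexdump) you can see it in the first few bytes of the file:

  • U+FEFF is the Unicode code point of the BOM
  • EF BB BF are the three bytes that encode it in UTF-8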
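
To see those bytes for yourself, here is a small sketch that builds two throwaway files, one with a BOM and one without, and dumps the first three bytes of each. The helper name first_three_bytes and the file names are invented for the example; it assumes a shell with printf, head -c, od and mktemp.

```shell
# Print the first three bytes of a file as hex, separated by single spaces.
first_three_bytes() {
    # Leaving the substitution unquoted lets word splitting trim od's padding.
    set -- $(head -c 3 "$1" | od -An -tx1)
    echo "$*"
}

# Throwaway samples: octal 357 273 277 is the UTF-8 BOM (EF BB BF).
sample_dir=$(mktemp -d)
printf '\357\273\277<?php echo "hi";\n' > "$sample_dir/with_bom.php"
printf '<?php echo "hi";\n' > "$sample_dir/clean.php"

first_three_bytes "$sample_dir/with_bom.php"   # prints: ef bb bf
first_three_bytes "$sample_dir/clean.php"      # prints: 3c 3f 70  (that is "<?p")
```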

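Search for how to configure your text editor so that it saves UTF-8 without a BOM and stops mangling your PHP files in this way. Here's one way to find files which are infected by the BOM: find . -type f -exec echo {} \; -exec hexdump {} \; | grep -B1 '^0000000 bbef' (hexdump prints 16-bit words in the machine's byte order, so on little-endian hardware the bytes EF BB show up as bbef; -B1 also prints the line before each match, which is the file name).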
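
If parsing hexdump's layout feels fragile, an alternative sketch is to compare the first three bytes of each file directly. The function names has_bom and list_bom_files are made up, and limiting the search to *.php files is my assumption, not something the one-liner above does.

```shell
# Succeed (exit status 0) only if the file's first three bytes are EF BB BF.
has_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Print every *.php file under the given directory that starts with a BOM.
# (File names containing newlines are not handled by this simple loop.)
list_bom_files() {
    find "$1" -type f -name '*.php' | while IFS= read -r file; do
        if has_bom "$file"; then
            printf '%s\n' "$file"
        fi
    done
}
```

Running list_bom_files . from the project root prints one infected file per line, and prints nothing when the tree is clean.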

And this should fix a file from the command line (with GNU sed; replace infected_file.php with the real file name): sed -i '1 s/^\xef\xbb\xbf//' infected_file.php
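
Where GNU sed isn't available, the same repair can be sketched with head and tail. The function name strip_bom is invented for the example; it only touches files that actually start with EF BB BF and writes the result back through a temporary file.

```shell
# Remove a leading UTF-8 BOM from a file in place; other files are left alone.
strip_bom() {
    if [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
        tmp=$(mktemp) || return 1
        # tail -c +4 copies everything from the fourth byte onward.
        # Writing back with cat keeps the original file's permissions.
        tail -c +4 "$1" > "$tmp" && cat "$tmp" > "$1"
        status=$?
        rm -f "$tmp"
        return $status
    fi
}
```

Try it on a copy first, and check the result with a hex dump before trusting it on a whole tree.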

Another option is to recompile and reconfigure your server's PHP to support UTF-8 encoded files with a BOM using zend.multibyte. But your code will still break on every other server, so this is not a good or recommended solution.

There's another kind of reason why we should code in English. Code is meant to be written by humans and understood by computers, but then read and rewritten by humans again. One doesn't simply write code once, never to be read again. Code should be reusable and reused. It should be shared on GitHub, to save others from wasting effort and time, to inspire new code and new uses for existing code, and to improve the code's quality through collective bug fixing.

But none of that can happen when most programmers in the world can't understand your code, and usually they can't understand it for one of a few reasons. The code is squirrely, deeply nested and hard to follow, going against convention and common patterns and not following any mainstream code style guide (as a basis, read man 9 style from the terminal). Or the function and variable names are single letters or abbreviations, or in a language which means nothing to the reader. Or the comments are missing, or they are in a foreign language and might as well not exist.

Software engineering isn't just about humans commanding machines through code. It is about managing the development of a system of code that must go back and forth between machines and humans, and not always the same humans. Code needs to talk to machines and humans simultaneously. But code tends to be hard for humans to understand, and it takes effort to counter that natural tendency.

PHP is a laggard when it comes to language features. OOP was bolted on quite late in the language's life. Unicode and language support come from bolted-on extra functions for handling what are known as multibyte strings. And so today PHP still stands apart from other modern languages. Professional PHP programming is partly just knowing the best practices for working around PHP's deficiencies. And one of these practices: don't put non-English characters in any PHP files.