PHP Upgrade Story: No Data Left Behind

Adjacent problems that weigh you down as you tackle some tech challenge are often as important as that specific tech. I have a PHP upgrade story from version 5 to 7 that illustrates this, showing just another way legacy code can cause issues for years. But it's not really about PHP code, as the codebase was mostly compatible to make the jump from 5.4 to 7.x (eventually 7.3). And when I say mostly compatible, this part of the story is really just about going between the two MySQL extensions, from mysql to mysqli. The rest of the code as far as compatibility is concerned was fine.

To provide some context, this is a somewhat large business codebase that, like any legacy code worth its salt, kept the business running but was difficult to understand, maintain and change. It was the usual LAMP style of old web code with too many global variables and too much PHP mixed with HTML. Gradually restructuring that into something where PHP mostly provided services for single-page applications is not what this story is about, but it was a necessary process to pinpoint exactly what needed to be done.

Given that a considerable amount of important database queries were made out of complicated concatenation of strings, the obvious change that seemed easy to implement (after making sure all inputs were being checked for SQL injection) was to simply go from mysql to mysqli and use the correct encoding for the database. However, the real problem was on the other end. For some reason that I was never able to identify, the text that was being saved to this database by mysql was all garbled up, like "João" instead of "João". This is text that is already in UTF8 getting encoded to UTF8. But this would happen no matter what character encoding set was configured in the database. And testing mysqli on the same database with the same data saved the text as normal, which was good news at least.

So the challenge was how to convert years of garbled up records by mysql into readable text that would be consistent with mysqli. To do this, we experimented with writing several MySQL functions that could helps us identify the fields that had this problem and convert their text to readable characters. We also researched for what would be the best Unicode encoding that would preserve not only the many latin characters we had but also several emoji. We settled on utf8mb4 as tests with several converted pieces of text seemed to show that it could hold all the characters we needed.

Therefore, the basic function we used was something like:

FUNCTION utf8mb4Conversion(x LONGTEXT CHARSET latin1)
  RETURNS longtext CHARSET utf8mb4

And to insure that we were not converting text that didn't fit this latin1 to utf8mb4 assumption, we used something like...

FUNCTION fixOldEncoding(x LONGTEXT CHARSET latin1)
  RETURNS longtext CHARSET utf8mb4
  SET y = utf8mb4Conversion(x);
  RETURN IF(latin1mb4Conversion(y)=x, y, x);
END which the latin1Conversion function is just a mirror of utf8mb4Conversion.

So, to recap, we converted every text field in the database to something that humans can read in order to use the mysqli extension, as it works fine while mysql had some weird encoding issue we no longer need to worry about. The main motivation was to unblock the upgrade to PHP, but the immediate benefit was better encoding which allowed us to use emoji characters or easily order queries by their text fields.

I always find interesting how important it is to balance what developers can learn every day from our many resources and communities with the experience we gain from what our particular circumstances demand. In this case, here's a PHP upgrade story that wasn't easy, but it mostly didn't have anything to do with PHP. And this is only part of it, I'll share more in one of my next posts. See you soon.

Thanks for reading! Please subscribe to the RSS feed, maybe follow my Twitter or learn more about me.

More from 🌍; view from the web
All posts