#2820

Today at work, a feature I worked on went from fun and cool as I was building it to a stressful fix and recover operation after it unintentionally deleted important site files when deployed.

The feature involved caching a yaml configuration file that had previously been parsed on every page load, automatically re-caching it if the file is updated, and using the cached version if the current version has invalid syntax. This was done as an aside because a site white-screened because of a mistake in the file yesterday. This issue was long standing and I wanted to finally resolve it. Improving performance by caching it was also something we’d wanted for a long time. That came into play because I needed to still have some config if it couldn’t get it from the current file. Quickly coming up with a solution to fix both these issues at once and improve our simple caching service for the future was cool and fun.

Some refactoring was what caused the deletion issue. The cache service was built off a quick and dirty solution I made for a site that does a lot of database queries and was having performance problems. It was copied into our base CMS and improved upon a little to make it more widely applicable. The site it was copied from had the problem of building up over 1GB of these cache files over time, mostly rare queries, so a quick and dirty solution was used to remove old files, basically running something like:

shell_exec("find {$this->path} -type f -mtime +60 -delete");

with a random 1 in 60 chance per request. This probably shouldn’t even be in the CMS version of the cache service, and is irrelevant for the config caching anyway. But it was there.

My refactoring changed the place in the code where $this->path was set, which happened to end up after that shell_exec() call. Thus, an empty string was given for the path to find, causing it to delete every file in the current directory and sub-directories that could be by the web user. For the home page, that current directory would be the web root. This included all images and many files uploaded through our CMS, as well as, for whatever reason, the stylesheets. Very noticeable to a site visitor. This happened with a 1 in 60 chance per request, so it was more likely on more visited sites. I hadn’t noticed it in local testing because I didn’t get the random luck of it happening the few times I loaded the site after making that final tweak to the code.

About 8 sites were affected. I was confused at first when my coworker showed me a site as an example for a feature he was working on, and I could tell no styles were loading. I thought maybe he had built the styles and something had gone wrong, and wondered if my change had somehow messed up the config being loaded. We quickly discovered that the stylesheets were just gone, and so were images. I wondered if we were hacked, and kinda started looking at that, but then thought about checking the cache service. I luckily noticed the issue quickly when re-viewing it, and did a quick fix live on the server before pushing a proper change.

Then came the recovery. My coworker looked up what sites were affected and rebuilt site styles while I tried to figure out how to restore just the files in question from Rackspace’s backups. The restore from backup action in the backups section leads to a file navigator where files and folders can be checked to add to the list that will be restored. It was slow and tedious to go through the wizard and pick all the files, and then took a long time, maybe 2-3 hours to get all the files in place. So some sites may have been missing images for as long as 4 hours. But it worked. I will be able to find it and start it much faster next time.

If you visited some mostly Akron area sites today and they looked messed up and missing things, that may have been my fault. Sorry, my bad. A learning experience for me, I guess.

Thanks to my coworker for helping out with my mistake and making it less stressful.