DreamCompute downtime incident 2020

September 3rd, 2020 at 02:10

There was another DreamCompute incident leading to downtime of my site / server. Not nearly as bad as the incident last year, but still, my site was affected for what was likely a few hours. It was intermittently down entirely, and then for a while DNS requests from the server weren’t working, breaking OCSP stapling among other things.

I first noticed it being down around 2130. I was trying to visit my site, probably just to check out visitor stats, but it wouldn’t load. I tried another domain, then tried to SSH, but no luck.

I checked DreamHost Status, which said some maintenance was going on. I looked on Twitter and saw a post saying maintenance was taking longer than expected. I hadn’t known there was going to be maintenance in the first place, but since this was similar to last year’s incident, I figured the maintenance was the cause of my problems.

I checked the server in the DreamCompute admin panel, but it didn’t show anything wrong. I tried to connect both to the web server and via SSH off and on until I finally made it through after 10-20 minutes.

I started checking out the state of the server. The uptime was longer than the incident, so it hadn’t been restarted. Things largely seemed fine. I saw some gaps in the access logs, and though gaps aren’t unusual for my low-traffic site, there was a 40-minute gap around the time of the incident.
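Spotting a gap like that by eye gets tedious, so it could be scripted. Here is a minimal Python sketch, assuming the access log uses the common/combined log format with a bracketed timestamp such as `[02/Sep/2020:21:30:05 -0700]`. The function name `find_gaps` and the 30-minute default threshold are my own illustrative choices, not anything from the server setup.

```python
import re
from datetime import datetime, timedelta

# Matches the bracketed timestamp of the common/combined log format,
# e.g. [02/Sep/2020:21:30:05 -0700]. The log format is an assumption.
TIMESTAMP_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def find_gaps(lines, threshold=timedelta(minutes=30)):
    """Return (previous, current) timestamp pairs that are more than
    `threshold` apart. Lines without a parsable timestamp are skipped."""
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP_RE.search(line)
        if not match:
            continue
        current = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps
```

Passing an open log file to `find_gaps` works because it only iterates over lines. On a low-traffic site the threshold needs to be generous, or every quiet stretch shows up as a gap.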

A few minutes into the investigation, the SSH connection broke, as did web access, and they kept going in and out. I also noticed that I wasn’t able to visit in my main browser even when I could SSH and curl. After looking at the site error logs and attempting some pings and curl requests from the server, I realized that there was a DNS problem and OCSP stapling was failing. My quick and dirty solution was to disable stapling. I was able to connect over HTTPS in other browsers, but my main browser was still failing. It must’ve cached something about the failed requests.
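The manual ping and curl checks could also be rolled into a small script run on the server. This is a rough Python sketch using the standard library’s `socket.getaddrinfo`. The helper names `can_resolve` and `check_hosts` are hypothetical, and the hostnames in the usage comment are placeholders, since the actual OCSP responder host depends on the certificate authority.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves, False on a resolution error.
    `resolver` is injectable so the check can be tested without a network."""
    try:
        resolver(hostname, 443)
    except socket.gaierror:
        return False
    return True


def check_hosts(hostnames, resolver=socket.getaddrinfo):
    """Map each hostname to whether it currently resolves."""
    return {host: can_resolve(host, resolver) for host in hostnames}


# Example usage (placeholder hostnames, not the real ones):
# print(check_hosts(["example.com", "ocsp.example.com"]))
```

If every name comes back `False` while the server is otherwise reachable, that points at DNS resolution rather than at the web server itself.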

During this time, I noticed a tweet acknowledging the incident by saying they were “investigating an issue resulting in degraded performance on specific customer DreamCompute instances.” Not much info, much like last time.

After occasionally testing ping and curl from the server for a while and getting failures to resolve, and still occasionally losing connection to the server, I decided to just give up for the moment, as there was nothing I could do, and go do some other things. When I came back around 0100, it was finally back to normal, other than an occasional hiccup in connection. I re-enabled OCSP stapling, and it was good. Still, that was not really something I wanted to deal with tonight.
