TLDR: Really sorry for the interruption! I'll be taking steps to make sure this doesn't happen again.

Users, we fully restored media storage around 11:25 UTC on February 2nd. If you see old posts that didn't load properly, try refreshing the page. Here is the full breakdown of what happened.

I was notified that our media storage provider would be doing a full day of maintenance on Jan 31st. So on Jan 30th I rented a small backup server and started preparing to migrate our data. Our storage system uses LMDB, which has a quirk: its file only ever grows and never shrinks on its own, even after data is deleted. It had reached hundreds of GB, so I tried to compact it (shrink the file) manually first.

However, the server's RAM was much smaller than the database, and since LMDB memory-maps its file, most page accesses had to go to disk, which crushed performance. The process was much slower than I expected, and then my connection dropped, leaving the job half-finished. With the maintenance window closing in, I had to skip the compaction and start the full migration. The backup storage performed worse than expected, and with so many small files, the copy took several hours to finish. Because I was tied up with personal matters, there was a further delay of a few hours before I could actually bring the backup node online. That ended the first service gap.

A few hours later, the provider's maintenance finished. I realized the actual downtime would have only been about an hour, which made my decision feel pretty foolish. Regardless, I had to start moving everything back. This reverse migration hit a CPU bottleneck because rsync only used a single core, dragging the speed down to under 10 MB/s. I also realized I had to compact the database then and there, otherwise transferring the massive file at that speed would have taken far too long. This added another 10 hours to the recovery time.
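To put that transfer speed in perspective, here is a rough back-of-the-envelope calculation. The sizes are hypothetical (the real database was only described as "hundreds of GB"), and the rate is the 10 MB/s upper bound mentioned above, so treat the output as an illustration rather than the actual figures. At 10 MB/s, 300 GB works out to a bit over 8 hours.

```python
def transfer_hours(size_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_gb at a sustained rate (decimal units: 1 GB = 1000 MB)."""
    seconds = size_gb * 1000 / rate_mb_per_s
    return seconds / 3600


# Hypothetical sizes, chosen only to illustrate the scale of the problem.
for size_gb in (100, 300):
    hours = transfer_hours(size_gb, 10)
    print(f"{size_gb} GB at 10 MB/s: {hours:.1f} h")
```

This is why shrinking the file before sending it mattered: every 100 GB of dead space removed saves close to three hours at that rate.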

To explain my current setup: we use a single-node system. I don't use the recommended three-node setup because of the cost. Three nodes would triple our storage expenses, making it more expensive than professional cloud storage, which I simply can't afford right now. To keep data safe, I take regular metadata snapshots and daily incremental backups. While this isn't high availability, it ensures we can roll back to a previous state if something breaks.

To avoid a repeat, I'm going to find a more efficient way to compact the database regularly, and I'll pick better-suited infrastructure for any future migration so it doesn't hit these kinds of bottlenecks.
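For anyone curious what a regular compaction job could look like, here is a minimal sketch. It is not what I ran during the incident. It assumes the stock `mdb_copy` tool that ships with LMDB is installed, and uses its `-c` option, which writes a copy that omits freed and unused pages. The function names and any paths passed to them are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_compact_command(src_env: str, dst_env: str) -> list[str]:
    """Argument list for a compacting copy; -c makes mdb_copy omit free pages."""
    return ["mdb_copy", "-c", src_env, dst_env]


def compact_copy(src_env: str, dst_env: str) -> None:
    """Write a compacted copy of the LMDB environment at src_env into dst_env."""
    if shutil.which("mdb_copy") is None:
        raise RuntimeError("mdb_copy not found on PATH")
    dst = Path(dst_env)
    # The destination directory has to exist and be empty before the copy.
    dst.mkdir(parents=True, exist_ok=True)
    if any(dst.iterdir()):
        raise RuntimeError(f"{dst} must be empty before copying")
    subprocess.run(build_compact_command(src_env, dst_env), check=True)
```

The compacted copy would then be swapped in for the original while the service is stopped. The copy needs enough free disk space for the live data, and on a machine with much less RAM than the database it will still be slow, so it is better done on a schedule than under deadline pressure.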