Estuary.tech Post Incident Report 2022-11-02

Estuary users and community,

On 2022-11-02, Estuary suffered a total outage of our API node during routine maintenance, resulting in Estuary being unavailable for approximately 11 hours. This Post Incident Report (also known as a postmortem) documents the incident and our response to it, and details our plans for reducing or eliminating this kind of outage in the future.
Timeline of events

At 2022-11-02 12:40 UTC, we announced that we would be doing a rolling upgrade and reboot of Estuary to patch a critical issue in the OpenSSL library. This would take the form of patching and rebooting the API node, then the upload proxy, then each shuttle in turn until all nodes had been patched. We decided to also include some other minor and routine OS patches (apt upgrades) in the process to reduce interruption later. We estimated about 5 minutes of total downtime, with rolling interruptions lasting 15 minutes or so after that as we rebooted the shuttles.
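
For context, the per-node patching step was conceptually nothing more than the following (shown for illustration only; this is not a transcript of the actual session):

    apt update
    apt upgrade    # pulls in the patched OpenSSL packages along with the routine updates
    reboot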

At 2022-11-02 13:44 UTC we rebooted the API node after applying the patches to it, and it did not come back up as expected. At 13:52, we posted an update: "The API node has not come up as expected after the reboot, we're actively working on it and will update soon." in the #ecosystem-dev channel on Filecoin Slack, as well as on Twitter at roughly the same time.

We began pulling in our internal resources to work out next steps for returning the machine to service and to identify potential troubleshooting avenues. We also contacted our infrastructure provider for that node (in this case, Equinix Metal) for help debugging. We started forming a plan to use the Alpine-based rescue image to access the node's operating system via chroot, but held off on executing that plan until we heard back from Equinix.
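
For those curious, the general shape of that plan looks roughly like this (device names here are placeholders, not the node's actual layout):

    # from the Alpine-based rescue environment
    mount /dev/sda2 /mnt                 # the installed OS's root filesystem (placeholder device)
    mount --bind /dev /mnt/dev
    mount --bind /proc /mnt/proc
    mount --bind /sys /mnt/sys
    chroot /mnt /bin/bash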

At 2022-11-02 14:38 UTC we were sent a screenshot of the machine's video console, which gave us some clues as to what was wrong with it. Most of the screenshot consisted of error messages relating to cloud-init, one of the packages we had upgraded during the maintenance. Believing that was what was blocking the boot process, we began investigating that avenue.

At 2022-11-02 14:45 UTC we rebooted the node into the Alpine rescue shell and began debugging the issues with cloud.cfg / cloud-init. The first thing we tried was installing the "new" cloud.cfg (which we had declined to install during the earlier apt upgrade), after safely backing up the current one. That did not help, so we reverted to the current/older configuration. Digging deeper, we eventually surfaced the following error message:

Cloud config schema errors: disable_root: 0 is not of type 'boolean', ssh_pwauth: 0 is not valid under any of the given schemas

We fixed this by changing the values of "disable_root" and "ssh_pwauth" from "0" to "false", after which the cloud-init stage of the boot process reported success.
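
For clarity, the change amounted to the following (only the relevant keys of /etc/cloud/cloud.cfg are shown):

    # before - rejected by the newer cloud-init schema validation
    disable_root: 0
    ssh_pwauth: 0
    # after - accepted
    disable_root: false
    ssh_pwauth: false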

At 2022-11-02 16:10 UTC, thinking we may have resolved the boot issues, we issued another reboot command and attempted to boot back into Ubuntu. Once again, however, the API node refused to boot. Worse, the machine would no longer boot into the rescue OS after this point, returning the message "Device is already booting into rescue OS". After a few minutes spent confirming it would be safe to do so - that powering the machine off entirely wouldn't cause unintended effects that would slow down or disrupt the recovery effort further - we powered it off and then attempted to boot into the rescue OS again to continue debugging.

Over the next 40 minutes or so we continued debugging the system and checking log files to determine why it was not booting. In particular, we verified that the filesystem table (/etc/fstab) was correct and that the boot process should complete cleanly.

We then checked the screenshot again, and noticed the following error message at the bottom:

"Failed to start default target: Transaction for graphical.target/start is destructive (emergency.target has 'start' job queued, but 'stop' is included in transaction)."

Searching for that error message surprisingly turned up a blog post by one of our own infrastructure staff (Wings), which helpfully explained that it is most likely caused by a failed filesystem mount and can be temporarily worked around by adding the nofail option (-o nofail) to any suspect mount points.
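
As an illustration, that workaround looks something like this in /etc/fstab (the device path, mount point and filesystem type here are placeholders, not our actual configuration):

    # a mount marked nofail no longer blocks the boot process if the device is absent
    /dev/mapper/data-main   /data   ext4   defaults,nofail   0   2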

At 2022-11-02 17:01 UTC, after nearly 5 hours of troubleshooting, we were able to get the system to boot; however, the main data store - an LVM volume - was not mounting correctly. We quickly worked out that it wasn't seeing its physical volumes (PVs) correctly, and further investigation showed 4 drives missing from the device tree. This indicated a hardware failure, but possibly one which could be corrected without data loss: 4 drives are unlikely to fail at once, so a failure of that kind more likely points to a bad cable, a bad SAS controller, or similar. We started on a plan to restore a new API node from our backups in case we were unable to recover the current API node, and communicated this at 2022-11-02 17:46 UTC on both Twitter and #ecosystem-dev, with the following message:

"Estuary users and community - We believe we've suffered a hardware failure. We've managed to get the Estuary API server to start by temporarily disabling the main data volume on it, and it appears we have 4 hard drives missing which was causing the issues we have been seeing during the maintenance. Our belief is this is a hardware failure within the physical machine; possibly a bad cable or controller, and that at this point we haven't lost data but that it's currently inaccessible -- our team believes it's very unlikely to have 4 hard drives fail at the same time.
We have backups of the data in question and are preparing to restore from backups while we wait for one of our providers to inspect the server in question. Hopefully they are able to resolve issue of the missing drives quickly and help us return api.estuary.tech to service -- if that plan fails, we will cut over to the backups and resume service. In that event, we may lose approximately 72 hours worth of data."

We proceeded with further debugging on the API node, and eventually got the 4 missing hard drives to become visible on the system again - however, LVM still refused to mount them. We're not yet entirely sure what caused this next issue, but the UUIDs that LVM uses to identify its PVs and access data were wrong or somehow inaccessible on 4 of the drives, despite the drives having "come back".
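
For those interested, diagnosing this kind of failure typically involves commands along these lines (illustrative only; not a transcript of our session, and the device name is a placeholder):

    lsblk             # list the block devices the kernel can currently see
    pvs               # list LVM physical volumes and their state - missing PVs are flagged here
    vgs               # volume group summary
    pvscan            # rescan all devices for LVM physical volume labels
    blkid /dev/sdX    # show the UUIDs a given device reports (sdX is a placeholder)
    dmesg             # check the kernel log for controller or link errors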

Over the next few hours we kept communication going as best we could with our users, explaining the situation as needed.

At 2022-11-02 23:28 UTC we posted an update:

"We have a plan in place for migration to a new API node, and are preparing the new node now so that we may restore it from backups and resume service. We don't have a firm timeline yet but are working on it and will keep you updated as we progress through the recovery. We'll share a full retrospective of the outage once we've recovered and analysed what happened."

Minutes later, Alvin was able to reactivate and mount the current API node's missing data volumes by issuing the command "vgchange -a y", and no data appeared to be missing. We examined the state of the data before cautiously resuming service, and by 2022-11-02 23:58 UTC we were fully back online, posting the following update:

"We've restored service using the old API node, and there doesn't seem to be any missing data at this stage. We're still planning a migration to the new API node, with a significantly more resilient setup, as well as upgrades to allow us to run multiple highly available API nodes in the future."

An additional update was posted approximately an hour later, closing out the issue until this PIR could be published:

"Hi Everyone, @outercore-eng
Estuary API node is now up and running. You can now access the API endpoint via https://api.estuary.tech. We were able to re-activate the volumes that had all the data, and there was no data loss during the downtime.
We still plan to migrate each node in the Estuary deployment to newly rebuilt servers with better resiliency and data safety.
We are fully transparent in what we do at Outercore and we will issue a public retrospective on the events of this and our action items!
Thank you!"

The outage lasted for approximately 11 hours, and affected most aspects of operation of Estuary. No data loss was experienced upon the conclusion of the incident.

Lessons learned

Upgrade risk

We took a small additional risk during the upgrade procedure by upgrading more than just the required OpenSSL packages. This was a considered risk which factored in the routine nature of the upgrades and their security importance. Initially we thought this "additional" upgrade could have caused the outage. Further analysis showed that the node's failure to mount its LVM volume was what blocked the boot, and that the upgrade procedure could not have been the cause of the boot issues - in fact, cloud-init failing to initialize properly wouldn't have affected the boot process much.

Nevertheless, this incident will inform our future actions with regards to the risks of upgrading multiple components, even via something as routine as apt upgrade.

Rescue response time

When the node failed to come back after the reboot, we initially waited for the vendor to respond before beginning real debugging on the system. In the future we will likely move to the rescue image options sooner.

LVM redundancy

This event highlighted a configuration issue with the LVM volumes on our nodes - they were configured with no redundancy, spreading their data across 8 or more drives. We do not believe this incident was caused by this configuration issue, but it does mean that a hypothetical failure of any single drive would cause the loss of the whole volume. Suffice it to say, this is a critical issue we're taking seriously, and we are now in the process of migrating to a ZFS-based setup with 2-disk redundancy (commonly referred to as "raidz2") across all nodes, at the cost of some shuttle and API node storage capacity.
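
As a rough sketch of what that migration looks like, a raidz2 pool over 8 drives can tolerate the loss of any 2 of them (the pool name, device names and options below are placeholders, not our production layout):

    zpool create -o ashift=12 estuary-data raidz2 \
        /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh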

API single point of failure

We have been planning various infrastructure upgrades for the API layer of Estuary, including moving database operations to a dedicated, highly available PostgreSQL cluster, and rearchitecting the API layer to run across 3 independent nodes behind a highly available load balancer cluster that distributes load between them and responds to failures automatically. These infrastructure upgrades will go a long way towards reducing or eliminating the kinds of failures we saw on November 2nd 2022, as well as reducing the impact of upgrades to individual API nodes.
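
As a minimal sketch of the load balancing piece of that plan (haproxy-style syntax; the node addresses, port and health-check endpoint are assumptions for illustration, not our final design):

    backend estuary_api
        balance roundrobin
        option httpchk GET /health       # assumed health-check endpoint
        server api1 10.0.0.1:3004 check  # placeholder addresses and port
        server api2 10.0.0.2:3004 check
        server api3 10.0.0.3:3004 check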

This incident further highlighted the need for those infrastructure upgrades, which are proceeding through staging testing right now.

Proper Maintenance Window

We should introduce a maintenance window for any upgrades, along with a proper review process. We usually upgrade or patch the OS during the weekend to minimize risk and allow more time to investigate if there are issues. Suggesting (and then proceeding) to apply this patch during a weekday, when anyone could potentially be demonstrating Estuary, is something we shouldn't do.

NetOps Involvement

Consulting with the NetOperations team first would be a good step before updating or patching the OS. They are the only team who can communicate with Equinix, so we need to share our plans with them and get their input before deciding whether to patch or hold off.

What we're doing about it

The outage described above is believed to have been caused by a transient hardware failure. In the short term, we are preparing to migrate away from the node in question, as well as keeping frequent backups in case of further failure.

Hardware failures can happen no matter which provider or setup you use. We use reliable, dedicated, enterprise-grade hardware which is resilient to most hardware issues (redundant power supplies, redundant OS storage, and more), but failures will still occur from time to time. In the medium term (as soon as our new architecture is proven safe and ready), we will be migrating to a fully redundant setup with no single point of failure for the API and database layers of Estuary. We'll be sharing technical details, diagrams and architecture over the coming weeks detailing this plan.

We are also working on improving our backup systems, to provide better, more frequent backups, as well as reduce the time to recovery in the event of a major failure.

Longer term, we have many steps planned for evolving the architecture of Estuary, taking advantage of our relationship with the Filecoin network to provide a safe way of offloading and retrieving data at scale (both routinely and as a disaster recovery option). We have plans for scaling our shuttles and turning them into "stateless shuttles", building a globally redundant and distributed object storage system for live shuttle data, as well as continuing to build out and improve our tight integration with Filecoin with tools like Autoretrieve.

We are committed to publishing post incident reports after any interruption to Estuary that is significant enough for users to notice, and to full transparency about the state of our infrastructure, our plans for its future, and the events that transpired during incidents.

You are welcome to come chat with us any time on #ecosystem-dev, or ping us on Twitter if you have any questions or suggestions about the infrastructure, Estuary, or any of the tools and systems in our ecosystem.

Thank you for your patience during this incident, and for being a user of Estuary and part of our ecosystem. We truly value every user of Estuary, direct or indirect, whether you use the flagship Estuary.tech instance or host your own, and we are working hard to improve and scale Estuary and ensure it is a stable, reliable service for all who use it. We will continue to learn from incidents like this and share our findings and learnings as they occur.

This was a gruelling outage, and we would not have been able to get past it without the heroic efforts of Alvin Reyes, Benjamin “wings” Arntzen and the NetOperations team at PL.

- Outercore Engineering
