[COE-1] Customer Apps Unavailable between June 9-10, 2023

June 10th, 2023

On June 9, 2023, at 18:04 (GMT+1), a customer reported that he could not access his launched site and was instead seeing a blank page with a “Forbidden” error message.

After digging deeper, we found that all sites launched with Webstudio were unavailable in both production and development environments.

We contacted AWS Premium Support for assistance and together traced the issue to our IPFS gateway provider, Moralis. Moralis customer support confirmed that they had published a change on June 9 restricting certain URL queries against their IPFS gateway in response to scam complaints and reports. This restriction affected the way we queried our users’ apps stored on IPFS in order to serve them. Moralis issued no communication to notify users of the change.

The issue was resolved by changing the IPFS gateway provider on June 10, 2023, at 7:05. After testing, we notified users via Twitter at 7:44.

Architecture overview

To better understand the services and modules involved, here is a high-level architecture diagram of how Webstudio serves its customer apps.

Customer Apps on IPFS resolution from AWS Cloud Architecture

In a nutshell, all applications built by users on Webstudio are packaged and published on the InterPlanetary File System (IPFS), a decentralized file storage service.

There are 3 ways in which users can access these published websites:

Via their IPFS CID, meaning they are read directly from IPFS hosting.
Via a Webstudio subdomain, e.g. myproject.webstudio.so, where a CloudFront Lambda@Edge function looks up the mapping for the project ID in a DynamoDB table (e.g. myproject → IPFS URL) and returns it to be rendered.
Via a custom domain. Similar to the above, a mapping is stored in the Webstudio DynamoDB table and the IPFS URL is served.
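
As a rough illustration of the subdomain and custom-domain paths above, the lookup can be sketched as follows. This is a minimal sketch, not the actual Lambda@Edge code: the MAPPINGS dictionary stands in for the DynamoDB table, the function name resolve and the placeholder CIDs are hypothetical, and the gateway base URL is assumed from the Moralis URLs quoted later in this report.

```python
from typing import Optional

# Hypothetical stand-in for the DynamoDB mapping table (host -> IPFS CID).
MAPPINGS = {
    "myproject.webstudio.so": "<CID-of-myproject>",
    "www.example.com": "<CID-of-custom-domain-project>",
}

# Assumed gateway base URL, modeled on the Moralis URLs quoted in this report.
GATEWAY_BASE = "https://moralis.io:2053/ipfs"


def resolve(host: str, path: str = "/") -> Optional[str]:
    """Return the IPFS URL for a request, or None when the host is unknown."""
    cid = MAPPINGS.get(host.lower())
    if cid is None:
        return None
    # The edge function spells out the file name, so "/" becomes "/index.html".
    if path.endswith("/"):
        path += "index.html"
    return f"{GATEWAY_BASE}/{cid}{path}"


print(resolve("myproject.webstudio.so"))
```

Note that the landing page always resolves to a URL ending in /index.html, which is the detail that matters for the rest of this report.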

How was the incident detected?

A user approached us via Discord on June 9, 2023, at 18:04 (GMT+1), complaining that his site was unavailable.

What was the impact?

181 Webstudio customer applications were unavailable from June 9 at around 18:00 until June 10 at 07:05 (13h 5min of downtime).
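
For reference, the downtime figure can be checked with a short calculation. The timestamps are the ones stated in this report (GMT+1), and the start time is approximate.

```python
from datetime import datetime

# Times taken from this report (GMT+1); the start is approximate.
start = datetime(2023, 6, 9, 18, 0)
end = datetime(2023, 6, 10, 7, 5)

downtime = end - start
hours, remainder = divmod(int(downtime.total_seconds()), 3600)
minutes = remainder // 60
print(f"{hours}h {minutes}min")  # 13h 5min
```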

What discovery or investigation was done?

Once we received the report, we checked the AWS logs for that specific customer and our metrics dashboard.

Our dashboards were clean, and neither CloudFront nor the Lambdas were returning errors for any of the requests; they were returning the correct mapping and redirection instructions for every application.

We did try to reach the applications manually, directly on the Moralis IPFS gateway, but never included the /index.html that CloudFront adds. Since the bare CID request worked, we ruled out Moralis as the root cause during our early discovery phase.

Timeline

Friday, June 9, 2023 (GMT +1)

18:00 (approx.): Moralis publishes a change to their IPFS gateway service restricting certain URLs from being queried, in response to phishing complaints.

18:04: A user reported via Discord that his app was unavailable.

18:10: The tech team identified that the issue was widespread across all apps on Webstudio, in both production and development environments. It occurred when querying through a Webstudio subdomain or custom domain, e.g. https://unityprefab.webstudio.so. However, querying the source directly through its CID on IPFS for the same app returned a successful response.

18:20: No error logs were found on the API, Lambda@Edge, or CloudFront serving the front end to users. Requests returned correct, expected results.

18:30: Consulting the AWS Health Dashboard, we found an issue opened earlier that day (9:00) flagging an app built with Webstudio as a scam and requesting that it be removed within 48 hours.

18:40: The tech team cut a ticket to AWS Premium Support (basic plan, 24h SLA) for help and guidance on the matter.

20:30: The tech team attempted to escalate the issue, as no response had been received from AWS Premium Support.

22:00: Upgraded to the developer tier on AWS Premium Support as a means of escalating the issue. The tech team fell asleep out of exhaustion while waiting for an email confirmation.

Saturday, June 10, 2023 (GMT +1)

5:37: Upgraded to the business tier on AWS Premium Support as a means of escalating the issue, and gained access to chat support.

5:45: Engaged with chat support on AWS.

6:00: Together, the tech team and AWS support identified that the IPFS provider was returning a 403 Forbidden error for URLs ending in *.html.

6:15: The tech team confirmed that the issue was only happening with the Moralis IPFS gateway and that other gateways, e.g. Cloudflare and ipfs.io, were functioning properly.

6:30: Engaged with the Moralis support team, who confirmed they had published changes to their IPFS gateway blocking certain URLs based on phishing complaints and reports they had received. They did not notify us at any point of this change.

6:45: Tested a change to the Lambda@Edge function so that the redirection uses ipfs.io instead of the Moralis IPFS gateway.

7:00: Changes were implemented and rolled out to development.

7:04: Changes were rolled out to production.

7:05: Access to applications built with Webstudio was confirmed restored.

5-whys and root cause

Why was there an outage?

Moralis changed their IPFS gateway API by imposing controls over specific URL patterns in order to curb phishing sites, and we were not aware of this.

Why were we not aware of Moralis’ changes?

Moralis did not notify customers about their plans for this change, and we did not have the monitoring or alarm mechanisms to be alerted to site unavailability outside the scope of our AWS infrastructure.

Why did querying the same CID directly via Moralis return correct results, while the redirection done by CloudFront did not work?

In order to serve pages and assets alike (HTML, JS, CSS, JPEG, etc.), CloudFront explicitly includes the file extension in each URL, so when visiting the landing page of a project, CloudFront would redirect to:

https://moralis.io:2053/ipfs/<CID>/index.html

Whereas visiting the same landing page from the Studio would redirect the viewer to:

https://moralis.io:2053/ipfs/<CID>

Since this last URL does not include an HTML file extension, the Moralis IPFS gateway did not block it.
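
The difference between the two URLs can be illustrated with a small predicate. This is only a sketch of the kind of rule that would produce the behavior we observed (403 for URLs ending in *.html); the function name blocked_by_html_rule is hypothetical, and the actual Moralis rule was not shared with us in detail.

```python
from urllib.parse import urlparse


def blocked_by_html_rule(url: str) -> bool:
    """Assumed rule: reject any request whose path ends in .html."""
    return urlparse(url).path.lower().endswith(".html")


# The two URL shapes described above, with the CID left as a placeholder.
via_cloudfront = "https://moralis.io:2053/ipfs/<CID>/index.html"
via_studio = "https://moralis.io:2053/ipfs/<CID>"

print(blocked_by_html_rule(via_cloudfront))  # True
print(blocked_by_html_rule(via_studio))      # False
```

Under such a rule, a manual check against the bare CID passes while every page served through CloudFront fails, which matches what we saw during the investigation.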

Why did we not have monitors or alarms tracking this unavailability, and instead got notified by a user?

Our systems monitor and control our internal API and AWS infrastructure. We do not monitor third-party providers’ availability at this time, since we have implicit trust in their communication policies and service maintenance.

Why did it take 13 hours to fix the issue?

Earlier in the day (Friday, 9:00) we received a report from the Amazon ec2-abuse team indicating that one of our customers had published an app regarded as phishing and requiring us to take action, without specifying a timeframe but with an urgent tone. This misdirected our attention, making us think the outage was an arbitrary restriction imposed by Amazon because of the complaint. In order to communicate with the Amazon team we had to upgrade through several tiers of AWS Premium Support, which took over 10 hours. Since we could not see any errors on our Lambdas or on CloudFront, we assumed we needed AWS support to figure it out.

It was not until Saturday at 6:00 that we upgraded our AWS Premium Support subscription to a tier with immediate feedback. From then on, it was just a matter of minutes until they confirmed that everything was right from Amazon’s point of view and that the problem was with the specific URL.

Testing and pushing the corrective measures took 20-30 minutes.

How was the issue resolved?

Having identified the Moralis IPFS gateway as the source of the issue, we switched to the ipfs.io gateway for serving our customers’ app data.

This change was made in our Lambda@Edge function on CloudFront. You can view the commit here.
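
The shape of the change can be sketched as below. This is an illustration under stated assumptions, not the linked commit: the names GATEWAYS, ACTIVE_GATEWAY and app_url are hypothetical, and the Moralis base URL is taken from the URLs quoted earlier in this report. The gateway is kept as a constant in code because Lambda@Edge functions cannot use environment variables.

```python
# Gateway base URLs keyed by a short label (labels are hypothetical).
GATEWAYS = {
    "moralis": "https://moralis.io:2053/ipfs",
    "ipfs.io": "https://ipfs.io/ipfs",
}

# Switching provider becomes a one-line change.
ACTIVE_GATEWAY = "ipfs.io"


def app_url(cid: str, path: str = "/index.html", gateway: str = ACTIVE_GATEWAY) -> str:
    """Build the URL that the edge function redirects to for a given CID."""
    base = GATEWAYS[gateway].rstrip("/")
    return f"{base}/{cid}/{path.lstrip('/')}"


print(app_url("<CID>"))
```

Keeping the gateway in a single place means a future provider switch does not require touching the redirect logic itself.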

Were there existing backlog items for this issue? Was this a known failure mode?

No previous action item or known failure.

Overall learnings and recommendations

What went well?

Once the root cause was identified, the fix was implemented within 20 minutes.
Once the issue was escalated, AWS Premium Support was very helpful in identifying the root cause.
Moralis customer support was outstanding: they replied early on a Saturday morning and helped confirm the issue.
We are introducing a CoE (correction of errors) mechanism to prevent issues like this from happening again in the future.

What went wrong?

Moralis, a third-party provider, published a breaking change on a Friday evening without notifying customers.
A phishing abuse notification from Amazon earlier in the day diverted attention away from the real root cause.
It took us too long to get confirmation from Amazon support, and it cost us over $100 for a one-day upgrade of AWS Premium Support.
We had no mechanism for monitoring third-party providers’ health, nor a direct hotline to verify critical issues. We have several critical dependencies on third-party providers that, in an outage, can cause complete service unavailability.
We got notified by a user experiencing the problem, making the handling of this issue reactive.
The team was exhausted after a long week and could not troubleshoot properly.
At least one customer had an important event on the day of the downtime and required maximum availability.
Although IPFS is a decentralized service, it still relies on centralized gateway providers.

Recommendations

Make sure you have a short response-time SLA with your providers for support.
Make sure your providers notify you in advance of all changes that may have an impact on customers, and that you diligently read through those notifications.
Monitor the health of integrations between your services and your third-party providers.
Don’t make prod changes on a Friday if possible; this goes for providers as well.
Implement high-severity alarms and monitoring mechanisms that can notify you of an incident before your customers do.
Implement a banner or a notification system so users are aware of the inconvenience and are kept updated on the progress.
Don’t overdo it day after day; exhaustion makes you dumb when you most need to be sharp.
Be transparent with your users and don’t hide facts.
If possible, have backup integrations with alternative data providers that can kick in when there is downtime on their side.
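
The last recommendation could look roughly like the sketch below, which tries gateways in order and returns the first one that serves the file. The function name first_healthy_url and the gateway order are assumptions, and the HTTP call is injected as get_status so the logic can be exercised without network access.

```python
from typing import Callable, Iterable, Optional

# Gateways named in this report; the order is an assumption.
GATEWAYS = (
    "https://moralis.io:2053/ipfs",
    "https://ipfs.io/ipfs",
)


def first_healthy_url(
    cid: str,
    path: str,
    get_status: Callable[[str], int],
    gateways: Iterable[str] = GATEWAYS,
) -> Optional[str]:
    """Return the first gateway URL that answers 200, or None if all fail."""
    for base in gateways:
        url = f"{base}/{cid}/{path.lstrip('/')}"
        try:
            if get_status(url) == 200:
                return url
        except OSError:
            # Treat network failures like a bad status and try the next one.
            continue
    return None


def fake_status(url: str) -> int:
    """Simulates the incident: the first gateway answers 403, the rest 200."""
    return 403 if url.startswith("https://moralis.io") else 200


print(first_healthy_url("<CID>", "/index.html", fake_status))
```

Checking gateways on every request would add latency at the edge, so in practice the result would likely be cached or computed by a background job; that trade-off is outside the scope of this sketch.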

What are the actionable tasks and follow-ups?

Webstudio

Add metrics for third-party providers’ health and integration health (Issue-30457393).
Add a high-severity alarm on internal and third-party downtime (Issue-30457445).
Create a runbook for handling third-party outages, including process, points of contact and steps (Issue-30457487).
Implement a banner on the website to notify users while we are experiencing technical difficulties, and keep a log so they can track progress (Issue-30457511).
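
One possible starting point for the first two items is a synthetic probe that requests the exact URLs end users are redirected to, including /index.html, since the bare-CID check is what misled us during this incident. This is a minimal sketch: the names http_status and probe and the alarm strings are hypothetical, and wiring the result into an alerting system is left out.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for a GET request, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is what we want.
        return err.code


def probe(urls: Iterable[str], get_status: Callable[[str], int] = http_status) -> Dict[str, str]:
    """Classify each probed URL as OK or as an alarm with its reason."""
    results: Dict[str, str] = {}
    for url in urls:
        try:
            status = get_status(url)
        except OSError:
            results[url] = "ALARM: unreachable"
            continue
        results[url] = "OK" if status == 200 else f"ALARM: HTTP {status}"
    return results


# Offline demonstration with a fake status function that mimics the incident.
demo = probe(
    ["https://moralis.io:2053/ipfs/<CID>/index.html", "https://moralis.io:2053/ipfs/<CID>"],
    get_status=lambda url: 403 if url.endswith(".html") else 200,
)
print(demo)
```

Run on a schedule against a known test project, a probe like this would have raised an alarm within minutes of the gateway change rather than relying on a user report.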

Moralis

For every non-backward-compatible release to production, including changes beyond APIs, document and analyze the possible downstream impact and send an email communication at least 24h in advance of potential issues, asking implementors to take action. Offer exclusive dedicated support during these transitions.
