On Friday the 20th, we experienced a major service disruption. The vast majority of our existing deployments served traffic correctly, but three major capabilities of the system experienced outages:
- Deployment creation
- Auto-scaling of deployment instances
- Core self-hosted services like our chat invite system
The first two caused a major disruption to you and your teams’ workflows and your ability to quickly ship updates and improvements. We apologize for that.
The third impeded realtime communication with our team and made reaching us more difficult than it should have been. In this post we want to outline how we are tackling these challenges.
We have set up a dedicated website on isolated infrastructure to keep you completely up-to-date on the status of our systems. Any disruption that we know about, no matter how minor, will be reported there.
We think detailed updates are critical. We will strive to go beyond mere status flags and tell you directly and concisely what is affected and what our action plan looks like.
To give you the best product that we can, we host many of the services that make up our web experience, including our site, on Now. This gives us the tremendous advantage of experiencing our own product the way our users do, on a daily basis.
Our status site is purposefully hosted on isolated third-party infrastructure to minimize the risk of a communication breakdown.
You can also subscribe to these updates in realtime on Twitter: follow @zeit_status to get notified.
Communicating issues as they arise is essential, because issues in any sufficiently complex system are inevitable. We want to be completely transparent with the customers who trust us to execute and scale their code.
Over the past months we have experienced tremendous growth. We are now seeing five times more deployments on a given day than we were just a few months ago.
With this growth has also come illegitimate traffic and abuse, which has added significant stress to our load balancers and deployment schedulers, introducing many load spikes we were unprepared for.
We currently have teams focused on the following major improvements:
- We are continuously improving our monitoring systems to react to changes before they escalate into major disruptions.
- We have enhanced our DoS protection and abuse detection systems.
- We have improved the performance of multiple deployment pipelines that were underperforming as we scaled.
- We are in the process of deploying substantial scalability improvements that will positively impact all deployments.
Every single one of these initiatives will translate into concrete product improvements that we will tell you about over the coming weeks. Stay up-to-date by following us on Twitter.
We look forward to sharing them with you.