Google Cloud suffers a global outage due to a "server configuration error"
Date: 2019-06-10
An incorrect server configuration change throttled network capacity in multiple regions.
Google has disclosed details about the root cause of Sunday's large-scale outage. The incident affected not only Google's own services, including YouTube, Gmail, Google Search, G Suite, Google Drive, and Google Docs, but also major tech brands that run on Google Cloud.
In a blog post, Benjamin Treynor Sloss, Google's vice president of engineering, explained that the root cause of Sunday's outage (Monday, Beijing time) was a configuration change intended for a small number of servers in a single region that was incorrectly applied to a large number of servers across several neighboring regions.
The error then caused those regions to stop using more than half of their available network capacity.
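Google has not published the specifics of how the change was mis-scoped, so the following is only a hypothetical sketch of the general class of error described here: a rollout job whose region selector matches more regions than intended. The server names, region labels, and the select_targets helper are all invented for illustration and do not reflect Google's actual tooling.

```python
# Hypothetical sketch (not Google's actual tooling): how a change meant
# for one region can land on neighboring regions when the rollout job is
# scoped by an overly broad selector.

SERVERS = {
    "us-east1-a-001": "us-east1",
    "us-east1-b-002": "us-east1",
    "us-east4-a-001": "us-east4",                    # neighboring region
    "northamerica-northeast1-a-001": "northamerica-northeast1",
}

def select_targets(region_selector: str) -> list[str]:
    """Return servers whose region label starts with the selector."""
    return [host for host, region in SERVERS.items()
            if region.startswith(region_selector)]

# Intended scope: only us-east1.
print(select_targets("us-east1"))  # 2 servers, as intended
# A truncated selector also matches us-east4, so the capacity-reducing
# change is pushed to servers in a neighboring region as well.
print(select_targets("us-east"))   # 3 servers, spills into us-east4
```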
The impact was severe for high-bandwidth platforms such as YouTube, but less so for low-bandwidth services such as Google Search, which saw only a brief increase in latency.
"Overall, YouTube measured a 10% drop in global views during the incident, while Google Cloud Storage measured a 30% reduction in traffic," Sloss said.
"Approximately 1% of active Gmail users had problems with their account; while that is a small fraction of users, it still represents millions of users who couldn't receive or send email."
The Google Cloud status dashboard showed that the Google Cloud network experienced congestion in the eastern United States, affecting Google Cloud, G Suite, and YouTube. The disruption lasted about four hours and was resolved at 4 p.m. Pacific time.
Sloss explained that the capacity-constrained regions became congested as inbound and outbound traffic tried to squeeze into the remaining capacity.
"The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam," he noted.
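Sloss's courier analogy describes a standard quality-of-service idea: when capacity shrinks, shed the largest, least latency-sensitive traffic first so that small interactive flows still get through. The sketch below is a toy illustration of that ordering only; the flow names, sizes, and capacity figure are invented and are not taken from Google's network.

```python
# Toy illustration of capacity-driven triage: admit latency-sensitive
# (and then smaller) flows first, shed large bulk flows once the reduced
# capacity is exhausted. All numbers are made up for illustration.

from dataclasses import dataclass

@dataclass
class Flow:
    name: str
    mbps: float
    latency_sensitive: bool

def triage(flows: list[Flow], capacity_mbps: float) -> tuple[list[Flow], list[Flow]]:
    """Split flows into (admitted, dropped) under a capacity budget."""
    admitted, dropped, used = [], [], 0.0
    # Latency-sensitive flows first, then smaller flows before larger ones.
    for flow in sorted(flows, key=lambda f: (not f.latency_sensitive, f.mbps)):
        if used + flow.mbps <= capacity_mbps:
            admitted.append(flow)
            used += flow.mbps
        else:
            dropped.append(flow)
    return admitted, dropped

flows = [
    Flow("search-query", 1, True),
    Flow("gmail-sync", 5, True),
    Flow("youtube-stream", 400, False),
    Flow("gcs-bulk-transfer", 600, False),
]

# With capacity cut by more than half, the big bulk flows are shed first,
# while the small latency-sensitive flows keep flowing.
admitted, dropped = triage(flows, capacity_mbps=300)
print("kept:   ", [f.name for f in admitted])   # search-query, gmail-sync
print("dropped:", [f.name for f in dropped])    # youtube-stream, gcs-bulk-transfer
```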
Although Google's engineers detected the problem "within seconds," resolving it took "far longer" than the targeted few minutes, in part because the network congestion hampered the engineers' ability to restore the correct configuration.
In addition, as a Google employee explained in a Hacker News thread, the outage took down the internal tools that Google engineers normally use to communicate with one another and share updates during an incident.
Sloss's post is not the full post-mortem the company has promised its customers, as the investigation is still under way to identify all the factors that contributed to both the sharp loss of network capacity and the slow recovery.
An update on Sunday’s service disruption
Yesterday, a disruption in Google’s network in parts of the United States caused slow performance and elevated error rates on several Google services, including Google Cloud Platform, YouTube, Gmail, Google Drive and others. Because the disruption reduced regional network capacity, the worldwide user impact varied widely. For most Google users there was little or no visible change to their services—search queries might have been a fraction of a second slower than usual for a few minutes but soon returned to normal, their Gmail continued to operate without a hiccup, and so on. However, for users who rely on services homed in the affected regions, the impact was substantial, particularly for services like YouTube or Google Cloud Storage which use large amounts of network bandwidth to operate.
For everyone who was affected by yesterday’s incident, I apologize. It’s our mission to make Google’s services available to everyone around the world, and when we fall short of that goal—as we did yesterday—we take it very seriously. The rest of this document explains briefly what happened, and what we’re going to do about it.
Incident, Detection and Response
In essence, the root cause of Sunday’s disruption was a configuration change that was intended for a small number of servers in a single region. The configuration was incorrectly applied to a larger number of servers across several neighboring regions, and it caused those regions to stop using more than half of their available network capacity. The network traffic to/from those regions then tried to fit into the remaining network capacity, but it did not. The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam.
Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes. Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage. The Google teams were keenly aware that every minute which passed represented another minute of user impact, and brought on additional help to parallelize restoration efforts.
Impact
Overall, YouTube measured a 10% drop in global views during the incident, while Google Cloud Storage measured a 30% reduction in traffic. Approximately 1% of active Gmail users had problems with their account; while that is a small fraction of users, it still represents millions of users who couldn’t receive or send email. As Gmail users ourselves, we know how disruptive losing an essential tool can be! Finally, low-bandwidth services like Google Search recorded only a short-lived increase in latency as they switched to serving from unaffected regions, then returned to normal.
Next Steps
With all services restored to normal operation, Google’s engineering teams are now conducting a thorough post-mortem to ensure we understand all the contributing factors to both the network capacity loss and the slow restoration. We will then have a focused engineering sprint to ensure we have not only fixed the direct cause of the problem, but also guarded against the entire class of issues illustrated by this event.
Final Thoughts
We know that people around the world rely on Google’s services, and over the years have come to expect Google to always work. We take that expectation very seriously—it is our mission, and our inspiration. When we fall short, as happened Sunday, it motivates us to learn as much as we can, and to make Google’s services even better, even faster, and even more reliable.