2017-09-06

Fedora Project Outage RCA :: DNS Outage 2017-09-06


Early on 2017-09-06, many people attempting to reach fedoraproject.org
found that it had disappeared from the Internet. People attempting to
run 'yum/dnf install', browse the website, or do other Internet-related
activities were getting various error messages saying that the sites no
longer existed in DNS. Some people had no difficulty and were not
able to duplicate the problem, but anyone who was using a DNS server
that had DNSSEC validation turned on was unable to get any IP address
lookups related to the site.

The problem was due to a misconfigured record in the registrar's data
about DNS. The previous week, multiple records had been added by the
registrar to the DNS data in the .org. DNS table. The records were the
DNSSEC records for fedorapeople.org, fedorahosted.org, and
fedoraproject.org, and the registrar had added all of them to
fedoraproject.org. rather than each to its correct zone. On seeing
this, I asked for two of the records to be removed, and somehow
confused which one was to stay. This meant that the key meant for
fedorahosted.org. was left in place for fedoraproject.org., and the
keys for fedoraproject.org and fedorapeople.org were removed.
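
To make the failure mode concrete, here is a minimal sketch of the
check a validating resolver performs. It assumes the records in
question were DS records with SHA-256 digests, computed as in RFC 4034
(a hash over the owner name plus the zone's DNSKEY data). The key
bytes below are made-up placeholders, not Fedora's real keys, and the
helper names are hypothetical.

```python
import hashlib
import struct


def name_to_wire(name: str) -> bytes:
    """Encode a domain name in lowercase DNS wire format."""
    wire = b""
    for label in name.rstrip(".").lower().split("."):
        if label:
            wire += bytes([len(label)]) + label.encode("ascii")
    return wire + b"\x00"


def dnskey_rdata(flags: int, protocol: int, algorithm: int, key: bytes) -> bytes:
    """Build DNSKEY RDATA: flags, protocol, algorithm, then the public key."""
    return struct.pack("!HBB", flags, protocol, algorithm) + key


def ds_digest(owner: str, rdata: bytes) -> str:
    """SHA-256 DS digest: hash of the owner name followed by the DNSKEY RDATA."""
    return hashlib.sha256(name_to_wire(owner) + rdata).hexdigest()


def chain_is_valid(owner: str, parent_ds: set, zone_keys: list) -> bool:
    """True if any key served by the zone matches a DS digest held by the parent."""
    return any(ds_digest(owner, key) in parent_ds for key in zone_keys)


# Placeholder keys (hypothetical bytes, for illustration only).
project_key = dnskey_rdata(257, 3, 8, b"placeholder key for fedoraproject.org")
hosted_key = dnskey_rdata(257, 3, 8, b"placeholder key for fedorahosted.org")

# What the parent zone held during the outage: only the record meant
# for fedorahosted.org, filed under fedoraproject.org.
wrong_ds = {ds_digest("fedorahosted.org.", hosted_key)}
# What it should have held.
right_ds = {ds_digest("fedoraproject.org.", project_key)}

print(chain_is_valid("fedoraproject.org.", wrong_ds, [project_key]))
print(chain_is_valid("fedoraproject.org.", right_ds, [project_key]))
```

With the wrong record in the parent, no key served by the zone matches,
so a validating resolver treats every answer from the zone as bogus and
refuses to return it. A resolver that does not validate never runs this
check, which is why some people saw no problem at all.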

When the registrar updated its .org. data early UTC on 2017-09-06, DNS
servers like Google's 8.8.8.8 would no longer return any addresses
from Fedora's DNS tables. Other DNS servers also stopped working, and
people who work on DNSSEC at the IETF came in to help in case there
was some other problem going on.

After diagnosing the problem, Fedora IT contacted the registrar and
got the correct DNSSEC keys added to the master table. This cleaned up
the problems with many DNS servers, but some will cache the broken data
for up to the TTL of 24 hours, so users were still having problems as
of 2200 UTC 2017-09-06. A temporary fix is to hard-code the main proxy
IP address into /etc/hosts; however, this can cause problems later if
the entry is not removed and the main proxy is down for maintenance.
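
The caching window can be worked out with simple arithmetic. The
sketch below takes the 24 hour TTL from above; the time at which the
corrected records went live is a hypothetical placeholder, since the
exact time is not recorded here. It also assumes a resolver holds the
bad data for the full TTL, which is the worst case; real resolvers may
drop it sooner.

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(hours=24)  # TTL quoted in this post


def latest_stale_until(fix_published_at: datetime, ttl: timedelta = TTL) -> datetime:
    """Worst case: a resolver cached the bad data just before the fix
    went live, so it may keep serving failures for one full TTL."""
    return fix_published_at + ttl


def may_still_be_stale(now: datetime, fix_published_at: datetime,
                       ttl: timedelta = TTL) -> bool:
    """True while some resolver could still hold the broken data."""
    return now < latest_stale_until(fix_published_at, ttl)


# Hypothetical fix time, for illustration only.
fix_time = datetime(2017, 9, 6, 12, 0, tzinfo=timezone.utc)
report_time = datetime(2017, 9, 6, 22, 0, tzinfo=timezone.utc)

print(latest_stale_until(fix_time).isoformat())
print(may_still_be_stale(report_time, fix_time))
```

Under these assumptions, anyone who adds the /etc/hosts workaround
should plan to remove it once the time returned by
latest_stale_until() has passed.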

I would like to thank the members of the IETF DNSSEC group who took
the time to help us through this problem. I would also like to
apologize to everyone who was disrupted by this.