2019-02-16

Fedora Infrastructure Detective Work: Mirrorlist 503's

A Mysterious Problem

Recently there has been a large increase in failed yum/dnf updates, with clients getting 503 errors when trying to update their systems. This has caused problems both in the Fedora COPR system and for regular users. Tracking down why this is happening turned into some interesting detective work that took up most of February 8th through February 14th and is still ongoing.

A Scandal in Fedoria

A history of the Fedora MirrorManager software

The Fedora Project mirrorlist system has evolved multiple times in the last 10 years. Originally written by Matt Domsch, it underwent an update and rewrite by Adrian Reber, et al. a couple of years ago. For many years Fedora used a server layout where the front-end web servers would proxy requests over VPN to dedicated mirrorlist servers. While this made sense when systems were slow compared to VPN latency, it had become more troublesome over the last couple of years.

Simplification of the older mirrorlist design

Originally most of the Fedora proxy servers were donated systems from various ISPs, which made them fast on the network but not always fast on CPU. The system was therefore designed to have the proxies mostly serve static content and relay anything computational to servers with more CPU cycles. As hardware improved it made more sense to move the MirrorManager software closer to the proxies, but it wasn't until recently that containers, first with Moby and then with Podman, could be added to the mix.

Simplified version of new mirrorlist design
You may notice that neither of the above images mentions a database. There is one, but it lives in a separate system: mirror administrators register their servers there, and it regularly crawls them to make sure they are up to date. It then creates a Python pickle (pkl) file with the network data of all the current mirrors. This file is pushed to each proxy, which feeds it into new pods that are cycled once an hour in a complicated dance (sketched below):

  1. New pods are created from the base container plus the new pkl and config data.
  2. At 15 minutes after the hour, the first pod is told to drain its existing connections.
  3. When it is drained, it is replaced by a pod using the new data.
  4. Repeat with the second pod to complete the dance.
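
The real orchestration lives in the Fedora infrastructure Ansible playbooks and isn't shown here; the following is only a minimal Python sketch of the idea, assuming two hypothetical pods behind an haproxy backend with its runtime API on a UNIX socket. Every name, path and timeout below is made up for illustration.

#!/usr/bin/env python3
"""Rough sketch of the hourly mirrorlist pod cycle (illustration only)."""
import socket
import subprocess
import time

HAPROXY_SOCK = "/run/haproxy/admin.sock"  # hypothetical admin socket path
BACKEND = "mirrorlists"                   # hypothetical haproxy backend name
DRAIN_GRACE = 300                         # seconds to let old connections finish

def haproxy_cmd(cmd):
    """Send one command to the haproxy runtime API and return its reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(HAPROXY_SOCK)
        sock.sendall(cmd.encode() + b"\n")
        return sock.recv(65536).decode()

def cycle_pod(name):
    # 1. Stop sending new requests to this pod.
    haproxy_cmd(f"set server {BACKEND}/{name} state drain")
    # 2. Give in-flight requests time to finish.
    time.sleep(DRAIN_GRACE)
    # 3. Stop and remove the old pod; podman waits --time seconds before SIGKILL.
    subprocess.run(["podman", "stop", "--time", "60", name], check=False)
    subprocess.run(["podman", "rm", "-f", name], check=False)
    # 4. Start a replacement that picks up the freshly pushed pkl and config.
    subprocess.run([
        "podman", "run", "-d", "--name", name,
        "-v", "/srv/mirrorlist/data:/data:ro,z",  # hypothetical pkl/config location
        "mirrorlist-server:latest",               # hypothetical image name
    ], check=True)
    # 5. Put the new pod back into rotation.
    haproxy_cmd(f"set server {BACKEND}/{name} state ready")

for pod in ("mirrorlist1", "mirrorlist2"):
    cycle_pod(pod)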

The Final 503

Last spring/summer, we started getting reports on various mailing lists about users getting 503s that caused dnf to fail. At first blush the number of failures didn't look too large, as only about 0.2% of all requests resulted in 503s. [We serve on average 20,000,000 requests, and at that time around 45,000 of them were 503s, which was similar to what we had at times with the old VPN infrastructure.] However, on looking further at the logs, it was clear these 45,000 failures were clustered in a specific time frame... 15 minutes after the hour.

Doing some more work, it looked like the timeouts we allowed for the pods to drain their active connections were not long enough. Adding longer timeouts before killing the container brought the number of 503s down dramatically, to an average of 450 per 20 million requests (about 0.002%). This seemed to fix things until December.
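
For what it's worth, podman's default grace period before it sends SIGKILL is only 10 seconds, so the change was conceptually along these lines (the container name and the 10-minute value are illustrative, not the exact ones used):

import subprocess

# podman sends SIGKILL after --time seconds; the 10-second default was not
# long enough for a busy pod to drain, so the grace period was raised.
subprocess.run(["podman", "stop", "--time", "600", "mirrorlist1"], check=True)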

The Adventure of the Empty 503

In July, all our proxy servers were running Fedora 27 and able to get security updates like all good systems. By late November, our proxy servers were still running Fedora 27 and no longer able to get security updates. Kevin Fenzi put in a lot of hours and got both the containers and the proxy servers redeployed with Fedora 29. This allowed us to move to newer versions of podman and other tools.

All seemed well until late January, when a string of reports of 503s started coming in again. At the time we had a couple of proxies with stuck pods for various reasons, and I figured the two were related. However, the reports kept coming after those problems had been fixed, and looking at the logs it was clear that instead of a 'normal' of 6,000 503s a day we were seeing peaks of 400,000 503s a day.

In looking at the combined log data, the first thing that stood out was that the problems were not happening at 15 minutes after the hour. Instead they were mostly happening between :00 and :05 after the hour, with peaks at 00:00, 04:00, 08:00, 12:00, 16:00, and 20:00. These times make a sort of sense: they are commonly chosen for daily jobs, and at the top of the hour there are usually 3-5x more requests than 10 minutes before.

I then looked to see if the problem was tied to a particular client (dnf vs PackageKit vs yum) or to particular versions of those clients, but the failures happened across the board. [The only difference is that yum seems to retry if it gets a 503, while dnf gives a hard stop.] At this point, Michal Novotny asked if I was only looking at combined logs... and I had an 'Aha!' moment. I was looking at the combined logs and had no idea whether this was happening on one server or on all of them. After looking at the original per-server logs it was clear that proxy01.fedoraproject.org was getting the vast majority of the problems (the other proxies would each generate around 2,000 503s in a day while proxy01 would generate 50,000). This again makes sense, as both the COPR build system and several other systems seem to hard-code this server because it is in the main Fedora co-location.
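
None of the log tooling is shown in this post, but the analysis above boils down to counting 503s by minute-of-hour, by hour of day, and by proxy. Here is a quick Python sketch of that, assuming combined-format httpd access logs under a hypothetical per-proxy directory layout:

#!/usr/bin/env python3
"""Count 503 responses by minute-of-hour, hour, and proxy (illustrative sketch)."""
import re
from collections import Counter
from pathlib import Path

# Matches the timestamp and status code of a combined-format access log line,
# e.g. ... [13/Feb/2019:21:02:36 +0000] "GET /mirrorlist?... HTTP/1.1" 503 299 ...
LOG_RE = re.compile(r'\[\d{2}/\w{3}/\d{4}:(\d{2}):(\d{2}):\d{2} [^\]]+\] "[^"]*" (\d{3}) ')

by_minute = Counter()
by_hour = Counter()
by_proxy = Counter()

# Hypothetical layout: one access.log per proxy under /var/log/hosts/<proxy>/.
for logfile in Path("/var/log/hosts").glob("proxy*/access.log"):
    proxy = logfile.parent.name
    with logfile.open(errors="replace") as fh:
        for line in fh:
            match = LOG_RE.search(line)
            if match and match.group(3) == "503":
                by_hour[match.group(1)] += 1    # hour of day
                by_minute[match.group(2)] += 1  # minute after the hour
                by_proxy[proxy] += 1

print("503s by minute after the hour:")
for minute, count in sorted(by_minute.items()):
    print(f"  :{minute}  {count}")

print("503s by hour of day:")
for hour, count in sorted(by_hour.items()):
    print(f"  {hour}:00  {count}")

print("503s by proxy:")
for proxy, count in by_proxy.most_common():
    print(f"  {proxy}  {count}")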

At this point I went to see what logs we had. This took some work because of how we had set up the system, but in the end this popped up.

[Wed Feb 13 21:02:36.774319 2019] [wsgi:error] [pid 26286:tid 140136494905088] (11)Resource temporarily unavailable: [client 10.88.0.1:58258] mod_wsgi (pid=26286): Unable to connect to WSGI daemon process 'mirrorlist' on '/run/httpd/wsgi.9.0.1.sock' after multiple attempts as listener backlog limit was exceeded or the socket does not exist.
[Wed Feb 13 21:02:36.774350 2019] [wsgi:error] [pid 26286:tid 140136520083200] (11)Resource temporarily unavailable: [client 10.88.0.1:58248] mod_wsgi (pid=26286): Unable to connect to WSGI daemon process 'mirrorlist' on '/run/httpd/wsgi.9.0.1.sock' after multiple attempts as listener backlog limit was exceeded or the socket does not exist.
[Wed Feb 13 21:02:36.774443 2019] [wsgi:error] [pid 26286:tid 140136125822720] (11)Resource temporarily unavailable: [client 10.88.0.1:58250] mod_wsgi (pid=26286): Unable to connect to WSGI daemon process 'mirrorlist' on '/run/httpd/wsgi.9.0.1.sock' after multiple attempts as listener backlog limit was exceeded or the socket does not exist.
[Wed Feb 13 21:02:36.774228 2019] [wsgi:error] [pid 26286:tid 140136058681088] (11)Resource temporarily unavailable: [client 10.88.0.1:58228] mod_wsgi (pid=26286): Unable to connect to WSGI daemon process 'mirrorlist' on '/run/httpd/wsgi.9.0.1.sock' after multiple attempts as listener backlog limit was exceeded or the socket does not exist

The socket definitely existed, so I went to look at the backlog limits. Reading through various bug reports and documentation pages, I came up with some mod_wsgi options to try: graceful-timeout=30, request-timeout=30, listen-backlog=1000 and queue-timeout=30. Kevin added them and rebuilt the images, and I rolled them out to the proxies. The number of failures went down dramatically, and I figured it was due to allowing a larger backlog. Michal pointed out that this was impossible, because the kernel caps the listen backlog at 128 and mod_wsgi's setting simply gets truncated to that no matter how much larger I made it. Reading through the man pages again, I realized I had some cargo-culting going on. Michal then pointed out the change that was actually needed, and I rolled it out to proxy01 to see if it would help.
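
Michal's point is easy to demonstrate: whatever backlog a program asks for, the kernel silently truncates it to net.core.somaxconn, which defaulted to 128 on the kernels we were running, so raising mod_wsgi's listen-backlog on its own changes nothing. A small Python illustration:

#!/usr/bin/env python3
"""Show that listen() backlogs are silently capped by net.core.somaxconn."""
import socket

with open("/proc/sys/net/core/somaxconn") as fh:
    somaxconn = int(fh.read())
print(f"net.core.somaxconn = {somaxconn}")

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("127.0.0.1", 0))
sock.listen(1000)  # no error, but the kernel caps the backlog at somaxconn
print(f"asked for a backlog of 1000, effective backlog is {min(1000, somaxconn)}")
sock.close()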

The Last 503?

Currently proxy01 is still seeing 503 failures, but at a lower rate than before. Before the changes it was averaging 50,000 503s a day; since the change it looks to be around 8,000. I need to do more research to see which options will actually help. Increasing the timeouts may have helped, but it may only be masking the problem elsewhere. We will need to look at increasing the number of pods available, though that will increase memory usage and may mean some proxies are no longer usable for mirrormanager. We also need to find out whether other wsgi, haproxy or podman options would help.

In any case it has been a very interesting week of detective work. I hope we can make the use of our mirrors more reliable, and that this has been useful to read. [I also expect several updates and fixes to this article as time goes on.]
