2019-05-30

EPEL Proposal: Steve Gallagher's EPEL 8 Branch Strategy

Stephen Gallagher's Better Proposal for EPEL branching

So earlier this week I wrote up a proposal for EPEL-rawhide which went over various ideas the EPEL steering committee has been kicking around for a bit. It was an attempt to work out how to branch for EPEL-8 and also how to deal with https://fedoraproject.org/wiki/EPEL/Changes/Minor_release_based_composes and https://fedoraproject.org/wiki/EPEL/Changes/Release_based_package_lifetimes. During the meeting it was clear that my strawman didn't have much in it and needed more thinking. Thankfully Stephen Gallagher looked in on the meeting and came up with some ideas that he wrote up and proposed to the list. I recommend that you read the document and its updates if you are interested in how branching in EPEL could work with EL7 and EL8.





2019-05-28

EPEL Proposal: EPEL Master branch AKA Rawhide

EPEL-rawhide

Update:

This proposal has been superseded by Stephen Gallagher's excellent wagontrain post. I will put it up as a separate post next.

tl; dr:

To allow packages to become available faster, add rawhide branches for EPEL-7 and EPEL-8. These branches would let developers build new packages they aren't sure are ready for either EPEL-N or EPEL-N-testing, and would allow for faster rebuilds when RHEL has a large feature change.

The Longer Story

In the past 6 months, EPEL has had to absorb two major changes in its builds, both of which were made harder by the way EPEL is currently built. The first was the set of changes in RHEL-7.6, which dropped some packages and changed the ABIs of others. This required rebuilding a lot of packages, but there was no way to find and fix problems ahead of time, so we ended up with a 'flag-week' of rebuilds, with Troy Dawson and others doing lots of Proven Packager fixes and rebuilds.

The second was the python36 move, which also took a large amount of time and still has small problems showing up here and there. In a similar fashion, updates-testing had to be used as a de facto rawhide for those packages, which made building and testing hard for things not involved in the change.

A third problem showed up when Troy was cleaning out packages in the EPEL-6 and EPEL-7 testing repos which had been there for years. Packagers were using updates-testing as a place for things they felt were too unstable for EPEL proper due to unstable APIs, so they could either iterate quicker or avoid breaking existing users. The problem is that these packages might accidentally get promoted by someone seeing that a package had been tested but never pushed. Keeping these unstable packages in a separate tree requires different thinking.

While reviewing these exercises, the EPEL steering committee came up with various ideas, and I believe Kevin Fenzi brought up adding a rawhide as an easier fix than some of my more convoluted branch-every-release schemes (aka epel-7.6, epel-7.7, epel-7.8). In this new scheme, we would have the following branches: el6, epel7, epel7-master, epel8, epel8-master.

A possible work flow could be the following (a rough sketch of the packager-facing commands follows this list):
  1. When a package is branched for EL7 or EL8, it would get branched into the epel-M-master tree, where builds could be made against the latest RHEL.
  2. When Red Hat releases a new beta (RHEL-M.N-beta), Fedora Infrastructure would download it and set it up so koji could find it as EPEL-M-master (or whatever properly bikeshedded name we choose). A mass update and rebuild would then be done against all packages in EPEL-M-master, followed by breakfixes and testing.
  3. When the General Availability of RHEL-M.N occurs, EPEL will make a copy of EPEL-M.(N-1), EPEL-M.(N-1)-updates and EPEL-M.(N-1)-updates-testing in /pub/archive/epel/M/M.(N-1)/.
  4. After Red Hat releases the General Availability of the RHEL-M.N release,
    1. if the version in master is newer than the version in the branch, the master version will be checked into the branch. (This step is probably the most problematic and needs more work and thinking by people.)
    2. packages which meet certain criteria will then be promoted to EPEL-M with a new compose of EPEL-M.N and an empty EPEL-M.N/updates and EPEL-M.N/updates-testing.
  5. The packager can do updates and fixes to packages in the EPEL-M branch 
  6. The final cleanup of the archives can occur.
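
To make the packager-facing side more concrete, here is a hypothetical fedpkg session under this proposal. The epel8-master branch does not exist in dist-git today, and the merge in the second half is exactly the step flagged above as needing more thought.

# Hypothetical flow: 'epel8-master' is a proposed branch, not an existing one.
fedpkg clone mypackage && cd mypackage
fedpkg switch-branch epel8-master      # builds run against the latest RHEL 8 beta/GA
fedpkg build

# Later, after RHEL-8.N goes GA and the package meets the promotion criteria:
fedpkg switch-branch epel8
git merge epel8-master                 # step 4.1 above; exact mechanics still undecided
fedpkg build && fedpkg update
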
This is a preliminary proposal which needs a lot more work, as well as resource commitments for changes to tooling and documentation. I am bringing this up as something I would like to get done as part of revamping EPEL this summer, but I also need feedback and help.

EPEL Proposal: Removal of PPC64 (Not PPC64le) on 2019-06-01

TL; DR:

EPEL is looking to move its EL6 and EL7 PPC64 trees into the archives by 2019-06-01. This is because Fedora no longer builds for the big-endian PPC64 architecture.

The long story

As of the EOL of Fedora 28, the Fedora Project no longer supports or builds packages for the big-endian Power64 (or ppc64) architecture. Kevin Fenzi went over this in his blog article, but I wanted to go over it again. I realize this is short notice, so some extra steps will need to be taken.

The Fedora Project uses Fedora Linux on its builders, which is useful for bringing up new architectures and for getting new features which RHEL does not have yet. However, it means that when a release reaches End of Life, it no longer gets security updates, software improvements, or similar fixes. We could try to stand up an EL7 builder, but it would require reworking both tools and scripts that expect an F28 world (python3, various newer libraries and scripts, different APIs, etc). That would take a while, plus the continual work of keeping this builder in line with whatever EL8/F30+ world we move to in the coming months. Secondly, this would cut into a limited resource: we only have so many POWER8 systems on which we can run PPC64 virtual machines. The virtual machines can either build an EPEL package or a Fedora <29 package, but we would be limiting this down to just EPEL.
In the end, the number of PPC64 users is not that great. We have an average of 90 such systems checking in per day, compared with many more PPC64LE systems. I think most of the PPC64 users would be able to get what they need from the archives just as well.

How do I get my stuff

The builds for EL6.10 and EL7.6 will be archived to /pub/archives/epel/7/7.6 and /pub/archives/epel/6/6.10 this week. We may need to roll out an updated epel-release which will point this architecture to that tree. We will then remove the builders from Fedora and stop building for it. In early July I will remove the remaining trees from /pub/epel and put in redirects to the archives.
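
For anyone who wants to keep pulling the archived ppc64 packages before the updated epel-release lands, a repo file along these lines should work. This is only a sketch: the hostname and exact directory layout are my assumptions, not the final published URL.

# Illustrative only: the archive host and final path may differ slightly.
cat > /etc/yum.repos.d/epel-archive-ppc64.repo <<'EOF'
[epel-archive]
name=EPEL 7.6 archive (ppc64)
baseurl=https://archives.fedoraproject.org/pub/archives/epel/7/7.6/ppc64/
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
EOF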

2019-03-14

Final 503 addendum

mirrorlist 503's for 2019
This is a graph of the number of 503's we have had in 2019. The large growth seen earlier in January/February has dropped down to just one web server, which is probably underpowered for running containers. We will look at taking it out of circulation in the coming weeks.

2019-03-13

EPEL: Python34->Python36 Move Happening (Currently in EPEL-testing)

Over the last 5 days, Troy Dawson, Jeroen van Meeuwen, Carl W George, and several helpers have gotten nearly all of the python34 packages moved over to python36 in EPEL-7. They are being included in 6 Bodhi pushes because of a limit in Bodhi on the size of the package list in a single update.

The current target date for these package groups to move into the regular EPEL repository is April 2nd. We would like any additional tests we find in the next week or so to be added as well, so that the updates can land as one large group without too much breakage.


https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2019-f2d195dada
https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2019-9e9f81e581
https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2019-0d62608bce
https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2019-5be892b745
https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2019-0f4cca7837
https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2019-ed3564d906

Please heavily test them by doing the following:

Stage 1 Testing

  1. Install RHEL, CentOS, or Scientific Linux 7 onto a TEST system.
  2. Install or enable the EPEL repository for this system
  3. Install various packages you would normally use
  4. yum --enablerepo=epel-testing update
  5. Report problems to epel-devel@lists.fedoraproject.org

Stage 2 Testing

  1. Check for any updated testing instructions on this blog or EPEL-devel list.
  2. Install RHEL, CentOS, or Scientific Linux 7 onto a TEST system.
  3. Install or enable the EPEL repository for this system
  4. yum install python34
  5. yum --enablerepo=epel-testing update
  6. Report problems to epel-devel@lists.fedoraproject.org

Stage 3 Testing

  1. Check for any updated testing instructions on this blog or EPEL-devel list.
  2. Install RHEL, CentOS, or Scientific Linux 7 onto a TEST system.
  3. Install or enable the EPEL repository for this system
  4. yum install python36
  5. yum --enablerepo=epel-testing update
  6. Report problems to epel-devel@lists.fedoraproject.org
This should cover the three most common scenarios. Other scenarios exist and will require some sort of intervention to work around. We will outline them as they come up.

Many Many Thanks go to Troy, Jeroen, Carl, and the many people on the python team who made a copr and did many of the initial patches to make this possible.

2019-02-19

503's.. the cliffnotes version

So I spent some time this weekend trying to show where five 12-hour days of analyzing data went. This first graph shows the number of successful mirror requests and breaks down the 503 errors per server.
All mirror requests since 2018-01
The two drops are where logs failed to be copied to the central system, rather than a problem with the mirrors. You can see that Monday through Friday Fedora sees a lot of requests, and then Saturday and Sunday we see a dip. You can also see that we have had an increase in usage since November. The tiny area at the bottom is the number of 503's, which kind of makes it look unimportant. [Unless you are doing a lot of builds and keep running into it.]

Just the 503's sir.
The above is a graph of just the 503's, broken down by each server which sees them. In January there was a large increase of 503's on 2 servers. The first is proxy11, which is in Europe and may be underpowered for what we are asking it to do. The second is proxy01, which a lot of sites have hard-coded. If the above graph were done by the minute, you would see the errors as many tiny spikes at :02 -> :10 minutes after the hour, with most of the day empty.

The graphs go to 2019-02-15 and the last 4 days have shown a decrease in 503's but proxy01 and proxy11 are still having several thousand a day. I am still looking at other fixes we can do to make this a less painful experience for people when the top of the hour occurs.

2019-02-16

Fedora Infrastructure Detective Work: Mirrorlist 503's

A Mysterious Problem

Recently there has been a large increase in failed yum/dnf updates, with consumers getting 503 errors when trying to update their systems. This has caused problems in both the Fedora COPR system and for normal users. Finding why this problem was occurring made for some interesting detective work that took most of February 8th through February 14th and is still ongoing.

A Scandal in Fedoria

A history of the Fedora Mirrormanager software

The Fedora Project mirrorlist system has evolved multiple times in the last 10 years. Originally written by Matt Domsch, it underwent an update and rewrite by Adrian Reber, et al. a couple of years ago. For many years Fedora used a server layout where the front-end web servers would proxy the data over VPN to dedicated mirrorlist servers. While this made sense when systems were a bit slower compared to VPN latency, it has become more troublesome over the last couple of years.

Simplification of the older mirrorlist design

Originally most of the Fedora proxy servers were donated systems from various ISPs, which made them network fast but not always CPU fast. This meant the system was designed to make the proxies mostly serve static content and relay anything computational to servers with more CPU cycles. As systems improved it made more sense to move the mirrormanager software closer to the proxies; however, it wasn't until recently that containers, first with Moby and then Podman, could be put into the mix.

Simplified version of new mirrorlist design
You will notice that neither of the above images mentions a database. There is one, but it is a separate system into which the various mirror administrators register their mirrors; it regularly crawls them to make sure they are up to date. It then creates a python pickle (pkl) with the network data of the up-to-date ones. This is then pushed to each proxy, which feeds it into new pods that are cycled once an hour in a complicated dance (roughly sketched after the list below).

  1. New pods are created from base container+new pkl and config data
  2. At 15 minutes after the hour, the first pod is told to drain out its old users.
  3. When it is drained, it is cycled with the new data.
  4. Repeat with the second pod and complete the dance.
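
To make the dance above a little more concrete, here is a heavily simplified sketch in shell. The pod names, image tag, and drain step are all illustrative; the real setup is managed by Fedora Infrastructure's configuration management and differs in the details.

# Illustrative only: names, tag, and the drain mechanism are stand-ins.
TAG=$(date +%Y%m%d%H)
podman build -t "mirrorlist:$TAG" .    # base container + new pkl + config data

for pod in mirrorlist-a mirrorlist-b; do
    # placeholder: stop routing new requests to $pod, wait for it to drain
    drain_and_wait "$pod"
    podman stop "$pod" && podman rm "$pod"
    podman run -d --name "$pod" "mirrorlist:$TAG"
done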

The Final 503

Last spring/summer, we started getting reports on various mailing lists about users getting 503's that caused dnf to fail. At first blush the number of failures didn't look too large, as less than 0.2% of all requests resulted in 503's. [We serve on average 20,000,000 requests, and at that time 45,000 of those were 503's, which was similar to what we had at times with the old VPN infrastructure.] However, on looking further at the logs, it was clear these 45,000 connections were happening in a specific time frame... 15 minutes after the hour.

Doing some more work, it looked like the timeouts we used while waiting for the pods to drain of active connections were not long enough. Adding longer timeouts before killing the container brought the number of 503's down dramatically, to an average of 450 per 20 million requests (roughly 0.002%). This seemed to fix things until December.

The Adventure of the Empty 503

In July, all our proxy servers were running Fedora 27 and able to get security updates like all good systems. In late November, our proxy servers were still running Fedora 27 and no longer able to get security updates. Kevin Fenzi put in a lot of hours and got both the containers and the proxy servers redeployed with Fedora 29. This allowed us to move to newer versions of podman and other tools.

All seemed well until late January, when a string of reports of 503's started coming in again. At the time we had a couple of proxies with stuck pods for various reasons, and I figured the two were related. However, the reports kept coming after those problems had been fixed, and looking at the logs it was clear that instead of a 'normal' 6,000 503's a day, we were seeing peaks of 400,000 503's a day.

In looking at the combined log data, the first thing that stood out was that the problems were not happening at 15 minutes after the hour. Instead they were mostly happening in the :00 -> :05 window after the hour, with peaks at 00:00, 04:00, 08:00, 12:00, 16:00, and 20:00. These times make a sort of sense: they are commonly chosen for running daily jobs, and at the top of the hour there are usually 3-5x more requests than 10 minutes before.
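
For anyone wanting to do similar slicing on their own logs, something along these lines gives a per-minute breakdown; the file name is a stand-in, and this assumes Apache's combined log format, which is what our proxies use.

# Count 503 responses per minute of the day.  In the combined log format
# $9 is the status code and $4 looks like "[13/Feb/2019:21:02:36", so
# substr($4, 14, 5) pulls out the HH:MM.
awk '$9 == 503 { print substr($4, 14, 5) }' access.log | sort | uniq -c | sort -rn | head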

I then looked to see if the problem was tied to a particular client (dnf vs PackageKit vs yum) or to particular versions of those clients, but the failures happened across the board. [The only difference is that yum seems to retry if it gets a 503 while dnf gives a hard stop.] At this point, Michal Novotny asked if I was just looking at combined logs... and I had an 'Aha!' moment. I was looking at the combined logs and had no idea whether this was on one server or many. After looking at the per-server logs it was clear that proxy01.fedoraproject.org was getting the vast majority of the problems (the other proxies would each generate roughly 2,000 a day while proxy01 would generate 50,000). This again makes sense, as both the COPR build system and several other systems seem to hard-code this server because it is in the main Fedora co-location.

At this point I went to see what logs we had. This took some work because of how we had set up the system, but in the end this popped up.

[Wed Feb 13 21:02:36.774319 2019] [wsgi:error] [pid 26286:tid 140136494905088] (11)Resource temporarily unavailable: [client 10.88.0.1:58258] mod_wsgi (pid=26286): Unable to connect to WSGI daemon process 'mirrorlist' on '/run/httpd/wsgi.9.0.1.sock' after multiple attempts as listener backlog limit was exceeded or the socket does not exist.
[Wed Feb 13 21:02:36.774350 2019] [wsgi:error] [pid 26286:tid 140136520083200] (11)Resource temporarily unavailable: [client 10.88.0.1:58248] mod_wsgi (pid=26286): Unable to connect to WSGI daemon process 'mirrorlist' on '/run/httpd/wsgi.9.0.1.sock' after multiple attempts as listener backlog limit was exceeded or the socket does not exist.
[Wed Feb 13 21:02:36.774443 2019] [wsgi:error] [pid 26286:tid 140136125822720] (11)Resource temporarily unavailable: [client 10.88.0.1:58250] mod_wsgi (pid=26286): Unable to connect to WSGI daemon process 'mirrorlist' on '/run/httpd/wsgi.9.0.1.sock' after multiple attempts as listener backlog limit was exceeded or the socket does not exist.
[Wed Feb 13 21:02:36.774228 2019] [wsgi:error] [pid 26286:tid 140136058681088] (11)Resource temporarily unavailable: [client 10.88.0.1:58228] mod_wsgi (pid=26286): Unable to connect to WSGI daemon process 'mirrorlist' on '/run/httpd/wsgi.9.0.1.sock' after multiple attempts as listener backlog limit was exceeded or the socket does not exist

The socket definitely existed, so I went to look at backlog limits. Reading through various bug reports and documentation pages, I settled on some options to try: graceful-timeout=30 request-timeout=30 listen-backlog=1000 queue-timeout=30. Kevin added them, rebuilt the images, and I rolled them out to the proxies. The number of failures went down dramatically, and I figured it was due to allowing a larger backlog. Michal pointed out that was impossible, because the kernel caps the socket listen backlog at 128 by default and wsgi will just fall back to that no matter how much larger I made the setting. Reading through the man pages again, I realized I had some cargo-culting going on. Michal then pointed out the change that was actually needed, and I rolled it out to proxy01 to see if it would help.
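
For reference, those are options to mod_wsgi's WSGIDaemonProcess directive; a rough sketch of that kind of tuning is below. The process and thread counts are made up, and as Michal pointed out, listen-backlog is silently capped by the kernel's socket backlog (net.core.somaxconn, 128 by default) unless that is raised as well.

# Sketch only: process/thread counts are illustrative, not our real config.
WSGIDaemonProcess mirrorlist processes=4 threads=2 \
    graceful-timeout=30 request-timeout=30 \
    queue-timeout=30 listen-backlog=1000
WSGIProcessGroup mirrorlist

# The larger listen-backlog only takes effect if the kernel allows it:
#   sysctl -w net.core.somaxconn=1024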

The Last 503?

Currently proxy01 is still seeing 503 failures, but at a lower rate than before. Before the changes it was averaging 50,000 503's a day; since the change it looks to be around 8,000. I need to do more research to see what options will help. Increasing the timeouts may have helped, but it may only be masking the problem elsewhere. We will need to look at increasing the number of pods that are available, though that will increase memory usage and may mean some proxies are no longer usable for mirrormanager. We also need to find out whether other wsgi, haproxy, or podman options would help.

In any case it has been a very interesting week of detective work. I hope we can make the usage of our mirrors more reliable, and that this has been useful to read. [I also expect several updates and fixes to this article as time goes on.]

2019-02-15

Proposed Change To EPEL Policies: Release based package lifetimes

https://fedoraproject.org/wiki/EPEL/Changes/Release_based_package_lifetimes

Summary

The change moves expected package lifetime to that of a Red Hat Enterprise Linux minor release.

Owner

  • Name: Stephen Smoogen

Detailed Description

Extra Packages for Enterprise Linux is a sub-project of Fedora which recompiles various Fedora packages against various Red Hat Enterprise Linux releases. When EPEL was started, RHEL lifetimes were 5 years and it was thought that the repository could have similarly long lifetimes. Major rebasing of versions was not allowed, and fast-moving software was frowned on.

Over time, the lifetime of a RHEL release grew, and the way RHEL releases rebased themselves over time also changed. This has meant that packagers in EPEL were bound to support software longer than RHEL did and were unable to rebase the way RHEL could.

The current proposal is closely linked to release based composes but does not depend on it. The changes are as follows:

Packagers commit to maintaining software in EPEL for at least one RHEL minor release or 13 months, whichever is shorter. If a packager needs to stop maintaining a piece of software, they should do the following:
  • announce on epel-devel list that they are no longer able to maintain the package and give the reasons. If someone can take it over they can do so here.
  • release one final version with an additional file named README-RETIRED.txt in the %doc section which says the software is retired.
  • wait until that package arrives in updates.
  • let the EPEL release manager know that the package needs to be retired for the next release.

Otherwise the latest update of the package will be retagged for the next compose of EPEL and will appear without problems.

If the situation requires a major ABI/API change (security, a new minor compose, a new LTS version of the package), the packager should announce it on epel-devel and file a ticket in the EPEL pagure instance to track it. Updates can then be made and will show up in either /updates/ or the next major.minor compose.

Benefit to Fedora

  • Packagers will be more likely to make their packages available in EPEL.
  • Users of EPEL will have more software and know it is being kept up to date.

Proposed Change to EPEL Policies: Minor Release Based Composes

I proposed this on the EPEL list earlier this week, but realized it needed more views, so I am putting a version on the blog. The canonical version is at https://fedoraproject.org/wiki/EPEL/Changes/Minor_release_based_composes

Summary

The change moves EPEL to composes made roughly every six months and adds an updates tree for consumers. Package trees will have a naming structure similar to Fedora release names, and will be regularly archived off to /pub/archives after the next minor release.

Package lifetimes will be similarly affected, with the expected minimum 'support' lifetime of any package being that of a minor release.

Owner

  • Name: Stephen Smoogen
  • Email: smooge@fedoraproject.org

Detailed Description


Extra Packages for Enterprise Linux is a sub-project of Fedora which recompiles various Fedora packages against various Red Hat Enterprise Linux releases. Currently these packages are composed by the Fedora release engineering tools similarly to Fedora Rawhide, where only the latest packages tagged for a particular EPEL release (for example, epel-7) appear. When a package is updated or retired, the older version falls out of the EPEL repositories and users cannot downgrade to older versions.

The tree structure would look like the following:

/pub/epel/releases/Major.Minor.YYYYMM/{Stuff,Modular}/{x86_64,ppc64le,aarch64,s390x,source}/{Packages,repodata}
/pub/epel/updates/Major.Minor.YYYYMM/{Stuff,Modular}/{x86_64,ppc64le,aarch64,s390x,source}/{Packages,repodata}

with 2 sets of symlinks pointing to the latest supported tree:

/pub/epel/releases/7.06.201903/Stuff/x86_64/Packages/
/pub/epel/releases/8.00.2019MM/Modular/source/Packages/

/pub/epel/updates/7.06.201903/Stuff/aarch64/Packages/
/pub/epel/updates/8.00.2019MM/Modular/s390x/Packages/

/pub/epel/releases/7 -> /pub/epel/releases/7.06.201903
/pub/epel/updates/7  -> /pub/epel/updates/7.06.201903

The proposed change will move EPEL minor releases to match Red Hat minor releases. For example, on 2019-02-10, Red Hat has Red Hat Enterprise Linux 7.6 in Full Support/Maintenance mode and has released a beta for RHEL-8. With the change, EPEL would compose a special set of trees for March 2019, and then all updates for those packages would go into the appropriate /updates/ sub-tree.

When the next minor release comes out from Red Hat, a new compose tag will be made in koji. Packages which are not retired will then be pulled into the tag, and proven packagers can stage any mass rebuilds needed for rebases. When testing shows that this tree is ready, the symlinks will be moved to the new tree, and the old tree will be prepared for the move to /pub/archives (a rough sketch of the flip is below).

Packages which are added to EPEL after a major.minor compose will only show up in the /updates/ tree until the next major.minor compose. This is the same as packages in Fedora.
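
As an illustration of the flip, the symlinks could be swapped atomically on the master mirror with something like the commands below. The 7.07.201909 compose name is purely hypothetical, and the actual releng tooling for this step is still to be written.

# Hypothetical: point the '7' symlinks at a new 7.07.201909 compose without
# ever leaving them dangling (create a new link, then rename it over the old one).
ln -sfn 7.07.201909 /pub/epel/releases/.7.new
mv -T /pub/epel/releases/.7.new /pub/epel/releases/7
ln -sfn 7.07.201909 /pub/epel/updates/.7.new
mv -T /pub/epel/updates/.7.new /pub/epel/updates/7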

Because RHEL-6 has only a year and a half of Maintenance mode left, we are not looking at making these changes for it unless it turns out that we have to.

Benefit to Fedora

  • Users are able to downgrade to an earlier version if an update has issues for them
  • Packages will be less likely to break for Scientific Linux and CentOS users during minor updates.
  • Packages will be regularly archived off to /pub/archives/epel/ and consumers will be less likely to pull large numbers of old packages out of koji looking for older software.

Scope

  • Proposal Owners:
    • Add koji tags for date releases
    • Write scripts for releng to make new releases
  • Other developers:
  • Policies and guidelines:
  • Trademark approval: N/A

Known Change Impacts


  • Users who have hard-coded repositories or mirror specific directories may have a 'broken' experience due to symlink changes.
  • Archives will grow every 6 months as packages are put there regularly.

How To Test


We will stage this set of changes in /pub/alt/epel before rolling them out to the main EPEL trees. Users wishing to test this can set up systems pointing at that tree using a temporary .repo file that will be provided for testing; a rough idea of what it might look like is below.
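
Something along these lines, though the real file will be provided when the staging tree is ready; the hostname, path, and repo id here are placeholders built from the layout above.

# Placeholder only: the official test .repo file will be provided separately.
cat > /etc/yum.repos.d/epel-minor-release-test.repo <<'EOF'
[epel-minor-release-test]
name=EPEL minor-release compose test
baseurl=https://dl.fedoraproject.org/pub/alt/epel/releases/7.06.201903/Stuff/$basearch/
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
EOF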

User Experience


Users should be able to do the following:
  • yum install package-name
  • yum upgrade package-name
  • yum downgrade package-name

Contingency Plan


  • Contingency mechanism: If testing shows this proposal to fail, we will not release.
  • Contingency deadline: 2019-04-01
  • Blocks release? No (we don't do releases yet).
  • Blocks product? Yes


Release Notes


Discussed on the epel-devel@lists.fedoraproject.org list, with updates taken into consideration.

2019-01-16

NOTICE: Epylog has been retired for Fedora Rawhide/30

Epylog is a log analysis tool written by Konstantin ("Icon") Ryabitsev when he was working at Duke University in the early 2000's. It was moved to FedoraHosted and then never got moved to other hosting afterwards. The code is written in early python2 syntax (maybe 2.2) and has been hacked over time to work with newer versions, but has not seen any major development since 2008. I have been sort of looking after the package in Fedora with the hope of a 'rewrite for Python3' that I never got done. [This is on me, as I have been licking the cookie here.]

Because porting it would require a lot of work, and Python 2's End of Life is coming up in a year, I retired it from rawhide so that it would not branch to Fedora 30. I would recommend that users of epylog look for newer replacements (we in Fedora Infrastructure will be doing so, and I will post any recommendations as time goes by).