Growth of Fedora Distribution over Time

Growth of the Fedora Distribution over time

There was a conversation in IRC (libera.chat, #fedora-admin) on the amount of disk space that Fedora is using over time. It used to grow astronomically over time, but there was an idea that it might be slowing down.. and then the realization that no one had graphed it. Taking this challenge in hand I decided to look at it. Doing a complete mirror of the data would require me to have a very long time frame and 100+ TB of disk space, but luckily for me, the Fedora mirror system does a du every night and outputs this data to a file, https://dl.fedoraproject.org/pub/DIRECTORY_SIZES.txt

The file covers all the directories that the main download servers have including the archive trees which are where old releases go to live. It also puts it in a ‘human-readable’ format like

egrep '/rawhide$|/releases/[0-9]*$|/updates/[0-9]*$|/updates/testing/[0-9]*$' DIRECTORY_SIZES.txt | egrep -v '^8.0K|^12K|^4.0K|/pub/epel|/pub/alt' > /tmp/dirs
$ grep '/7' /tmp/dirs 
71G /pub/archive/fedora/linux/releases/7
55G /pub/archive/fedora/linux/updates/7
1.5G    /pub/archive/fedora/linux/updates/testing/7

The above takes all the directories we want to worry about and avoid /pub/alt which is a wild west of directories and data. I also want to avoid /pub/epel so I don’t get a mix between EPEL-7 and Fedora Linux 7. It also allows me to save that entire long grep into a file so I don’t repeat if for every time I do the next data manipulation which is:

# Thanks to https://gist.github.com/fsteffenhagen/e09b827430956d7f1de35140111e14c4
grep '/7' /tmp/dirs | numfmt --from=iec | awk 'BEGIN{sum=0} {sum=sum+$1} END{num=split($0,a,"/"); print sum,a[num]}' | numfmt --to=iec
128G 7

This uses a command numfmt that I wish I had known years before as I have ‘replicated’ it repeatedly poorly in awk and python. The first one converts it to an integer, then feeds it to awk which adds it, and then sums all that and prints the output. The conversion is lossy but ok for a quick blog post.

$ cat foobaz.sh 

for i in $( seq 7 35 ); do
     grep "/${i}$" /tmp/dirs | numfmt --from=iec | awk 'BEGIN{sum=0} {sum=sum+$1} END{num=split($0,a,"/"); print sum,a[num]}' | numfmt --to=iec | awk '{print $2","$1}'
$ bash foobaz.sh 

This first run found a problem because 35 should be greater than 200G. However only /pub/fedora/linux/updates/35 and /pub/fedora/linux/updates/testing/ are publically readable. Getting some data from root and we correct this to 35 having 917G. Plotting this in openoffice with some magic we get:

This is ok for a textual map but how about a graph picture. For this we remove the conversion to human readable data (aka M,G,T) and put the data into openoffice for some simple bar graphs. And so here is our data:


After this we can also look at how someone mirroring the distributions over time need more disk space:

The total growth looks to be move from exponential to linear over time. If you wanted to break out into smaller archives, you could put release 1 to 25 on one 10 TB drive, and 26 to 32 on another 10 TB drive as the releases after 26 are usually 1.4 TB in size at the end of their release cycle.


How to clone (a lot of) package spec files from CentOS 8 git.centos.org

Recently I have had to try and work out all the dependencies on a set of packages. I am writing this as a blog, as I needed to recreate work that I had done several times in the past, but past Smoogen had not fully documented. [I went looking and found I had 3 different copies of trees of packages from Fedora and CentOS but absolutely no notes and my bash_history had cycled over the 10k lines I normally keep. Past Smoogen was a BAD, BAD sysadmin.]

For general requires, I would do this by using dnf or dnf

$ dnf repoquery --requires bash
Last metadata expiration check: 2:44:24 ago on Fri 03 Sep 2021 10:14:22 EDT.
filesystem >= 3

However in this case, I needed to also work out all the buildrequires of the packages, and then the requires and buildrequires of those tools. Basically it is sort of building a buildroot for a set of leaf packages which usually means I need to get the spec files and parse them with something like spectool.

If I was working with Fedora, I would take the shallow git clone of the src.fedoraproject.org website which can be found at https://src.fedoraproject.org/lookaside/. Then I would start going down my list of 'known' software I need to work and clone out the usual 'buildroot' a fedpkg mockbuild of the package would give. However I am working with CentOS Stream 8 which is a slightly different repository layout, and does not have a prebuilt shallow clone.

Thankfully, the CentOS repository has a very useful sub-repository called https://git.centos.org/centos-git-common.git which contains all the tools to fetch the appropriate code and tools from the CentOS src repository. The first one I need work with is centos.git.repolist.py to query the pagure API and get a list of packages. I then need to clean up that list a bit because it contains some forks but the following will get me a complete list of the packages I want to parse:

[centos-git-common (master)]$ ./centos.git.repolist.py | grep -v '/forks/' > repolist-2021-09-03 [centos-git-common (master)]$ ./centos.git.repolist.py --namespace modules | grep -v '/forks/' >> repolist-2021-09-03 [centos-git-common (master)]$ sort -o repolist-2021-09-03 -u repolist-2021-09-03 [centos-git-common (master)]$ wc -l repolist-2021-09-03 8769 repolist-2021-09-03

That is a lot more packages than CentOS ships with as there are just over 2600 src.rpm packages in vault.centos.org for AppStream, BaseOS, and PowerTools. What is going on here?

It turns out there are two different events happening:

  1. Buildroot packages.
  2. SIG packages

Buildroot packages are the ones which are not shipped with EL8 but are needed to build EL8. [EL-8 was not meant to be build complete but just the packages that would be supportable by Red Hat.]. SIG packages are various ones which CentOS SIGs are supporting for projects like virtualization, hyperscale, and automotive. I may need to trace through all of these for my own package set, so I decided to clone them all first and then try to figure out what is needed afterwords.

# A silly script to clone all the repositories from git.centos.org in
# order to work out things like buildorder and other tasks.
# Set up our local repo place. I have lots of space on /srv


# loop over the namespaces we want to clone. It would have been nicer
# if there was a third namespace for sigs, but I don't think
# namespaces really happened when centos-7 was setting up git.centos.org.
for namespace in rpms modules; do
    mkdir -vp ${CLONE_DIR}/${namespace}
    cd ${CLONE_DIR}/${namespace}
    for repo in $( grep "/${namespace}/" ${CLONE_DIR}/${CLONE_LIST} ); do
        repodir=$( basename ${repo} )
        git clone ${repo}
        if [[ -d ${repodir} ]]; then
            pushd ${repodir} &> /dev/null
            X=`git branch --show-current`
            if [[ ${X} =~ 'c8s' ]]; then
                echo "${i} ${X}"
                git branch -a | grep c8s &> /dev/null
                if [[ $? -eq 0 ]]; then
                    echo "${repodir} ${X} xxx"
            popd &> /dev/null
        sleep 1

Running this takes a couple of hours, with a lot of errors about empty directories (for git repos which seem to have been created but never 'filled') and for non-existant branches (my script just looks for a c8s but some branches were called slightly differently than that.) In either case, I end up having the git repos I was trying to remember how I got earlier.