2021-09-25

Growth of Fedora Distribution over Time

There was a conversation on IRC (libera.chat, #fedora-admin) about the amount of disk space that Fedora is using over time. It used to grow astronomically, but there was an idea that it might be slowing down, and then the realization that no one had graphed it. Taking this challenge in hand, I decided to look at it. A complete mirror of the data would take me a very long time and 100+ TB of disk space, but luckily for me the Fedora mirror system does a du every night and writes the output to a file: https://dl.fedoraproject.org/pub/DIRECTORY_SIZES.txt

The file covers all the directories on the main download servers, including the archive trees, which are where old releases go to live. The sizes are in a ‘human-readable’ format, like:

egrep '/rawhide$|/releases/[0-9]*$|/updates/[0-9]*$|/updates/testing/[0-9]*$' DIRECTORY_SIZES.txt | egrep -v '^8.0K|^12K|^4.0K|/pub/epel|/pub/alt' > /tmp/dirs
$ grep '/7' /tmp/dirs 
71G /pub/archive/fedora/linux/releases/7
55G /pub/archive/fedora/linux/updates/7
1.5G    /pub/archive/fedora/linux/updates/testing/7

The above takes all the directories we want to worry about and avoids /pub/alt, which is a wild west of directories and data. I also want to avoid /pub/epel so I don’t get a mix between EPEL-7 and Fedora Linux 7. Saving the result of that long grep into a file also means I don’t have to repeat it every time I do the next data manipulation, which is:

# Thanks to https://gist.github.com/fsteffenhagen/e09b827430956d7f1de35140111e14c4
grep '/7' /tmp/dirs | numfmt --from=iec | awk 'BEGIN{sum=0} {sum=sum+$1} END{num=split($0,a,"/"); print sum,a[num]}' | numfmt --to=iec
128G 7

This uses a command, numfmt, that I wish I had known about years ago, as I have repeatedly and poorly ‘replicated’ it in awk and python. The first numfmt converts the human-readable sizes to integer byte counts. awk then adds them up and prints the sum along with the release number (the last component of the path), and the second numfmt converts the total back to human-readable form. The conversion is lossy but ok for a quick blog post.
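To make the round trip a bit more concrete, here is a minimal sketch using the three release 7 sizes shown earlier. The helper name sum_sizes and the shortened /a, /b, /c paths are made up for illustration; only numfmt and awk are assumed to be installed (GNU coreutils).

```shell
#!/bin/bash
# sum_sizes (hypothetical helper): read "SIZE PATH" lines on stdin,
# convert the first field from IEC units (K, M, G, T) to bytes,
# and print the total as a plain integer.
sum_sizes() {
    numfmt --from=iec | awk '{ sum += $1 } END { printf "%.0f\n", sum }'
}

# The three sizes for release 7 from /tmp/dirs, with shortened sample paths.
sample='71G /a/releases/7
55G /b/updates/7
1.5G /c/updates/testing/7'

# Total in bytes.
printf '%s\n' "$sample" | sum_sizes                      # 136902082560

# Back to human-readable form. 127.5G is rounded away from zero to 128G,
# which is the kind of loss mentioned above.
printf '%s\n' "$sample" | sum_sizes | numfmt --to=iec    # 128G
```

Keeping the byte count around until the very last step means the rounding only happens once, at display time.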

$ cat foobaz.sh 
#!/bin/bash

for i in $( seq 7 35 ); do
     grep "/${i}$" /tmp/dirs | numfmt --from=iec | awk 'BEGIN{sum=0} {sum=sum+$1} END{num=split($0,a,"/"); print sum,a[num]}' | numfmt --to=iec | awk '{print $2","$1}'
done
$ bash foobaz.sh 
7,128G
8,153G
9,286G
10,207G
11,266G
12,267G
13,202G
14,229G
15,371G
16,388G
17,594G
18,669G
19,600G
20,639G
21,730G
22,804G
23,865G
24,816G
25,821G
26,971G
27,1.1T
28,1.2T
29,1.1T
30,1.3T
31,1.1T
32,1.2T
33,1.3T
34,1.3T
35,200G

This first run exposed a problem, because 35 should be larger than 200G. However, only /pub/fedora/linux/updates/35 and /pub/fedora/linux/updates/testing/ are publicly readable. Getting the remaining data as root corrects this: 35 comes to 917G. Plotting this in OpenOffice with some magic we get:

This is ok for a textual map, but how about a graph? For this we remove the conversion to human-readable units (M, G, T) and put the raw numbers into OpenOffice for some simple bar graphs. And so here is our data:


 

After this we can also look at how someone mirroring the distribution needs more disk space over time:


The total growth looks to have moved from exponential to linear over time. If you wanted to break the archive into smaller pieces, you could put releases 1 to 25 on one 10 TB drive and 26 to 32 on another 10 TB drive, as the releases after 26 are usually 1.4 TB in size at the end of their release cycle.
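The running total behind that cumulative view can be sketched in a few lines of shell. This is an illustration rather than the exact commands used for the graphs: the helper name cumulative is made up, and it assumes input in the same "release,size" form that foobaz.sh prints.

```shell
#!/bin/bash
# cumulative (hypothetical helper): read "release,size" lines on stdin,
# with sizes in IEC units such as 128G or 1.1T, and print the running
# total of disk space needed to mirror every release up to that one.
cumulative() {
    numfmt --delimiter=, --field=2 --from=iec |
        awk -F, '{ total += $2; printf "%s,%.0f\n", $1, total }' |
        numfmt --delimiter=, --field=2 --to=iec
}

# First three rows of the foobaz.sh output.
printf '7,128G\n8,153G\n9,286G\n' | cumulative
# 7,128G
# 8,281G
# 9,567G
```

Feeding the whole foobaz.sh output through it (with 35 corrected to 917G) gives one row per release. Because the inputs were already rounded to G or T, the totals are only approximate.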
 

1 comment:

Unknown said...

IJWTS that I also have recently discovered `numfmt` after reproducing it badly in all kinds of languages over the years.