Speeding up package management.
Package management lets users install software easily while the package manager locates and downloads all of that software's dependencies. The software itself is stored in so-called “package repositories”, which typically consist of:
- The software packages themselves
- Metadata describing those packages
The metadata tells the package manager what software exists within the repository and provides information about each piece of software, such as which other packages an individual package depends on, along with descriptions and package contents.
In RPM-based distros the rpm-md (RPM metadata) format is becoming fairly standard, although there are many other formats.
Before the package manager can be used it must (either manually or automatically) check whether any of the available repositories have been updated. If they have, it must update its local cache of the repository metadata so that it knows the correct locations of packages. This update operation can be slow (especially for dial-up users) because the metadata for a large repository can be quite large, particularly with the XML-based rpm-md format. For example, the main openSUSE repository has 55 MB of compressed rpm-md metadata, and that’s with the source and non-free packages stripped out.
This slow repository refresh becomes most annoying when the package manager is dealing with largish repositories which change frequently, such as the online update repository, packman repository, and guru repositories for SUSE. For example, last night the guru repository was updated with a half-dozen new packages. That means a 1.7 MB metadata download for every user subscribed to this repository, just so that their package manager knows these 6 new packages are there.
Most of the data downloaded in each refresh is redundant: the package manager must re-download all of it just to learn that 99.9% of the packages are exactly the same as they were the last time it checked. Clearly it would be preferable to download only the differences in metadata. Various solutions have been considered. Duncan is looking at zsync, which looks very interesting. Intrigued by how much improvement users might see from metadata diffs, I thought I’d investigate the potential gains.
Using last night’s guru metadata change as an example, I wrote a simple utility to read the XML and create a new XML file containing the packages which were added and those which were deleted in the last update. This turned out to be trivial, as rpm-md identifies each package by its checksum. The result:
Complete metadata (currently downloaded on each update): 1.7 MB total.
Differences to last metadata (in same XML format): 5.4 KB total.
Time to refresh on dial-up would decrease from about 5 minutes to less than a second.
The time to refresh a repository would be proportional to the number of changes, rather than the size of the repository.
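The checksum comparison described above can be sketched roughly as follows. This is not the utility I used, just a minimal illustration: it assumes rpm-md `primary.xml` documents whose `package` elements carry `name` and `checksum` children, and the function names and the tiny sample data (`_sample`, `foo`, `bar`, `baz`) are made up for the example.

```python
import xml.etree.ElementTree as ET

# XML namespace of rpm-md primary.xml documents.
NS = {"md": "http://linux.duke.edu/metadata/common"}


def packages_by_checksum(primary_xml):
    """Map each package's checksum to its name."""
    root = ET.fromstring(primary_xml)
    result = {}
    for pkg in root.findall("md:package", NS):
        checksum = pkg.findtext("md:checksum", namespaces=NS)
        name = pkg.findtext("md:name", namespaces=NS)
        result[checksum] = name
    return result


def metadata_diff(old_xml, new_xml):
    """Return (added, removed) packages, keyed by checksum."""
    old = packages_by_checksum(old_xml)
    new = packages_by_checksum(new_xml)
    added = {c: n for c, n in new.items() if c not in old}
    removed = {c: n for c, n in old.items() if c not in new}
    return added, removed


def _sample(packages):
    """Build a tiny, heavily simplified primary.xml (illustration only)."""
    body = "".join(
        f'<package type="rpm"><name>{name}</name>'
        f'<checksum type="sha" pkgid="YES">{checksum}</checksum></package>'
        for name, checksum in packages
    )
    return (
        '<metadata xmlns="http://linux.duke.edu/metadata/common">'
        f"{body}</metadata>"
    )


old_xml = _sample([("foo", "aaa"), ("bar", "bbb")])
new_xml = _sample([("foo", "aaa"), ("bar", "ccc"), ("baz", "ddd")])
added, removed = metadata_diff(old_xml, new_xml)
print(sorted(added.values()))    # ['bar', 'baz']
print(sorted(removed.values()))  # ['bar']
```

Note that an updated package shows up as one removal (the old checksum) plus one addition (the new checksum), which is why `bar` appears in both lists.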
There are two ways the diffs could be provided. Either the server maintains a diff from every past revision of the repository to the current revision, or it creates just one new diff on each update (from the previous revision) and the client chains the diffs together to work out the cumulative changes.
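To illustrate the second option, here is a rough sketch of how a client might chain per-revision diffs. It assumes each diff is simply a pair of checksum sets `(added, removed)`, listed oldest first; the names `chain_diffs` and `apply_diff` are hypothetical and not part of any existing package manager.

```python
def chain_diffs(diffs):
    """Combine per-revision diffs (oldest first) into one cumulative diff."""
    added, removed = set(), set()
    for step_added, step_removed in diffs:
        for checksum in step_removed:
            if checksum in added:
                # Added by an earlier diff, removed now: cancels out.
                added.discard(checksum)
            else:
                removed.add(checksum)
        for checksum in step_added:
            if checksum in removed:
                # Removed by an earlier diff, re-added now: cancels out.
                removed.discard(checksum)
            else:
                added.add(checksum)
    return added, removed


def apply_diff(cache, diff):
    """Apply one (added, removed) diff to a cached set of checksums."""
    added, removed = diff
    return (cache - removed) | added


# The client's cache is two revisions behind the server.
cache = {"aaa", "bbb"}
diffs = [
    ({"ccc"}, {"bbb"}),          # revision 1 -> 2
    ({"ddd", "bbb"}, {"ccc"}),   # revision 2 -> 3
]
cumulative = chain_diffs(diffs)
print(cumulative)  # ({'ddd'}, set())
print(sorted(apply_diff(cache, cumulative)))  # ['aaa', 'bbb', 'ddd']
```

The trade-off is that a client which is many revisions behind has to fetch many small diffs, whereas with the first option the server has to store one diff per past revision.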
Unfortunately this method would not be beneficial to repositories where everything changes at once (e.g. factory, build service KDE). Zsync should provide a better all-purpose solution. This experiment has enlightened me as to the potential benefits of metadata diffing though.
The input and output files, and the code used. Please note this was only written to investigate the potential gains and is not suitable for actual use.
There are also other, simpler ways to reduce the amount of data downloaded.
- Reduce data downloaded
Much of the data, such as the package changelogs, is irrelevant to the package management operation, and few users are likely to need to view changelogs of packages which are not yet installed.
- Include metadata on the install media
The metadata for the static repositories could be shipped along with the install media, so the user doesn’t have to download 50 MB of data at the end of the installation process.