Installed top5000 Python packages on AArch64

It was yet another boring week. I got some thanks for the work we did to get TensorFlow running on AArch64 and was asked whether any other Python project could use our help.

I already had some projects on my list of things to check. And then I found the “Top PyPI Packages” website…

Let's install 5000 Python packages

So I got an idea: grab the list of the top 5000 PyPI packages and check how many of them have issues on AArch64.

The plan was simple. Loop over the list and do three tasks for each package:

create virtualenv
install package
destroy virtualenv

This way each package had the same environment and I did not have to worry about version conflicts.

Virtualenv preparation

To avoid creating the virtualenv from scratch five thousand times, I did it once:

/opt/python/cp39-cp39/bin/python3 -mvenv venvs/test
. venvs/test/bin/activate
pip install -U pip wheel setuptools
deactivate
cp -a venvs/test venvs/clean

This way I had the newest versions of “pip”, “wheel” and “setuptools” as part of the virtualenv. For a bit of speed, the “venvs” directory was mounted as “tmpfs”.

Everything happened inside a “manylinux2014_aarch64” container (so CentOS 7 as the OS) with Python 3.9 as the interpreter.

Phase 1

The first phase was a simple installation of each package from the wheel available on PyPI:

pip install --only-binary :all: --no-compile ${name_of_package}

So if a package provided only a source tarball/zip, it was marked as failed.

There were 1569 packages which failed to pass this phase. Common issues (other than missing some development headers):

INFO: pip is looking at multiple versions of PACKAGE_NAME to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of <Python from Requires-Python> to determine which version is compatible with other requirements. This could take a while.
INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. See https://pip.pypa.io/warnings/backtracking for guidance. If you want to abort this run, press Ctrl + C.
ERROR: Cannot install OTHER_PACKAGE_NAME because these package versions have conflicting dependencies.
ERROR: Could not find a version that satisfies the requirement OTHER_PACKAGE_NAME (from versions: none)
ERROR: Could not find a version that satisfies the requirement OTHER_PACKAGE_NAME==N.V.R (from ANOTHER_PACKAGE_NAME) (from versions: x.y, x.y.z, x.z.z)
ERROR: No matching distribution found for OTHER_PACKAGE_NAME

Note that failure at this phase is allowed, as I only wanted ready-to-use wheel files here.

The whole process took about 20 hours on Honeycomb.

Phase 2

The main difference was dropping the “--only-binary” and “--no-compile” options from the pip install calls.

Still no additional development packages were installed. The cache from phase 1 was used to avoid re-downloading or re-building existing wheel files.

The main issue is how single-threaded pip install is. Never mind that Honeycomb has 16 CPU cores: only one is used (and it is a Cortex-A72, so nothing fancy). This makes build times longer than they should be:

Building wheels for collected packages: pandas, typing
  Building wheel for pandas (setup.py): started
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...

There were 313 packages which failed to pass this phase. Issues were similar to those in phase 1 with one exception (as building packages was allowed):

ERROR: Could not build wheels for OTHER_PACKAGE_NAME, which is required to install pyproject.toml-based projects

This phase took about 13 hours on Honeycomb.
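Sorting failed logs into rough buckets can be scripted. Below is a minimal sketch, not what I actually ran: the function name `classify_log` and the bucket labels are made up for illustration, the patterns come from the pip messages quoted in phases 1 and 2, and it assumes each log also contains pip's error output (stderr).

```shell
#!/bin/bash
# Hypothetical helper: print a rough category for one pip log file.
# Patterns are taken from the pip messages quoted in this post;
# the category labels are invented for this sketch.
classify_log() {
    local log="$1"
    if grep -q "^Successfully installed" "$log"; then
        echo "ok"
    elif grep -q "conflicting dependencies" "$log"; then
        echo "conflict"
    elif grep -q "No matching distribution found" "$log"; then
        echo "no-distribution"
    elif grep -q "Could not build wheels" "$log"; then
        echo "build-failed"
    else
        echo "other"
    fi
}
```

Running it over a directory of logs, for example with `for f in logs/*.log; do classify_log "$f"; done | sort | uniq -c`, would give a count per bucket.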

Phase 3

About 6% of the packages were left. Now it is time to install some development headers:

blas-devel
bzip2-devel
cairo-devel
cyrus-sasl-devel
gmp-devel
gobject-introspection-devel
graphviz-devel
gtk3-devel
httpd-devel
krb5-devel
lapack-devel
libcap-devel
libcurl-devel
libicu-devel
libjpeg-devel
libmemcached-devel
mariadb-devel
ncurses-devel
openldap-devel
openssl-devel
poppler-cpp-devel
postgresql-devel
protobuf-compiler
unixODBC-devel
xmlsec1-devel

I created this list by checking how packages failed to build. It should be longer, but CentOS 7 (the base of the “manylinux2014” container image) does not provide everything needed (for example, an up-to-date Rust compiler or LLVM).
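For anyone repeating this step, here is a small sketch of installing such a list in one go. It assumes the package names are saved one per line in a file (the name `devel-packages.txt` is hypothetical), and the `YUM` variable is my own addition so a different package manager, or `echo` for a dry run, can be swapped in.

```shell
#!/bin/bash
# Hypothetical helper: install every package named in a list file.
# Lines starting with '#' are treated as comments and skipped.
# YUM is an assumed override knob, e.g. YUM=echo for a dry run.
install_devel_packages() {
    local list="$1"
    grep -v '^#' "$list" | xargs ${YUM:-yum} install -y
}
```

Inside the container it would be called as `install_devel_packages devel-packages.txt`.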

Before starting the phase 3 run I removed all entries related to “pyobjc”, as they are macOS related, so there was no need to waste time on them again.

After 3.5 hours I had another 54 packages built.

Phase 4

Some packages are not present in CentOS 7 but are present in the EPEL repository. So after enabling EPEL (yum install -y epel-release) I installed another set of development packages:

augeas-devel
boost-devel
cargo
gdal-devel
leptonica-devel
leveldb-devel
suitesparse-devel
portaudio-devel
proj
protobuf-devel
rust
zbar-devel

Some of those packages should have been installed in the previous step. I did not catch them because the build processes failed earlier.

Before starting this round I went through the logs and removed everything:

failed with “No matching distribution found for PACKAGE_NAME”
failed with “use_2to3 is invalid” (aka “I need old setuptools”)
requiring Bazel
requiring tensorflow
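Part of that pruning can be automated. The sketch below is an assumption about how it could be done, not the exact commands I used: `filter_packages` is a hypothetical name, it only covers the two failures that leave a recognisable line in the log (the Bazel and tensorflow cases were easier to pick by hand), and it assumes the logs include pip's error output.

```shell
#!/bin/bash
# Hypothetical helper: print the packages worth retrying.
# usage: filter_packages LIST_FILE LOG_DIR
# A package is dropped when its log shows an error that installing
# more development packages cannot fix.
filter_packages() {
    local list="$1" logs="$2" package
    while read -r package; do
        [ -n "$package" ] || continue
        if [ -f "${logs}/${package}.log" ] && \
           grep -qE 'No matching distribution|use_2to3 is invalid' \
               "${logs}/${package}.log"; then
            continue
        fi
        echo "$package"
    done < "$list"
}
```

For example, `filter_packages failed.text logs > retry.text` (both file names are placeholders) would write the shortened list for the next round.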

At the end I had about one hundred packages that failed to build, for different reasons:

missing build dependencies
expecting newer libraries than “manylinux2014” (CentOS 7) has
not listing all dependencies (everyone has “numpy” installed, right?)
being Python 2.7 only
using removed modules or classes
breaking install to say “this module is deprecated, use OTHER_NAME”
not supporting AArch64 architecture

Summary

One hundred failures out of the top five thousand packages is two percent. There were 13 failures in the top 1000 and another 14 in the second thousand.
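For reference, the percentages behind the numbers in this post can be recomputed with a tiny helper. The function name `percent` is made up, and the final count of one hundred is the approximate figure given above, not an exact one.

```shell
#!/bin/bash
# Hypothetical helper: print PART as a percentage of TOTAL, two decimals.
percent() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f\n", part * 100 / total }'
}

percent 1569 5000   # packages without a usable wheel after phase 1
percent 313 5000    # still failing after phase 2
percent 100 5000    # roughly what was left at the end
```

The three calls print 31.38, 6.26 and 2.00.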

Is 2% an acceptable amount? I think it is. Some improvements can still be made, but nothing that requires urgent work showed up. OK, it would be nice to get TensorFlow for AArch64 released by upstream under the same name (instead of the “tensorflow_aarch64” builds done by the team at Linaro).

How to run it?

After my tweet I got several comments from people who wanted to run this test on other architectures, operating systems or devices. So I wrote a simple script:

#!/bin/bash

echo "cleanup after previous runs"
rm -rf venvs/* logs/*

echo "Prepare clean virtualenv"
python3 -mvenv venvs/test
. venvs/test/bin/activate
pip install -U pip wheel setuptools
deactivate
cp -a venvs/test venvs/clean

echo "fetch and prepare top5000 list"
rm -f top-pypi-packages-30-days.*
wget https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.json
grep project top-pypi-packages-30-days.json \
    |sed -e 's/"project": "\(.*\)"/\1/g' > top-pypi-packages-30-days.text

echo "go through packages"
mkdir -p logs
for package in $(cat top-pypi-packages-30-days.text); do
    echo "processing ${package}"
    # start every package from a pristine copy of the prepared virtualenv
    rm -rf venvs/test
    cp -a venvs/clean venvs/test
    source venvs/test/bin/activate
    # 2>&1 keeps pip's error messages in the log as well
    pip install --no-input \
        -U --upgrade-strategy=only-if-needed \
        "$package" 2>&1 | tee "logs/${package}.log"
    deactivate
    echo "-----------------------------------------------------------------"
done

It should work on any operating system capable of running Python and bash. All build dependencies need to be installed first. I suggest mounting “tmpfs” over the “venvs/” directory, as there will be a lot of temporary I/O going on there.

Once it finishes, just run grep to check how many packages were installed successfully:

grep "^Successfully installed" logs/*|wc -l
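The opposite question, which packages failed, can be answered from the same logs. This is a minimal sketch under the assumption that every package has a `logs/NAME.log` file as produced by the script above; `list_failed` is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: print the names of packages whose log has
# no "Successfully installed" line.
# usage: list_failed [LOG_DIR]   (defaults to "logs")
list_failed() {
    local logs="${1:-logs}" log
    for log in "$logs"/*.log; do
        [ -e "$log" ] || continue      # directory without any logs
        if ! grep -q "^Successfully installed" "$log"; then
            basename "$log" .log
        fi
    done
}
```

Something like `list_failed logs > failed.text` (the output file name is a placeholder) gives a list that can be fed into another round after installing more build dependencies.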

Please share your results. The contact page lists several ways to reach me.