Installed top5000 Python packages on AArch64

It was yet another boring week. I got some thanks for the work we did to get TensorFlow running on AArch64, and was asked whether there is any other Python project which could use our help.

I had some projects on my list of things to check already. And then I found the “Top PyPI Packages” website…

Let's install 5000 Python packages

So I got an idea: grab the list of the top 5000 PyPI packages and check how many of them have issues on AArch64.

The plan was simple. Loop over the list and, for each package, do three tasks: start from a fresh copy of a clean virtualenv, install the package into it, and store the log.

This way each package had the same environment and I did not have to worry about version conflicts.

Virtualenv preparation

To avoid repeating the creation of a virtualenv five thousand times, I did it once:

/opt/python/cp39-cp39/bin/python3 -mvenv venvs/test
. venvs/test/bin/activate
pip install -U pip wheel setuptools
deactivate
cp -a venvs/test venvs/clean

This way I had the newest versions of “pip”, “wheel” and “setuptools” as part of the virtualenv. For a bit of speed, the “venvs” directory was mounted as “tmpfs”.

Everything happened inside a “manylinux2014_aarch64” container (so CentOS 7 as the OS) with Python 3.9 as the interpreter.
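For anyone wanting to reproduce the setup, starting such a container could look like this (a sketch; the image name comes from the manylinux project, and the tmpfs size is an arbitrary choice of mine):

```shell
# Start a manylinux2014 AArch64 container (CentOS 7 based) with a
# tmpfs mounted where the short-lived virtualenvs will be created.
docker run -it --rm \
    --tmpfs /work/venvs:size=2g \
    -v "$PWD":/work -w /work \
    quay.io/pypa/manylinux2014_aarch64 bash
```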

Phase 1

The first phase was a simple installation of each package's wheel, as available on PyPI:

pip install --only-binary :all: --no-compile ${name_of_package}

So if a package provided only a source tarball/zip then it was marked as failed.

There were 1569 packages which failed to pass this phase. Common issues (other than missing development headers):

Note that failure at this phase is allowed, as I just wanted ready-to-use wheel files.

The whole process took about 20 hours on Honeycomb.

Phase 2

The main difference was getting rid of the “--only-binary” and “--no-compile” options from the pip install calls.

Still, no additional development packages were installed. The cache from phase 1 was used to avoid re-downloading/re-building existing wheel files.

The main issue is how single-threaded pip install is. Never mind that Honeycomb has 16 CPU cores; only one is used (and it is a Cortex-A72, so nothing fancy). This makes build times higher than they are supposed to be:

Building wheels for collected packages: pandas, typing
  Building wheel for pandas (setup.py): started
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
  Building wheel for pandas (setup.py): still running...
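pip itself will not use more than one core for a single build, but the work can be spread across packages instead. A rough sketch using xargs (the “build-one.sh” helper is hypothetical; it would recreate a clean virtualenv and run pip install for the one package name it receives):

```shell
# Run up to 4 package builds in parallel; xargs hands one package
# name to each invocation of the (hypothetical) build-one.sh helper.
xargs -n1 -P4 ./build-one.sh < top-pypi-packages-30-days.text
```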

There were 313 packages which failed to pass this phase. The issues were similar to those in phase 1, with one exception (as building packages was allowed):

This phase took about 13 hours on Honeycomb.

Phase 3

About 6% of the packages were left. Now it was time to install some development headers:

I created this list by checking how packages failed to build. It should be longer, but CentOS 7 (the base of the “manylinux2014” container image) does not provide everything needed (for example, an up-to-date Rust compiler or LLVM).
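As an illustration only, installing such headers in the container boils down to yum calls like the one below (the package names here are my examples of typical build dependencies, not the exact list used):

```shell
# Illustrative subset of CentOS 7 development packages; the real
# list was derived from the build failures seen in earlier phases.
yum install -y gcc gcc-c++ make \
    libffi-devel openssl-devel libxml2-devel libxslt-devel
```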

Before starting the phase 3 run I removed all entries related to “pyobjc” as they are macOS related, so there was no need to waste time on them again.

After 3.5 hours I had another 54 packages built.

Phase 4

Some packages are not present in CentOS 7 but are present in the EPEL repository. So after enabling EPEL (yum install -y epel-release) I installed another set of development packages:

Some of those packages should have been installed in the previous step. I did not catch them because the build processes failed earlier.

Before starting this round I went through the logs and removed everything:

At the end I had about one hundred packages that failed to build, for different reasons:

Summary

One hundred out of the top five thousand packages equals a two percent failure rate. There were 13 failures in the top 1000, and another 14 in the second thousand.

Is 2% an acceptable amount? I think it is. Some improvements can still be made, but nothing major. OK, it would be nice to get TensorFlow for AArch64 released by upstream under the same name (instead of the “tensorflow_aarch64” builds done by the team at Linaro).

How to run it?

After my tweet I got several comments, and people wanted to run this test on other architectures, operating systems or devices. So I wrote a simple script:

#!/bin/bash

echo "cleanup after previous runs"
rm -rf venvs/* logs/*

echo "Prepare clean virtualenv"
python3 -mvenv venvs/test
. venvs/test/bin/activate
pip install -U pip wheel setuptools
deactivate
cp -a venvs/test venvs/clean

echo "fetch and prepare top5000 list"
rm top-pypi-packages-30-days.*
wget https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.json
grep project top-pypi-packages-30-days.json \
    |sed -e 's/"project": "\(.*\)"/\1/g' > top-pypi-packages-30-days.text

echo "go through packages"
mkdir -p logs
for package in $(cat top-pypi-packages-30-days.text); do
    echo "processing ${package}"
    rm -rf venvs/test
    cp -a venvs/clean venvs/test
    source venvs/test/bin/activate
    pip install --no-input \
        -U --upgrade-strategy=only-if-needed \
        "${package}" 2>&1 | tee "logs/${package}.log"
    deactivate
    echo "-----------------------------------------------------------------"
done
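The grep/sed extraction above works, but is fragile against any change in JSON formatting. A sturdier sketch using Python's json module (assuming the file keeps its current layout: a top-level “rows” list of objects with a “project” key):

```shell
# Parse the JSON properly instead of pattern-matching it.
python3 -c '
import json, sys
with open(sys.argv[1]) as f:
    data = json.load(f)
for row in data["rows"]:
    print(row["project"])
' top-pypi-packages-30-days.json > top-pypi-packages-30-days.text
```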

It should work on any operating system capable of running Python. All build dependencies need to be installed first. I suggest mounting “tmpfs” over the “venvs/” directory as there will be a lot of temporary I/O going on there.
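Mounting “tmpfs” there is a single command (the size value is an arbitrary example; pick one that fits available RAM):

```shell
# Keep the constant virtualenv churn in RAM instead of on disk.
mkdir -p venvs
sudo mount -t tmpfs -o size=2g tmpfs venvs
```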

Once it finishes, just run grep to check how many packages were installed successfully:

grep "^Successfully installed" logs/*|wc -l
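The opposite question, which packages failed, can be answered with grep's “-L” option, which prints the names of files without a match:

```shell
# List log files lacking the success marker, i.e. failed packages.
grep -L "^Successfully installed" logs/*.log
```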

Please share your results. My contact page lists several ways to reach me.

aarch64 linaro python