It was yet another boring week. I got some thanks for the work we did to get Tensorflow running on AArch64 and was asked whether there was any other Python project which could use our help.
I had some projects already on my list of things to check. And then I found the “Top PyPI Packages” website…
Let's install 5000 Python packages
So I got an idea: grab the list of the top 5000 PyPI packages and check how many of them have issues on AArch64.
The plan was simple. Loop over the list and do three tasks for each package:
- create virtualenv
- install package
- destroy virtualenv
This way each package had the same environment and I did not have to worry about version conflicts.
Virtualenv preparation
To avoid creating the virtualenv five thousand times, I built it once and kept a clean copy:
/opt/python/cp39-cp39/bin/python3 -mvenv venvs/test
. venvs/test/bin/activate
pip install -U pip wheel setuptools
deactivate
cp -a venvs/test venvs/clean
This way I had the newest versions of “pip”, “wheel” and “setuptools” as part of the virtualenv. For a bit of speed the “venvs” directory was mounted as “tmpfs”.
Everything happened inside the “manylinux2014_aarch64” container (so CentOS 7 as the OS) with Python 3.9 as the interpreter.
Phase 1
The first phase was a simple installation of the package wheel as it was available on PyPI:
pip install --only-binary :all: --no-compile ${name_of_package}
So if a package provided only a source tarball/zip then it was marked as failed.
There were 1569 packages which failed to pass this phase. Common issues (other than missing some development headers):
INFO: pip is looking at multiple versions of PACKAGE_NAME to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of <Python from Requires-Python> to determine which version is compatible with other requirements. This could take a while.
INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. See https://pip.pypa.io/warnings/backtracking for guidance. If you want to abort this run, press Ctrl + C.
ERROR: Cannot install OTHER_PACKAGE_NAME because these package versions have conflicting dependencies.
ERROR: Could not find a version that satisfies the requirement OTHER_PACKAGE_NAME (from versions: none)
ERROR: Could not find a version that satisfies the requirement OTHER_PACKAGE_NAME==N.V.R (from ANOTHER_PACKAGE_NAME) (from versions: x.y, x.y.z, x.z.z)
ERROR: No matching distribution found for OTHER_PACKAGE_NAME
Note that failure at this phase is allowed, as here I just wanted ready-to-use wheel files.
The whole process took about 20 hours on Honeycomb.
Phase 2
The main difference was getting rid of the “--only-binary” and “--no-compile” options from pip install calls.
Still no additional development packages were installed. The cache from phase 1 was used to avoid re-downloading or rebuilding existing wheel files.
The main issue is how single-threaded pip install is. Never mind that Honeycomb has 16 CPU cores: only one is used (and this is Cortex-A72, so nothing fancy). This makes build times higher than they should be:
Building wheels for collected packages: pandas, typing
Building wheel for pandas (setup.py): started
Building wheel for pandas (setup.py): still running...
Building wheel for pandas (setup.py): still running...
Building wheel for pandas (setup.py): still running...
Building wheel for pandas (setup.py): still running...
Building wheel for pandas (setup.py): still running...
Building wheel for pandas (setup.py): still running...
Building wheel for pandas (setup.py): still running...
Building wheel for pandas (setup.py): still running...
Building wheel for pandas (setup.py): still running...
Building wheel for pandas (setup.py): still running...
Building wheel for pandas (setup.py): still running...
There were 313 packages which failed to pass this phase. Issues were similar to those in phase 1 with one exception (as building packages was allowed):
ERROR: Could not build wheels for OTHER_PACKAGE_NAME, which is required to install pyproject.toml-based projects
This phase took about 13 hours on Honeycomb.
Phase 3
About 6% of the packages were left. Now it was time to install some development headers:
- blas-devel
- bzip2-devel
- cairo-devel
- cyrus-sasl-devel
- gmp-devel
- gobject-introspection-devel
- graphviz-devel
- gtk3-devel
- httpd-devel
- krb5-devel
- lapack-devel
- libcap-devel
- libcurl-devel
- libicu-devel
- libjpeg-devel
- libmemcached-devel
- mariadb-devel
- ncurses-devel
- openldap-devel
- openssl-devel
- poppler-cpp-devel
- postgresql-devel
- protobuf-compiler
- unixODBC-devel
- xmlsec1-devel
I created this list by checking how packages failed to build. It should be longer, but CentOS 7 (the base of the “manylinux2014” container image) does not provide everything needed (for example an up-to-date Rust compiler or LLVM).
Before starting the phase 3 run I removed all entries related to “pyobjc”, as they are macOS-specific, so there was no need to waste time on them again.
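Dropping those entries can be sketched as a simple filter. This assumes the list of packages to retry is a plain text file with one name per line and that all relevant names start with “pyobjc”. The function name and the file names in the usage comment are made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: print a package list without pyobjc entries.
# Assumes one package name per line.
drop_pyobjc() {
    # -i: package names on PyPI are case-insensitive
    # -v: keep only lines that do NOT match
    grep -iv '^[[:space:]]*pyobjc' "$1"
}

# Usage (file names are placeholders):
#   drop_pyobjc failed-phase2.txt > todo-phase3.txt
```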
After 3.5 hours I had another 54 packages built.
Phase 4
Some packages are not present in CentOS 7 but are present in the EPEL repository. So after enabling EPEL (yum install -y epel-release) I installed another set of development packages:
- augeas-devel
- boost-devel
- cargo
- gdal-devel
- leptonica-devel
- leveldb-devel
- suitesparse-devel
- portaudio-devel
- proj
- protobuf-devel
- rust
- zbar-devel
Some of those packages should have been installed in the previous step. I did not catch them because the build processes failed earlier.
Before starting this round I went through the logs and removed everything:
- failed with “No matching distribution for PACKAGE_NAME”
- failed with “use_2to3 is invalid” (aka “I need old setuptools”)
- requiring Bazel
- requiring tensorflow
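The first two checks from that list could be automated roughly as below. This is only a sketch: it assumes one log file per failed package named logs/PACKAGE.log, that the directory holds logs of failed packages only, and that the two quoted messages appear in the logs. The post gives no exact messages for the Bazel and tensorflow cases, so those are left out. The function name is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: list packages whose logs do NOT contain one of the
# "not worth retrying" messages. Assumes <logdir>/<package>.log naming.
packages_worth_retrying() {
    local logdir="$1"
    # -L prints the names of files in which none of the patterns match
    grep -L -E \
        -e 'No matching distribution' \
        -e 'use_2to3 is invalid' \
        "$logdir"/*.log \
    | sed -e 's|.*/||' -e 's|\.log$||'
}

# Usage (output file name is a placeholder):
#   packages_worth_retrying logs > todo-phase4.txt
```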
At the end I had about one hundred packages which failed to build, for different reasons:
- missing build dependencies
- expecting newer libraries than “manylinux2014” (CentOS 7) has
- not listing all dependencies (everyone has “numpy” installed, right?)
- being Python 2.7 only
- using removed modules or classes
- breaking install to say “this module is deprecated, use OTHER_NAME”
- not supporting AArch64 architecture
Summary
One hundred out of the top five thousand packages means a two percent failure rate. There were 13 failures in the top 1000 and another 14 in the second thousand.
Is 2% an acceptable amount? I think that it is. Some improvements can still be made but nothing urgent showed up. OK, it would be nice to get Tensorflow for AArch64 released by upstream under the same name (instead of the “tensorflow_aarch64” builds done by the team at Linaro).
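The percentages above are easy to re-derive from the counts given in this post. A small sketch (the function name is made up, and awk is used only because bash has no floating-point arithmetic):

```bash
#!/bin/bash
# Hypothetical helper: failure rate as a percentage with one decimal place.
failure_rate() {
    awk -v failed="$1" -v total="$2" \
        'BEGIN { printf "%.1f\n", 100 * failed / total }'
}

failure_rate 313 5000   # left after phase 2: 6.3
failure_rate 100 5000   # left at the end:    2.0
failure_rate 13 1000    # top 1000:           1.3
failure_rate 14 1000    # second thousand:    1.4
```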
How to run it?
After my tweet I got several comments from people who wanted to run this test on other architectures, operating systems or devices. So I wrote a simple script:
#!/bin/bash
echo "cleanup after previous runs"
rm -rf venvs/* logs/*
echo "Prepare clean virtualenv"
python3 -mvenv venvs/test
. venvs/test/bin/activate
pip install -U pip wheel setuptools
deactivate
cp -a venvs/test venvs/clean
echo "fetch and prepare top5000 list"
# -f: do not complain on the first run, when there is nothing to remove
rm -f top-pypi-packages-30-days.*
wget https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.json
grep project top-pypi-packages-30-days.json \
    | sed -e 's/"project": "\(.*\)"/\1/g' > top-pypi-packages-30-days.text
echo "go through packages"
mkdir -p logs
for package in $(cat top-pypi-packages-30-days.text); do
    echo "processing ${package}"
    # start every package from the same clean virtualenv
    rm -rf venvs/test
    cp -a venvs/clean venvs/test
    source venvs/test/bin/activate
    pip install --no-input \
        -U --upgrade-strategy=only-if-needed \
        "${package}" | tee "logs/${package}.log"
    deactivate
    echo "-----------------------------------------------------------------"
done
It should work on any operating system with bash, wget and Python available. All build dependencies need to be installed first. I suggest mounting “tmpfs” over the “venvs/” directory as there will be a lot of temporary I/O going on there.
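One detail of the script: the grep and sed step leaves leading whitespace in front of each name, which the for loop tolerates thanks to word splitting. A slightly stricter variant is sketched below. It assumes the JSON file keeps every "project": "name" pair on its own line, which the original grep relies on as well. The function name is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print one bare project name per line.
# Assumes each "project": "name" pair sits on its own line in the JSON file.
extract_projects() {
    # -n plus the p flag: print only lines where the substitution happened
    sed -n 's/.*"project": *"\([^"]*\)".*/\1/p' "$1"
}

# Usage:
#   extract_projects top-pypi-packages-30-days.json > top-pypi-packages-30-days.text
```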
Once it finishes, just run grep to check how many packages were installed successfully:
grep "^Successfully installed" logs/*|wc -l
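To see which packages failed rather than how many succeeded, the same check can be inverted. A minimal sketch, assuming the logs/PACKAGE.log naming used by the script (the function name is made up). Packages that are already part of the clean virtualenv, such as pip itself, may show up here too, because pip may report them as already satisfied instead of printing a “Successfully installed” line.

```bash
#!/bin/bash
# Hypothetical helper: print names of packages whose log lacks pip's
# "Successfully installed" line. Assumes <logdir>/<package>.log naming.
failed_packages() {
    local logdir="${1:-logs}"
    # -L prints the names of files without any matching line
    grep -L '^Successfully installed' "$logdir"/*.log \
    | sed -e 's|.*/||' -e 's|\.log$||'
}

# Usage:
#   failed_packages logs | wc -l
```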
Please share your results. The contact page lists several ways to reach me.