From a diary of AArch64 porter — firefighting

This post is part 9 of the "From a diary of AArch64 porter" series:

  1. From a diary of AArch64 porter — autoconf
  2. From a diary of AArch64 porter — rpm packaging
  3. From a diary of AArch64 porter — testsuites
  4. From a diary of AArch64 porter — POSIX.1 functionality
  5. From a diary of AArch64 porter — PAGE_SIZE
  6. From a diary of AArch64 porter — vfp precision
  7. From a diary of AArch64 porter — system calls
  8. From a diary of AArch64 porter — parallel builds
  9. From a diary of AArch64 porter — firefighting
  10. From a diary of AArch64 porter — drive-by coding
  11. From a diary of AArch64 porter — manylinux2014
  12. From a diary of AArch64 porter — handling big patches
  13. From a diary of AArch64 porter — Arm CPU features table

When I was a kid, there was a children’s book about Wojtek who wanted to be a firefighter. It is part of the culture for my generation.

I never wanted to follow Wojtek’s dream. But during the last years I became a firefighter. And this is not a good thing in the long term.

CI failures

During the last months we (Linaro) took care of AArch64 support in the OpenStack infrastructure. There are nodes with CentOS 7 and 8, Debian ‘stretch’ and ‘buster’, and Ubuntu ‘xenial’, ‘bionic’ and ‘focal’. And several CI jobs in some projects (Disk Image Builder, Kolla, Nova and a few others).

And those CI jobs tend to fail. As usual, right? Not quite…

Missing Python packages

One day, when I joined IRC in the morning, I was greeted with “aarch64 ci fails — can you take a look?” from one of the project developers.

A quick look into the logs:

ERROR: Could not find a version that satisfies the requirement ntplib (from versions: none)
ERROR: No matching distribution found for ntplib

Then the usual path: PyPI, check the release and its history, go to the homepage, file an issue. Curse. And wait for upstream to fix the problem. They fixed it, and CI was working again.
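
The “check release” step boils down to looking at which files the release actually ships. One quick way, using PyPI’s JSON API (‘jq’ assumed to be installed):

# list files of the latest ntplib release; no aarch64 wheel and no
# sdist means pip has nothing to install on our nodes
curl -s https://pypi.org/pypi/ntplib/json | jq -r '.urls[].filename'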

Last Monday I started work ready to do something interesting. And then I was greeted with the same story: “aarch64 ci fails, can you take a look?”.

ERROR: Could not find a version that satisfies the requirement protobuf==3.12.0 
(from versions: 2.0.0b0, 2.0.3, 2.3.0, 2.4.1, 2.5.0, 2.6.0, 2.6.1, 3.0.0a2,
3.0.0a3, 3.0.0b1, 3.0.0b1.post1, 3.0.0b1.post2, 3.0.0b2, 3.0.0b2.post1,
3.0.0b2.post2, 3.0.0b3, 3.0.0b4, 3.0.0, 3.1.0, 3.1.0.post1, 3.2.0rc1,
3.2.0rc1.post1, 3.2.0rc2, 3.2.0, 3.3.0, 3.4.0, 3.5.0.post1, 3.5.1, 3.5.2,
3.5.2.post1, 3.6.0, 3.6.1, 3.7.0rc2, 3.7.0rc3, 3.7.0, 3.7.1, 3.8.0rc1, 3.8.0,
3.9.0rc1, 3.9.0, 3.9.1, 3.9.2, 3.10.0rc1, 3.10.0, 3.11.0rc1, 3.11.0rc2, 3.11.0,
3.11.1, 3.11.2, 3.11.3)

PyPI, check, homepage: there was an issue already filed. So far no upstream response.

The problem got solved by moving all OpenStack projects back to the previous (3.11.3) release.
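
In OpenStack, pins like this land in the ‘requirements’ project and its upper-constraints.txt file. Roughly how it looks (the exact lines here are my sketch, not the actual change):

# upper-constraints.txt: exact pin that every project installs against
protobuf===3.11.3

# consumers apply it via pip’s constraints option
pip install -c upper-constraints.txt -r requirements.txt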

Missing/deprecated versions in distributions

So I started work on adding another CI job, this time for the ‘requirements’ OpenStack project, to make sure that any Python package upgrade will be available on AArch64 as well.

As usual, I had to add a pile of distro dependencies to get ‘numpy’ and ‘scipy’ built correctly, and bump the timeout to 2 hours. The build was going nicely.
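
For reference, on a Debian or Ubuntu node that pile looks more or less like this (an assumed package list; the exact names differ per release):

# build dependencies for compiling numpy/scipy from source
apt-get install -y build-essential gfortran python3-dev \
    libopenblas-dev liblapack-dev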

And then ‘confluent_kafka’ hit hard:

/tmp/pip-install-ld4fzu94/confluent-kafka/confluent_kafka/src/confluent_kafka.h:65:2: 
error: #error "confluent-kafka-python requires librdkafka v1.4.0 or later.
Install the latest version of librdkafka from the Confluent repositories, 
see http://docs.confluent.io/current/installation.html"                                                            

“librdkafka v1.4.0 or later” is not available in any distribution used on OpenStack infra nodes. Fantastic!

And the repositories mentioned in the error message are x86-64 only. Fun!

Sure, I can do the usual: PyPI, release, homepage, issue. I even went that way. A chain of GitHub projects building components into repositories. x86-64 only all the way.
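
The only workaround left is building librdkafka from source before installing the Python binding. A sketch, assuming the v1.4.0 tag named in the error message:

# build and install librdkafka v1.4.0 system-wide
git clone --branch v1.4.0 https://github.com/edenhill/librdkafka.git
cd librdkafka
./configure
make
sudo make install
sudo ldconfig    # refresh the shared library cache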

External CI services

Which gets us to another thing: external CI services. You know: Azure Pipelines, GitHub Actions or Travis CI (alphabetical order). Used in miscellaneous ways by most FOSS projects nowadays.

Each of them has some kind of AArch64 support nowadays. But it looks like only Travis provides you with hardware. Azure and GitHub can only connect your external machines to their CI service.

Speed or lack of it

So you have a project where you need AArch64 support and upstream is already using Travis for their CI needs. Lucky you!

You work with the project developers, you get the test suite running, and then it times out. It does not matter that the CI machine you got has a few cores, because ‘pytest’ does not know how to run tests in parallel out of the box.
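
In theory the pytest-xdist plugin can spread tests across all cores; a minimal sketch, assuming the suite is safe to run concurrently (many are not):

# run the suite in parallel, one worker per CPU core
pip install pytest-xdist
pytest -n auto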

So you cut tests completely or partially. Or just abandon the idea.

No hardware, no binaries

If you are less lucky then you may get such an answer from upstream. I got it a few times in the past. And I fully understand them: why support something when you cannot even test whether it works?

Firefighting or yak shaving?

When I discussed it with friends, one of them mentioned that this reminds him more of yak shaving than of firefighting. To be honest, it is both most of the time.

aarch64 debian development fedora linaro python