After the Firewalld issues were fixed, we still needed to get RabbitMQ itself up and running again. My colleague who knows the setup was otherwise engaged, so it was left as an exercise for me. Sadly, I was failing the exercise…

It didn’t take long to figure out there was a bug in the entrypoint script of our RabbitMQ container - how did this ever work? But even with the bug fixed, RabbitMQ still wouldn’t start. So I googled what was probably the most frequent error message (nodedown).
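
For context: “nodedown” is what rabbitmqctl reports whenever it cannot reach the broker at all, which makes it more a symptom than a diagnosis. Abridged, the output looks roughly like this:

    $ rabbitmqctl status
    Status of node rabbit@myhost ...
    Error: unable to connect to node rabbit@myhost: nodedown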

Then my colleague was back and immediately saw what was wrong. Apparently our RabbitMQ image, named “rabbitmq-34”, had been running v3.6, but had recently been rebuilt by our CI with version 3.5 (as defined in the Dockerfile). Because of my actions, I had pulled the newer image containing the older version (but still not 3.4, mind you ;-))
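
In hindsight, this was cheap to verify: the version an image actually ships can be read off its filesystem without booting a broker. A minimal sketch, assuming the usual Debian install layout (that path is my assumption, not a detail of our setup):

    # the installed server directory carries the version in its name
    $ docker run --rm rabbitmq-34 ls /usr/lib/rabbitmq/lib
    rabbitmq_server-3.6.0    # illustrative output: a 3.6 build, despite the image name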

The data stored locally was (rightfully) rejected by the daemon - a 3.5 broker cannot use data written by 3.6 - and so the processes threw errors. Those errors were not clear to me (or I was looking in the wrong places), and there was really no reason I should have suspected a version mismatch.

That leaves the question: how did this newer version get into our registry, and how did this setup ever work, considering the bug I found? To investigate, I logged onto our production servers and checked. Those were running a different image, “rabbitmq-34:production”. Entering the container, I saw that its entrypoint script was bug-free and its RabbitMQ version was 3.5.
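
Incidentally, a quicker way to spot such divergence than entering containers is to compare image digests - two tags of the same repository pointing at different content will show different digests (assuming both tags have been pulled locally):

    $ docker images --digests | grep rabbitmq-34
    # same repository, different DIGEST per tag => different images behind the tags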

Part of the mystery was solved: at some point, someone had manually built an image with the entrypoint script fixed and v3.5, pushed it to our test tag and upgraded the test cluster with the new image. They probably verified the test cluster, then pushed the test tag to production and upgraded production as well. However, they “forgot” to commit their changes to the git repo used by our Jenkins.

At some later point, someone (the same or a different person) probably decided to give RabbitMQ 3.6 a try, manually built an image for it, pushed it to our test tag and deployed it on our test cluster.

Recently, Jenkins had some reason to rebuild the test image - without the bugfix, which had never been committed. My messing around yesterday got that broken image deployed on our test cluster, and so I was stuck with an older version of RabbitMQ and a buggy entrypoint script - built exactly according to the git repository :-P

Takeaways:

  • spend more time urging my team to commit and push changes that they want to keep
  • explain that, by default, git submodules “track” a fixed revision, not “master” or any other branch
    • changes pushed to a submodule will not be picked up by eg. Jenkins unless the parent repo is explicitly updated to point at the new commit (see the sketch after this list)
  • don’t push manually to the registry - let Jenkins do this
    • I should urgently set up ACLs on our production registry
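
To make the submodule point concrete, here is a minimal sketch (the paths are made up) of why a fix pushed inside a submodule stays invisible to CI:

    # inside the submodule: the fix is committed and pushed upstream
    cd deploy/rabbitmq                 # hypothetical submodule path
    git commit -am "fix entrypoint script"
    git push origin master

    # ...but the parent repo still pins the *old* submodule commit, so
    # Jenkins keeps building without the fix until the pointer is bumped:
    cd ../..
    git add deploy/rabbitmq            # stage the new submodule commit
    git commit -m "bump rabbitmq submodule"
    git push                           # only now does Jenkins see the fix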