The release engineering team triages tasks flagged Release-Engineering-Team on a weekly basis. It is an all-hands-on-deck, one-hour meeting in which we pick tasks one by one and decide what to do with them. We started with more than a hundred of them and are now down to just a dozen or so, most filed since the last meeting.
I have been doing those routine triages for the projects I closely manage, often on Friday afternoons. I have recently started being a bit more serious about it and even set aside a couple of weeks entirely dedicated to acting on the backlog. This post summarizes some of my discoveries. It will hopefully inspire readers to tackle their own backlogs and technical debt, and in the end we will hopefully have improved our ecosystem.
Tasks you have filed
I keep filing tasks rather than taking notes or writing emails. I find the Phabricator interface convenient since it lets me flag a task with whatever labels I want (Technical-Debt, Documentation, MediaWiki-General) and subscribe individuals or even a whole team. It is great. With time those tasks pile up and it is easy to forget old ones, so they have to be revisited from time to time. It is as easy as searching for any open tasks I have filed and ordering them by creation date:
Order By: Creation (Oldest First)
The first bug in the list is the oldest you have created and most probably deserves to be acted on. From there, pick the tasks one by one.
Some will surely be obsolete since they have been acted on or the underlying infrastructure has entirely changed. An example of a six-year-old task I declined is T100099, which followed a meeting about deploying MediaWiki services to Beta-Cluster-Infrastructure. The task was partially achieved for a few services (notably Parsoid) and was left open since we never moved all services to the same system. Nowadays developers deploy a Docker image and restart the Docker container. The notes are obsolete and the task thus no longer has any purpose.
T149924 came from deploying static web assets using git directly to /srv. However, the partition also hosted dynamically generated content, such as all the content from https://doc.wikimedia.org/ and https://integration.wikimedia.org/, or state from a CI daemon. This is a problem whenever we reimage the server, especially during OS upgrades, which we do every two years, and the task history reflects that:
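The same oldest-first ordering can be reproduced on any exported list of tasks. Here is a minimal sketch, assuming each task is a record with an identifier, a status and a creation date. The `Task` type, its field names and the sample values are hypothetical and are not taken from Phabricator's API:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; the field names are illustrative only."""
    task_id: str
    status: str
    created: date


def oldest_open_first(tasks):
    """Return the open tasks ordered by creation date, oldest first."""
    open_tasks = [t for t in tasks if t.status == "open"]
    return sorted(open_tasks, key=lambda t: t.created)


# Made-up sample data for illustration.
sample = [
    Task("T3", "open", date(2019, 4, 1)),
    Task("T1", "open", date(2013, 5, 7)),
    Task("T2", "resolved", date(2015, 1, 9)),
]

for task in oldest_open_first(sample):
    print(task.task_id, task.created.isoformat())
```

Filtering before sorting keeps already resolved tasks out of the way, so the first entry is always the oldest one still waiting for a decision.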
- Filed in 2016 after an OS upgrade
- The part affecting https://integration.wikimedia.org/ was partially addressed in 2018 as part of an OS upgrade
- In 2020 we had yet another OS upgrade and this time I decided to complete the task
I completed it because that task showed up in my list of oldest bugs. It thus kept showing up whenever I did the triage, and that was an incentive to get it gone. We are in much better shape: the services have been decoupled onto different machines and the static assets are deployed using our deployment tool, Scap.
Check your projects
Besides your team projects, you surely have pet side projects or legacy tags you might want to revisit. They can be found by searching for the projects you are a member of (assuming you made yourself a member): https://phabricator.wikimedia.org/project/query/JS0zmX.yalpI/#R
For example, I introduced Doxygen to generate the MediaWiki PHP documentation and git-review to assist interactions with Gerrit, for which bugs are tracked in a column of the Gerrit project, and I am probably the only one actively acting on those tasks.
You can again list tasks filed against each project sorted by creation date, and since you are a member of the project you will most probably be able to act on those old tasks.
One of the oldest tasks I had was T48148, which is to hide CI or robot comments from Gerrit changes. The task was filed in 2013. I found the solution proposed by upstream back in 2019 and, well, *cough*, forgot about it. Since I encountered the task during a triage, I went on to tackle it, and in short the required change boils down to adding a single line to the CI configuration:
  gerrit:
    verified: 2
+   tag: autogenerated:ci
That took almost 9 months, since I was not actively triaging old tasks.
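To make the effect of that tag more concrete: Gerrit lets a review carry a `tag`, and tags starting with `autogenerated:` mark the message as coming from an automated system, so the web interface can filter it out. Below is a minimal sketch of such a review payload, assuming it is shaped like the ReviewInput entity of Gerrit's REST API. The helper name and the message text are hypothetical, and this is not necessarily how our CI submits its votes:

```python
import json


def build_ci_review(message, verified, tag="autogenerated:ci"):
    """Build a ReviewInput-style payload tagged as a robot comment.

    Hypothetical helper: it only assembles the payload, it does not send it.
    """
    return {
        "message": message,
        "labels": {"Verified": verified},
        "tag": tag,
    }


payload = build_ci_review("Build succeeded.", 2)
print(json.dumps(payload, indent=2))
```

The vote and the message are unchanged. Only the extra `tag` field tells Gerrit that the comment was not written by a human.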
Just like we have the generic Documentation tag for any task relating to documentation, we have Technical-Debt to mark a task as requiring extra effort to bring us to modernity. When triaging your own tasks or those of your projects, you can flag them as technical debt to easily find them later on.
Some tasks can immediately be flagged as technical debt. That was the case for T141324, which is to send the logs of the Gerrit code review system to Logstash and thus make them easier to dig through or discover. Sounds simple? Well, not that much.
The story is a bit complicated, but in short Gerrit is a Java application, our team does not necessarily have much experience with it, and the state of Java logging is a bit unclear (Gerrit uses log4j). Luckily we had some support from actual Java developers and managed to get some logs injected. Though the fields were not properly formatted, it was progress.
After I got assigned as the primary maintainer of our Gerrit setup, I definitely needed proper logging. When we upgraded Gerrit to 3.2, the library we used to format the logs as JSON was no longer provided by upstream, forcing us to maintain a fork of Gerrit just for that purpose.
Luckily upstream has made improvements and I found out that Gerrit supports JSON logging out of the box, while our logging infrastructure learned to ingest JSON logs. We even got as far as supporting the Elastic Common Schema to use predefined field names.
That task had been technical debt for 5 years, but since I kept seeing it I kept remembering it and managed to address it.
Some tasks cannot be acted on because they depend on an upstream change that might be delayed for some reason. A massive issue we have encountered since at least 2015 was slowness when doing a git fetch from our busiest repository. I previously blogged about it (Blog Post: Faster source code fetches thanks to git protocol version 2); Google addressed it by proposing version 2 of the git protocol. It was one of the incentives for us to upgrade Gerrit, and as soon as we upgraded I made a point of testing the fix and making it well known to our developers (do use protocol.version=2 in your .gitconfig).
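For readers who prefer to script that setting, here is a minimal sketch that builds the two relevant git invocations: one that persists `protocol.version=2` in the user's configuration and one that forces it for a single fetch. The helper names and the default `origin` remote are assumptions made for illustration, and the commands are only printed, never executed:

```python
import shlex


def protocol_v2_config_command(scope="--global"):
    """Return the git command that persists protocol.version=2."""
    return ["git", "config", scope, "protocol.version", "2"]


def fetch_with_protocol_v2(remote="origin"):
    """Return a one-off fetch command that forces protocol v2 via -c."""
    return ["git", "-c", "protocol.version=2", "fetch", remote]


# Print the commands instead of running them, so nothing is modified.
for argv in (protocol_v2_config_command(), fetch_with_protocol_v2()):
    print(shlex.join(argv))
```

The `-c` form is handy for comparing fetch times with and without the new protocol before changing the configuration for good.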
When processing old tasks, you can find it hard to tackle the ones that require focus for a few days if not weeks, as in the examples above. But there are also a bunch of annoying little tasks that are surprisingly easy to solve and give an immediate reward. The positive feedback loop puts you in the mood to find more easy tasks and thus reduce your backlog. A few more examples:
T221510, filed in 2019 and addressed two years later, requested exposing a machine-readable test coverage report. The file was there (clover.xml); it was simply not exposed in the web page. A simple <a href="clover.xml">clover.xml</a> is the only thing that was required.
My favorite tasks are obviously the ones that have already been solved and are just pending the paperwork to mark them resolved. T138653 was for a user unable to log in to Gerrit due to a duplicate account. Three years after it had been filed, the user reported being able to log in properly and I marked it resolved one hour later. I guess that user was grooming their old tasks as well.
And finally, some old tasks might not be worth fixing. We are probably too kind with those and should be stricter about declining very old tasks. An example is T63733: the MediaWiki source code is deployed to the Wikimedia production cluster under a directory named php-<version>. Surely the php- prefix does not offer any meaningful information. However, since it is hardcoded in various places and removing it would require moving files around on the whole fleet of servers, it might be a bit challenging and would definitely be a risky change. Should we drop that useless prefix? For sure. Is it worth facing an outage and possibly multiple degraded services? Definitely not, and I have thus just declined it.
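To illustrate why exposing that file matters, here is a minimal sketch of a script consuming such a report. It assumes the usual Clover layout, in which a project-level `metrics` element carries `statements` and `coveredstatements` attributes. The sample document is made up and heavily trimmed, and the function name is hypothetical:

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed report used only for illustration.
SAMPLE_CLOVER = """<coverage generated="0">
  <project timestamp="0">
    <metrics statements="200" coveredstatements="150"/>
  </project>
</coverage>
"""


def statement_coverage(clover_xml):
    """Return the project-wide statement coverage as a percentage."""
    root = ET.fromstring(clover_xml)
    metrics = root.find("./project/metrics")
    if metrics is None:
        raise ValueError("no project-level <metrics> element found")
    total = int(metrics.get("statements", "0"))
    covered = int(metrics.get("coveredstatements", "0"))
    # Avoid dividing by zero on an empty report.
    return 100.0 * covered / total if total else 0.0


print(f"{statement_coverage(SAMPLE_CLOVER):.1f}%")
```

Once the link is on the page, a tool only needs the URL of clover.xml to track coverage over time instead of scraping the HTML report.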
Great post @hashar! I propose a friendly competition about the oldest task that one has created that's still open (https://phabricator.wikimedia.org/maniphest/query/Wws2E0C7IaFd/#R ). For me it's T59302 from 2013.