Grafana "MW deploy" "Train deployments" annotations broken on some dashboards
Open, Needs Triage, Public

Description

Some dashboards have a checkbox to overlay deployment events (MW deploy and/or Train deployments). I noticed today that they no longer work. Example dashboards:

The annotation settings for application-servers-red-k8s (https://grafana-rw.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?editview=annotations) show that both annotations are disabled:

grafana_annotations_deployments.png (334×933 px, 38 KB)

They both mention Graphite as a datasource, which I think is to be decommissioned (T228380). Their query field is empty.

Event Timeline

colewhite renamed this task from Graphana no more shows "MW deploy" "Train deployments" annotations to Grafana "MW deploy" "Train deployments" annotations broken on some dashboards. Tue, Dec 2, 4:05 PM
colewhite subscribed.

The replacement for this annotation tool is to use the Public Logs datasource in Grafana which is backed by Loki. Please let us know if the Observability team can be of further assistance.

Scap has:

scap/main.py
    def increment_stat(self, stat, all_stat=True, value=1):
        """Increment a stat in deploy.*

        :param stat: String name of stat to increment
        :param all_stat: Whether to increment deploy.all as well
        :param value: How many to increment by, default of 1 is normal
        """
        self.get_stats().increment("deploy.%s" % stat, value)
        if all_stat:
            self.get_stats().increment("deploy.all", value)

...
    def _before_exit(self, exit_status):
        if self.config:
            self.get_stats().timing("scap.scap", self.get_duration() * 1000)
        return exit_status

There are only three usages of increment_stat():

scap/main.py:        self.increment_stat("scap")
scap/main.py:        self.increment_stat("sync-file")
scap/main.py:        self.increment_stat("sync-wikiversions")
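
To make the resulting metric names concrete, here is a minimal sketch of what those three calls would emit. It assumes the stats client only needs an increment(name, value) method; RecordingStats, FakeScap and emitted_metrics are hypothetical names made up for illustration, and the real client returned by get_stats() is not shown in the excerpt above.

```python
from collections import Counter


class RecordingStats:
    """Hypothetical stand-in for the stats client; it only counts increments."""

    def __init__(self):
        self.counters = Counter()

    def increment(self, name, value=1):
        self.counters[name] += value


class FakeScap:
    """Hypothetical wrapper that reuses the increment_stat() logic quoted above."""

    def __init__(self):
        self._stats = RecordingStats()

    def get_stats(self):
        return self._stats

    def increment_stat(self, stat, all_stat=True, value=1):
        self.get_stats().increment("deploy.%s" % stat, value)
        if all_stat:
            self.get_stats().increment("deploy.all", value)


def emitted_metrics(stat_names):
    """Call increment_stat() once per name and return the counters it touched."""
    app = FakeScap()
    for name in stat_names:
        app.increment_stat(name)
    return dict(app.get_stats().counters)


# The three call sites listed above.
metrics = emitted_metrics(["scap", "sync-file", "sync-wikiversions"])
for name in sorted(metrics):
    print(name, metrics[name])
```

Each call bumps its own deploy.* counter plus deploy.all, so the old Graphite annotations could only ever distinguish these three events.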

Metrics / logs

| Metric | Log/announce |
| --- | --- |
| deploy.scap | Finished scap sync-world: %s (duration: %s) % (message, duration) |
| deploy.sync-file | Synchronized %s: %s (duration: %s) % (file, message, duration) |
| deploy.sync-wikiversions | rebuilt and synchronized wikiversions files: %s % (message) |

Tentatively:

  • Train deployments was for deploy.scap and can now be mapped to the message Finished scap sync-world.
  • MW deploy was for deploy.sync-file, which AFAIK is no longer used and has been aliased to scap sync-world.

Note from @colewhite: the scap logs are available (requires NDA login) at https://grafana.wikimedia.org/goto/XO75YxZDR?orgId=1

Example:

| Time | Line |
| --- | --- |
| 2025-12-02 18:06:09 | Finished scap sync-world: https://gerrit.wikimedia.org/r/1208442 T407553 (duration: 06m 36s) |

scap.announce passes its args to log.info, which formats the message and loses the context data (message, duration).
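
The effect can be reproduced with the standard logging module. This is a minimal sketch, assuming announce() boils down to log.info(fmt, *args); the logger name, ListHandler and announce_lines are made up for illustration and are not part of scap.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that keeps the formatted lines, as a log shipper would."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        # With no formatter set, the default format is just the rendered message.
        self.lines.append(self.format(record))


def announce_lines(fmt, *args):
    """Log fmt % args at INFO level and return what a handler actually receives."""
    logger = logging.getLogger("scap.announce.example")  # made-up logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        logger.info(fmt, *args)
    finally:
        logger.removeHandler(handler)
    return handler.lines


lines = announce_lines(
    "Finished scap sync-world: %s (duration: %s)",
    "https://gerrit.wikimedia.org/r/1208442 T407553",
    "06m 36s",
)
print(lines[0])
```

The shipped line is a single string, so the message and duration are only recoverable by matching or parsing the text.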

Thus I guess we can create an annotation that looks for lines starting with Finished scap sync-world.

@Michael pointed to https://grafana-rw.wikimedia.org/d/vGq7hbnMz/special3a-homepage-and-suggested-edits as a dashboard with some working annotations.

For the Train deploy signal it uses {channel="scap"} |~ "rebuilt and synchronized wikiversions files" which is the deploy.sync-wikiversions metric. It has a disabled one for "MW deploys" that uses {channel="scap"} |~ "(?i)finished|synchronized" which should match up with the deploy.scap + deploy.sync-file signals.

I took a shot at fixing https://grafana-rw.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s. I think it works. I used {channel="scap"} |~ "Finished scap sync-world|Synchronized" for the MW deploy query. Just "(?i)finished|synchronized" seemed to pick up helmfile deployments, or maybe they were legacy scap3 things?
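
The difference between the two patterns can be checked offline. This is a rough sketch using Python's re module as a stand-in for Loki's `|~` line filter (an unanchored regex match); Loki uses RE2 syntax, which should behave the same for patterns this simple. The first sample line is the one quoted earlier in this task; the others are filled in from the format strings or made up, as marked in the comments.

```python
import re

# The two line-filter patterns discussed above.
NARROW = re.compile(r"Finished scap sync-world|Synchronized")
BROAD = re.compile(r"(?i)finished|synchronized")

SAMPLE_LINES = [
    # Quoted earlier in this task.
    "Finished scap sync-world: https://gerrit.wikimedia.org/r/1208442 T407553 (duration: 06m 36s)",
    # Built from the sync-file format string with placeholder values.
    "Synchronized example.php: example message (duration: 00m 10s)",
    # Built from the sync-wikiversions format string with a placeholder value.
    "rebuilt and synchronized wikiversions files: example message",
    # Made-up line standing in for output from some other deployment tool.
    "example-tool: deployment finished (made-up line)",
]


def matching_lines(pattern, lines):
    """Return the lines where the pattern matches anywhere, like an unanchored filter."""
    return [line for line in lines if pattern.search(line)]


narrow_hits = matching_lines(NARROW, SAMPLE_LINES)
broad_hits = matching_lines(BROAD, SAMPLE_LINES)
print(len(narrow_hits), "narrow matches;", len(broad_hits), "broad matches")
```

Because the broad pattern is case-insensitive, it also catches the lowercase "synchronized" in the wikiversions message and any unrelated line that contains "finished".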

For awareness, there's also a regression affecting annotations in Grafana: https://github.com/grafana/grafana/issues/110265

I noticed that behavior when I was working on the application-servers-red-k8s dashboard, but I didn't know it was a regression. It just seemed wrong. :)

@colewhite is there a way to search for all of the dashboards that have graphite annotations? I fixed the 3 that @hashar explicitly listed, but I'm not excited enough about this to want to manually check every dashboard.

I happened to have the tools handy. I'm guessing ~114 in total.

Graphite Dashboard Variables (Enabled)

Graphite Dashboard Variables (Disabled)
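
For anyone who wants to repeat the search, here is a rough sketch of how exported dashboard JSON could be checked for Graphite-backed annotations. It is not the tool used to produce the lists above. It assumes the usual dashboard layout where annotations live under annotations.list and each entry's datasource is either an object with a type field or a plain datasource name; matching on the name is a heuristic, and the sample dashboard is invented.

```python
def is_graphite(datasource):
    """Heuristically decide whether an annotation datasource refers to Graphite."""
    if isinstance(datasource, dict):
        return datasource.get("type") == "graphite"
    if isinstance(datasource, str):
        # Older dashboards may store only the datasource name (assumption).
        return "graphite" in datasource.lower()
    return False


def graphite_annotations(dashboard):
    """Return (name, enabled) pairs for annotations backed by Graphite."""
    annotations = dashboard.get("annotations") or {}
    found = []
    for entry in annotations.get("list") or []:
        if is_graphite(entry.get("datasource")):
            found.append((entry.get("name"), bool(entry.get("enable"))))
    return found


# Invented dashboard fragment for illustration only.
sample_dashboard = {
    "title": "example dashboard",
    "annotations": {
        "list": [
            {"name": "MW deploy", "datasource": "graphite", "enable": False},
            {
                "name": "Train deployments",
                "datasource": {"type": "graphite", "uid": "example-uid"},
                "enable": True,
            },
            {"name": "Scap logs", "datasource": {"type": "loki", "uid": "example-loki"}},
        ]
    },
}

hits = graphite_annotations(sample_dashboard)
print(hits)
```

Running this over every exported dashboard and splitting the hits on the enabled flag would give enabled and disabled lists like the two above.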