
"User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC
Closed, Resolved (Public)

Event Timeline

CDanis triaged this task as High priority. Jan 5 2022, 3:14 PM

Here's the PromQL query that statograph runs to scrape data: link

Looking at that in Grafana Explore, data is missing for exactly 2022-01-01 00:00 UTC until 2022-01-03 00:00 UTC:

image.png (942×1 px, 139 KB)

The data in Prometheus comes from an exporter that exports the results of an Elasticsearch query. That's configured here. Of particular note is the QueryIndices stanza, which tells the exporter to query, for a given time, the index corresponding to a certain year.week.

Looking at the data in Logstash directly, it seems that NEL data for 2022-01-01 00:00 UTC until 2022-01-03 00:00 UTC was stored in the Elasticsearch index named w3creportingapi-1.0.0-2-2022.52, i.e. week 52 of 2022, even though those two days belong to ISO week 52 of 2021.

image.png (710×1 px, 98 KB)

This is apparently something that others have tripped over in the past, as it has to do with the horrifying mess that is ISO week date numbers: https://github.com/logstash-plugins/logstash-output-elasticsearch/issues/541#issuecomment-270973321

So it seems that we need to use xxxx instead of YYYY in our index specification, and that we need to also make es_exporter understand 'weekyears'...
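To make the mismatch concrete, here is a minimal Python sketch. It is not the Logstash or es_exporter code; the two helper names are made up for illustration. It contrasts a suffix built from the calendar year plus the ISO week number with one built from the ISO week-based year plus the ISO week number:

```python
from datetime import date


def calendar_year_suffix(d: date) -> str:
    """Calendar year + ISO week number: the mismatched combination."""
    iso_week = d.isocalendar()[1]
    return f"{d.year}.{iso_week:02d}"


def weekyear_suffix(d: date) -> str:
    """ISO week-based year + ISO week number: a consistent pair."""
    iso_year, iso_week = d.isocalendar()[:2]
    return f"{iso_year}.{iso_week:02d}"


# Print both suffixes for the days around the year boundary.
for day in (date(2021, 12, 31), date(2022, 1, 1), date(2022, 1, 2), date(2022, 1, 3)):
    print(day, calendar_year_suffix(day), weekyear_suffix(day))
```

Only 1 and 2 January 2022 disagree (2022.52 versus 2021.52), which lines up with the two-day gap seen in the graph.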

There's also a separate issue here, which is that statograph is getting stuck on the interval where data is missing and still not uploading more. That needs investigation as well.

Per the linked upstream issue, Logstash uses Joda which uses this pattern syntax.

QueryIndices in es-exporter configurations use date math support in index names, an Elasticsearch feature. Elasticsearch uses Java's built-in DateTimeFormatter pattern syntax.

It seems we need two things for weekly indexes:

  1. Logstash should output weekly indexes with the xxxx.ww suffix (in Joda, xxxx is the weekyear)
  2. es-exporter should query the YYYY.ww suffix (in Java's DateTimeFormatter, YYYY is the week-based year)

Change 751765 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: update weekly indexes to use weekyear pattern syntax

https://gerrit.wikimedia.org/r/751765

Change 751766 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] prometheus: update affected es-exporter configs to use weekyear

https://gerrit.wikimedia.org/r/751766

Index curation is affected as well, because Python's datetime formatter doesn't handle weekyear in the same way. We ought to consider curating based on field_stats or creation_date instead.
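As a rough illustration of why name-based curation is fragile here, the sketch below reads a weekly suffix as an ISO weekyear.week pair and applies a simple age check. It is not the curation tool's code: week_start and older_than are hypothetical helpers, the 90-day threshold is arbitrary, and date.fromisocalendar needs Python 3.8 or later.

```python
from datetime import date, timedelta


def week_start(suffix: str) -> date:
    """Monday of the ISO week named by a 'weekyear.week' suffix such as '2021.52'."""
    year, week = (int(part) for part in suffix.split("."))
    return date.fromisocalendar(year, week, 1)


def older_than(suffix: str, today: date, max_age_days: int) -> bool:
    """True if the week named by the suffix started more than max_age_days ago."""
    return today - week_start(suffix) > timedelta(days=max_age_days)


today = date(2022, 4, 1)
print(week_start("2021.52"), older_than("2021.52", today, 90))
print(week_start("2022.52"), older_than("2022.52", today, 90))
```

Under this reading, the mis-named 2022.52 index appears to start in late December 2022, so an age check driven by the name would not consider it old until roughly a year later than intended. Curating on creation_date avoids parsing the name at all.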

Change 751765 merged by Cwhite:

[operations/puppet@production] logstash: update weekly indexes to use weekyear pattern syntax

https://gerrit.wikimedia.org/r/751765

Change 751766 merged by Cwhite:

[operations/puppet@production] prometheus: update affected es-exporter configs to use weekyear

https://gerrit.wikimedia.org/r/751766

Change 756041 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/software/statograph@master] Add a start_timestamp constraint

https://gerrit.wikimedia.org/r/756041

image.png (224×865 px, 23 KB)

It took just a single run of statograph -v upload_metrics -t 2022-01-03T00:00Z 1vzzyvjxzgsf to restore things to a good state -- once the most_recent_data_at timestamp had advanced past the missing data, automatic uploads worked again.

The gap from 01 Jan -- 03 Jan will soon rotate out of visibility on the public page, so I'm leaving it rather than doing more work to correct it.
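For readers wondering how an uploader can get stuck on a gap, here is a toy model in Python. It is not taken from statograph's source: run_once, WINDOW, start_override and the hourly sample data are all assumptions made for illustration. The model assumes each run only looks a fixed window ahead of most_recent_data_at and only advances that marker when it finds data.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=6)  # hypothetical amount of time examined per run


def run_once(samples, most_recent_data_at, start_override=None):
    """One upload run in the toy model; returns the new most_recent_data_at."""
    start = most_recent_data_at
    if start_override is not None and start_override > start:
        start = start_override
    batch = [t for t in samples if start < t <= start + WINDOW]
    # With no data in the window, the marker does not move.
    return max(batch) if batch else most_recent_data_at


def hourly(first, count):
    """Generate `count` hourly timestamps starting at `first`."""
    return [first + timedelta(hours=i) for i in range(count)]


utc = timezone.utc
# Made-up data: samples up to 31 Dec 23:00, nothing for 1-2 Jan, samples again from 3 Jan.
samples = hourly(datetime(2021, 12, 31, 20, tzinfo=utc), 4) + hourly(datetime(2022, 1, 3, tzinfo=utc), 12)
marker = datetime(2021, 12, 31, 23, tzinfo=utc)

stuck = run_once(samples, marker)  # window falls entirely inside the gap
unstuck = run_once(samples, marker, start_override=datetime(2022, 1, 3, tzinfo=utc))
resumed = run_once(samples, unstuck)  # no override needed any more
print(stuck == marker, unstuck, resumed)
```

In this model a single run with an explicit start past the gap is enough, after which ordinary runs advance on their own. That is consistent with the behaviour described above, but the real mechanism may differ.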

Change 756041 merged by jenkins-bot:

[operations/software/statograph@master] Add a start_timestamp constraint

https://gerrit.wikimedia.org/r/756041