"User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC
Closed, Resolved, Public

Authored By

	CDanis
	Jan 5 2022, 3:14 PM

Details

Subject	Repo	Branch	Lines +/-
Add a start_timestamp constraint	operations/software/statograph	master	+32 -4
prometheus: update affected es-exporter configs to use weekyear	operations/puppet	production	+5 -2
logstash: update weekly indexes to use weekyear pattern syntax	operations/puppet	production	+45 -6


Related Objects

Status	Subtype	Assigned	Task
Open	Feature	None	T22079 Provide a better means of status update delivery in WMF error message
Open		None	T202061 Implement an accurate and easy to understand status page for all wikis
Resolved		CDanis	T285569 Automated uploads of minimal & comprehensible timeseries metrics for statuspage display
Resolved		CDanis	T298619 "User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC

Event Timeline

CDanis created this task. Jan 5 2022, 3:14 PM

Restricted Application added a subscriber: Aklapper. Jan 5 2022, 3:14 PM

CDanis triaged this task as High priority. Jan 5 2022, 3:14 PM

Here's the PromQL query that statograph runs to scrape data: link

Looking at that in Grafana Explore, data is missing for exactly 2022-01-01 00:00 UTC until 2022-01-03 00:00 UTC.

The data in Prometheus comes from an exporter that exports the results of an elasticsearch query. That's configured here. Of particular note is the QueryIndices stanza that tells the exporter to query, for a given time, the index corresponding to a certain year.week.

Looking at the data in Logstash directly, it seems that NEL data for 2022-01-01 00:00 UTC until 2022-01-03 00:00 UTC was stored in the Elasticsearch index named w3creportingapi-1.0.0-2-2022.52, i.e. labelled as week 52 of 2022, even though those two days belong to ISO week 52 of weekyear 2021.

This is apparently something that others have tripped over in the past, as it has to do with the horrifying mess that is ISO week date numbers: https://github.com/logstash-plugins/logstash-output-elasticsearch/issues/541#issuecomment-270973321

So it seems that we need to use xxxx instead of YYYY in our index specification, and that we need to also make es_exporter understand 'weekyears'...
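To make the mismatch concrete, here is a minimal Python sketch of the two suffix schemes. It is not the Logstash code: the function names are made up for illustration, and it only assumes that the index suffix combines a year with the ISO week number, which is what the observed index name suggests.

```python
from datetime import date


def calendar_year_suffix(d: date) -> str:
    """Calendar year + ISO week, mimicking a Joda 'YYYY.ww' pattern."""
    _, iso_week, _ = d.isocalendar()
    return f"{d.year}.{iso_week:02d}"


def weekyear_suffix(d: date) -> str:
    """ISO weekyear + ISO week, mimicking a Joda 'xxxx.ww' pattern."""
    iso_year, iso_week, _ = d.isocalendar()
    return f"{iso_year}.{iso_week:02d}"


for d in (date(2021, 12, 31), date(2022, 1, 1), date(2022, 1, 2), date(2022, 1, 3)):
    print(d, calendar_year_suffix(d), weekyear_suffix(d))
# 2021-12-31 2021.52 2021.52
# 2022-01-01 2022.52 2021.52
# 2022-01-02 2022.52 2021.52
# 2022-01-03 2022.01 2022.01
```

Only 1 Jan and 2 Jan differ between the two columns, which lines up with the 48-hour gap seen in Prometheus.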

There's also a separate issue here, which is that statograph is getting stuck on the interval where data is missing and still not uploading more. That needs investigation as well.

Per the linked upstream issue, Logstash uses Joda which uses this pattern syntax.

QueryIndices in es-exporter configurations use date math support in index names, an Elasticsearch feature. Elasticsearch uses the pattern syntax of Java's built-in DateTimeFormatter.

It seems we need two things for weekly indexes:

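Logstash should write weekly indexes with an xxxx.ww suffix (xxxx is the weekyear in Joda)
es-exporter should query the matching YYYY.ww suffix (YYYY is the week-based year in Java's DateTimeFormatter)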
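For reference, a small lookup table of the relevant pattern letters, written as a Python sketch. The dictionary and helper names are illustrative only. The tokens come from the Joda-Time, java.time and Python strftime documentation. One caveat: java.time resolves Y and w through the formatter's locale week definition, so whether YYYY.ww matches ISO weeks exactly depends on how Elasticsearch is set up.

```python
# Year and week tokens per formatter. Note the trap: "YYYY" is the calendar
# year (year of era) in Joda, but the week-based year in java.time.
PATTERN_TOKENS = {
    "joda": {"weekyear": "xxxx", "calendar_year": "YYYY", "week": "ww"},
    "java_time": {"weekyear": "YYYY", "calendar_year": "yyyy", "week": "ww"},
    "python_strftime": {"weekyear": "%G", "calendar_year": "%Y", "week": "%V"},
}


def weekly_suffix_pattern(formatter: str) -> str:
    """Build the weekyear.week suffix pattern for the given formatter."""
    tokens = PATTERN_TOKENS[formatter]
    return f"{tokens['weekyear']}.{tokens['week']}"


for name in PATTERN_TOKENS:
    print(f"{name}: {weekly_suffix_pattern(name)}")
# joda: xxxx.ww
# java_time: YYYY.ww
# python_strftime: %G.%V
```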

Change 751765 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: update weekly indexes to use weekyear pattern syntax

https://gerrit.wikimedia.org/r/751765

Change 751766 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] prometheus: update affected es-exporter configs to use weekyear

https://gerrit.wikimedia.org/r/751766

Index curation is affected as well, because Python's datetime formatter doesn't handle weekyear in the same way. We ought to consider using curation based on field_stats or creation_date.
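A quick illustration of the curation problem, as a sketch only. It assumes the curation tooling parses index names with strftime-style directives such as %Y.%W; I have not checked what it uses. The helper index_week_start is hypothetical and needs Python 3.8+ for date.fromisocalendar.

```python
from datetime import date


def index_week_start(suffix: str) -> date:
    """Return the Monday that opens the ISO week named by an 'xxxx.ww' suffix."""
    year, week = (int(part) for part in suffix.split("."))
    return date.fromisocalendar(year, week, 1)  # day 1 = Monday


# %W counts weeks from the first Monday of the calendar year, so it does not
# line up with ISO week numbers around New Year.
print(date(2022, 1, 1).strftime("%Y.%W"))  # 2022.00
print(index_week_start("2021.52"))         # 2021-12-27
print(index_week_start("2022.01"))         # 2022-01-03
```

Parsing the suffix as an ISO weekyear and week gives the right start date, whereas a %Y.%W round trip would label 1 Jan 2022 as week 00 of 2022.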

Change 751765 merged by Cwhite:

[operations/puppet@production] logstash: update weekly indexes to use weekyear pattern syntax

https://gerrit.wikimedia.org/r/751765

Change 751766 merged by Cwhite:

[operations/puppet@production] prometheus: update affected es-exporter configs to use weekyear

https://gerrit.wikimedia.org/r/751766

Maintenance_bot removed a project: Patch-For-Review. Jan 7 2022, 10:10 PM

Change 756041 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/software/statograph@master] Add a start_timestamp constraint

https://gerrit.wikimedia.org/r/756041

gerritbot added a project: Patch-For-Review. Jan 21 2022, 6:38 PM

It took just a single run of statograph -v upload_metrics -t 2022-01-03T00:00Z 1vzzyvjxzgsf to restore things to a good state -- once the most_recent_data_at timestamp had advanced past the missing data, automatic uploads worked again.

The gap from 01 Jan -- 03 Jan will soon rotate out of visibility on the public page, so I'm leaving it rather than doing more work to correct it.
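For anyone wondering how a gap can wedge an uploader like this, here is a toy model in Python. It is not statograph's code. The fixed 24-hour query window, the hourly toy data and every name in it (fetch, run_once, start_override) are assumptions made purely for illustration of one plausible failure mode.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
GAP_START = datetime(2022, 1, 1, tzinfo=UTC)
GAP_END = datetime(2022, 1, 3, tzinfo=UTC)
WINDOW = timedelta(hours=24)  # hypothetical per-run query window


def fetch(start, end, step=timedelta(hours=1)):
    """Toy data source: one point per hour, except inside the gap."""
    points = []
    t = start
    while t < end:
        if not (GAP_START <= t < GAP_END):
            points.append(t)
        t += step
    return points


def run_once(most_recent_data_at, start_override=None):
    """Return the new most_recent_data_at after one simulated upload run."""
    start = most_recent_data_at if start_override is None else start_override
    points = fetch(start, start + WINDOW)
    # With nothing new in the window, the marker cannot advance.
    return max(points) if points else most_recent_data_at


marker = datetime(2021, 12, 31, 23, tzinfo=UTC)
print(run_once(marker) == marker)                  # True: stuck before the gap
marker = run_once(marker, start_override=GAP_END)  # one manual run past the gap
print(run_once(marker) > marker)                   # True: advancing again
```

In this model the window is shorter than the gap, so every automatic run sees only the point it already has. A single run with an explicit start past the gap moves the marker, after which ordinary runs make progress on their own.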

CDanis added a parent task: T285569: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display. Jan 24 2022, 3:53 PM

Change 756041 merged by jenkins-bot:

[operations/software/statograph@master] Add a start_timestamp constraint

https://gerrit.wikimedia.org/r/756041

Maintenance_bot removed a project: Patch-For-Review. Jan 25 2022, 10:10 PM

CDanis mentioned this in T370386: statograph_post errors with out of range float values since 2024-07-16. Jul 24 2024, 7:34 PM
