Page MenuHomePhabricator

Improve ORES observability
Closed, ResolvedPublic

Description

While chatting with the team it became clear that there is not a lot of knowledge about how to investigate an issue if ORES starts to emit a ton of errors, for example score_errored ones.

We should:

  1. Review our dashboards and see what's to fix (some graphs are not displayed etc..)
  2. Figure out how to debug an issue. For example, where to look when somebody wants to investigate why a spike of score_errored happens.
  3. Add basic alarms (SRE moved to AlertManager and we have still to add alarms to it, we never done it, the old alarms are gone).

Event Timeline

I am trying to see if there is a way to have per-model metrics. The ores daemons push statsd metrics to a local endpoint, that in turn aggregates them and exposes them for Prometheus.

I tcpdumped the statsd UDP stream, and this is some sample of the traffic:

ores.ores1001.score_processed.wikidatawiki:96.434832|ms
ores.ores1001.score_processed.wikidatawiki.damaging:96.434832|ms
ores.ores1001.score_processed.wikidatawiki.goodfaith:96.434832|ms
ores.ores1001.score_processed.wikidatawiki.itemquality:96.434832|ms
ores.ores1001.score_processed.wikidatawiki.itemtopic:96.434832|ms
[..]
ores.ores1001.score_processed.ukwiki:60.741663|ms
ores.ores1001.score_processed.ukwiki.damaging:60.741663|ms
ores.ores1001.score_processed.ukwiki.goodfaith:60.741663

The same pattern repeats for score_errored, etc.. There is a rule in the statsd exporter that discard completely the model bit (that in theory could become a prometheus label):

- labels:
    wiki: "$4"
  match: "^ores\\.([a-z]*)([0-9]*)\\.(precache_cache_hit|precache_cache_miss|revision_scored|score_cache_hit|score_cache_miss|score_errored|score_processor_overloaded)\\.([a-z]*)$"
  match_type: regex
  name: ores_${3}_total

The main weirdness that I see is that for score_processed the same value repeats across multiple model types (is it due to something that I don't see?). For score_errored, it seems better:

ores.ores1001.score_errored.wikidatawiki:1|c
ores.ores1001.score_errored.wikidatawiki.damaging:1|c

Fixed and cleaned up old panels in our dashboards.

The main weirdness that I see is that for score_processed the same value repeats across multiple model types (is it due to something that I don't see?). For score_errored, it seems better:

In https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid I see that there is the possibility of scoring multiple models at the same time.

@Halfak Hi! If you have time I'd like to ask you a question about how to track down the errors behind score_errored: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=72&orgId=1&refresh=1m

I tried to match the spikes with:

  • logstash ORES error stream
  • logs (app uwsgi celery etc..) on the ores nodes
  • traffic
  • Redis/Lock-Manager metrics
  • ORES HTTP Return codes

But I don't really find any correlation or useful info. Tried to check in the code but I didn't really find a good explanation, so I am wondering if you could share some insights about what the metric represents and how important it is. Thanks in advance :)

These are edits to Wikidata erroring. They might be edits to regular wiki pages. The damage and item quality models were made to assess edits to entities (items, properties) so it errors when trying to process wikitext.

Errored scores can happen for a number of reasons though. E.g. when some data has been deleted (e.g. a user name, a page, a specific revision). It can also happen when there is a bug in the code that prevents score generation, but in practice, that is very rare.

For how they are generated, see these two places where the metric is generated:

The feature extraction errors bubble up from here: https://github.com/wikimedia/ores/blob/0bd572d78ab210377281b0a919d13e34588d92fb/ores/scoring_context.py#L171

RevisionNotFound: https://github.com/wikimedia/revscoring/blob/master/revscoring/extractors/api/extractor.py#L135
UserNotFound: https://github.com/wikimedia/revscoring/blob/master/revscoring/extractors/api/extractor.py#L198
Data processing error (e.g. expected Entity but got Wikitext): https://github.com/wikimedia/revscoring/blob/master/revscoring/extractors/api/extractor.py#L213

It would be possible to extend the metrics collector so that it records what type of error occurred. You'd want to start here: https://github.com/wikimedia/ores/blob/master/ores/metrics_collectors/metrics_collector.py#L31 I might add a error_name field and pass the name of the exception. Most should be pretty informative. We could also log an error event when the error is anything except a subclass of MissingResource -- which PageNotFound, UserNotFound, UserDeleted, etc. all fit within. Those errors are expected to happen and are a natural consequence of the ability to delete stuff. Any other type of error suggests there could be a bug.

@Halfak thanks a lot for the detailed summary, it helps a lot! I would definitely like to have some logs indicating what happened during a certain timeframe, to be able to correlate spikes in score_errored and exceptions. My main fear is that one of these days we'll see a sustained increase in errors that needs to be debugged, and currently IIUC there is no quick way of understanding what's wrong right (for example, correlating a spike in wikidata score_errored with exceptions related to feature extraction problems, that may end up being a client trying to score something wrong etc..) ?

New ORES logstash dashboard! https://logstash.wikimedia.org/app/dashboards#/view/ORES

It is now way nicer and complete, including UA and URI breakdowns.

Change 758779 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] ores::web: log XFF header as REMOTE_ADDR when available

https://gerrit.wikimedia.org/r/758779

Change 758779 merged by Elukey:

[operations/puppet@production] ores::web: log XFF header as REMOTE_ADDR when available

https://gerrit.wikimedia.org/r/758779

Mentioned in SAL (#wikimedia-operations) [2022-02-01T11:31:03Z] <elukey> roll restart ORES to pick up logging change (use XFF header when possible) - T299137