Implement monitoring for articletopic / weighted_tags
Closed, Declined (Public)

Description

As seen in T285577: Several wikis have 0 articles for all ORES topics, we recently had an outage where articletopic queries returned zero results for numerous wikis. We were alerted to this by a community member, and it would have been better to have monitoring so we'd know there is a problem shortly after the code reached group2.

While GrowthExperiments shouldn't break as badly as it did when no topics are available (and we've fixed that), we do rely heavily on the articletopic query for our features.

This task exists for Growth-Team and Discovery-Search to talk about what kind of monitoring would be useful to implement.

Event Timeline

@Gehel @MPhamWMF I think this is mostly about what type of monitoring your team can/should implement to ensure that the articletopic field (and possibly other fields, like hasrecommendation) remains available. In this instance, an update to CirrusSearch caused the outage, but the root of the problem seemed to be elsewhere in the pipeline, so maybe there is monitoring that should be implemented earlier in the process.

The root cause of this bug is a human mistake combined with the known fragility of our reindexing process.
For this particular problem, what happened is:

  • originally the field used in Elasticsearch was ores_articletopics
  • we then decided to generalize this field to weighted_tags
  • a rename happened in the code, but with some BC code kept so that ores_articletopics could still be queried
  • the removal of the BC code was conditional on a reindex of the wikis (which is what actually performs the migration on the Elasticsearch side)
  • some months passed (divergence between the schema created by the code and the actual schema in Elasticsearch: not ideal)
  • I decided to check whether the BC code could actually be removed and verified a set of wikis that I thought could be affected by the BC removal (checking here means verifying that the reindex had happened and that the ores_articletopics -> weighted_tags migration had been applied). This is the main mistake: the set of wikis I checked was not the set that was affected by the issue.
  • the BC code was removed and merged, and the train rolled forward, leading to the issue
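
For illustration only, the verification step above could be run against every wiki rather than a hand-picked subset. The sketch below is not the actual CirrusSearch tooling: the function names and the input layout (a mapping from wiki id to the set of field names found in its live index) are assumptions, and only the two field names come from this task.

```python
# Hypothetical sketch: the field names come from this task; the helper names
# and the shape of the input data are assumptions made for illustration.
OLD_FIELD = "ores_articletopics"
NEW_FIELD = "weighted_tags"


def wikis_blocking_bc_removal(fields_by_wiki):
    """Return the wikis whose live index does not yet have NEW_FIELD.

    fields_by_wiki maps a wiki id (e.g. "enwiki") to the set of field
    names present in that wiki's live index mapping.
    """
    return sorted(
        wiki
        for wiki, fields in fields_by_wiki.items()
        if NEW_FIELD not in fields
    )


def safe_to_remove_bc(fields_by_wiki):
    """The BC code is only safe to remove once no wiki is left behind."""
    return not wikis_blocking_bc_removal(fields_by_wiki)
```

Called with the fields of all wikis, `wikis_blocking_bc_removal` lists exactly the wikis that still need a reindex, so an incomplete manual selection cannot hide an unmigrated index.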

I think the main problem is that we lack a good overview of the pending changes that require a reindex. Reindexing all the wikis is time-consuming and error-prone, so we sometimes reindex only a few wikis to limit the burden on the team, but this obviously did not work well in this case.
As for what we plan to improve, we created:

Other parts of the pipeline are involved in the behavior of these two search keywords, but they are properly monitored via Airflow.

There could be more direct ways to monitor this, such as adding an alert that calls the search API on the production wikis, but this would only have improved our reaction time; it would not have prevented the bug from hitting end users.
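
A minimal sketch of such a check, assuming the MediaWiki Action API search module (`list=search` with `srinfo=totalhits`). The helper names, the User-Agent string, and the list of API URLs passed in are hypothetical, and the topic name is only an example; this is not an existing alert.

```python
import json
import urllib.parse
import urllib.request


def build_search_url(api_url, topic):
    """Build an Action API URL asking for the total hits of an articletopic query."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"articletopic:{topic}",
        "srinfo": "totalhits",
        "srlimit": "1",
        "format": "json",
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def total_hits(response):
    """Extract totalhits from a parsed API response; missing data counts as 0."""
    return response.get("query", {}).get("searchinfo", {}).get("totalhits", 0)


def fetch_json(url):
    """Fetch and parse a JSON response (the User-Agent value is a placeholder)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "articletopic-check-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.load(resp)


def wikis_with_no_results(api_urls, topic, fetch=fetch_json):
    """Return the wiki ids whose articletopic query returns zero hits."""
    return [
        wiki
        for wiki, api_url in api_urls.items()
        if total_hits(fetch(build_search_url(api_url, topic))) == 0
    ]
```

The `fetch` parameter is injectable so the logic can be exercised without network access. A real alert would also need to skip wikis that legitimately have no topic data.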

Since we have a few improvements tracked, as noted by @dcausse, let's close this ticket and focus on the specific fixes.

Thank you! ❤