We need to define and monitor some measures of what it means for the image-suggestions pipelines to be running OK.
At the moment we're monitoring and alerting on:
- too long since data pipeline metrics were written (data pipeline metrics get written at the end of the pipeline process, so this means the pipeline has failed somewhere along the way)
- a large change in the number of suggestions
- a failure to push pipeline metrics (this has never happened)
If we get alerted, we have a runbook.
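As a rough illustration of the first two checks above, here is a minimal sketch. The function names and both thresholds are hypothetical, chosen only for the example; the real alerting rules may be defined differently.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds, for illustration only.
MAX_METRICS_AGE = timedelta(days=8)
MAX_RELATIVE_CHANGE = 0.5


def metrics_are_stale(last_written, now, max_age=MAX_METRICS_AGE):
    """Return True if pipeline metrics were last written more than max_age ago.

    Metrics are written at the end of the pipeline, so stale metrics
    suggest the pipeline failed somewhere along the way.
    """
    return now - last_written > max_age


def suggestion_count_changed_too_much(previous, current,
                                      max_change=MAX_RELATIVE_CHANGE):
    """Return True if the suggestion count moved by more than max_change,
    measured relative to the previous count."""
    if previous == 0:
        # Any suggestions appearing from an empty baseline counts as a large change.
        return current != 0
    return abs(current - previous) / previous > max_change


now = datetime(2023, 1, 20, tzinfo=timezone.utc)
last_written = datetime(2023, 1, 10, tzinfo=timezone.utc)
print(metrics_are_stale(last_written, now))
print(suggestion_count_changed_too_much(1000, 400))
```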
We also have agreement on allowed revert rates on images that have been added because of our suggestions.
What we don't have is:
- agreement on how old the suggestions are allowed to be (e.g. suggestions must be less than a week old for 50 weeks of the year)
- agreement on expected/allowed rejection rates for suggestions (from Growth's tooling)
- agreement on uptime for the suggestions API
- anything else?
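To make the freshness example concrete, an objective like "less than a week old for 50 weeks of the year" could be evaluated roughly as follows. This is only a sketch: the function name and the input shape (one oldest-suggestion age per week) are assumptions, and the 7-day and 50-week figures come from the example above rather than from any agreed target.

```python
from datetime import timedelta


def freshness_slo_met(weekly_max_ages, max_age=timedelta(days=7),
                      required_weeks=50):
    """Return True if enough weeks had suggestions fresher than max_age.

    weekly_max_ages: one timedelta per week, giving the age of the
    oldest suggestions served during that week (hypothetical input).
    """
    good_weeks = sum(1 for age in weekly_max_ages if age < max_age)
    return good_weeks >= required_weeks


# A year with two bad weeks, e.g. because an upstream snapshot was missing.
year = [timedelta(days=3)] * 50 + [timedelta(days=10)] * 2
print(freshness_slo_met(year))
```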
SLO breakdown
We depend on the following teams and need to figure out what guarantees they can make. In order of priority:
- Data Engineering
  - upstream data dependencies - missing wmf.wikidata_item_page_link snapshots have been the most frequent and critical issue so far
  - data pipeline infrastructure
    - an-airflow1004.eqiad.wmnet Airflow machine
    - DAGs repo
    - GitLab CI tools
- Search - discovery.cirrus_index_without_content/cirrus_replica=codfw snapshots
- Enterprise - HTML dumps
The following teams depend on us and we need to know their minimum requirements:
- Search
- Growth
- Android
- ourselves - notifications & MediaSearch
- iOS - maybe, we should check whether they have something