[L] Image suggestions data pipeline monitoring
Closed, Resolved, Public

Description

The most recent import of image-suggestions data into Elasticsearch produced unexpected results. We noticed something was wrong because a search for hasrecommendation:image on ptwiki returned ~2,900 pages instead of the expected ~130k.

We need monitoring that alerts us when something has gone wrong with the data pipeline.


  • Count the total number of pages with hasrecommendation:image across a selection of wikis once per week, and push the count to Prometheus
  • Send an alert to sd-alerts@lists.wikimedia.org if any of the following is true:
      • the count has changed by >10% since last week
      • it has been significantly more than a week since the total number of pages was recorded
      • the push gateway has been reporting a failure for over 15m
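
The three alert conditions above could be expressed roughly as follows. This is only a sketch of the decision logic, not the alert rules that were eventually merged: the function names, the `failing_since` parameter, and the 8-day staleness limit (standing in for "significantly over a week") are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Thresholds from the task description. MAX_AGE is an assumed value for
# "significantly over a week"; the task does not give an exact figure.
MAX_RELATIVE_CHANGE = 0.10
MAX_AGE = timedelta(days=8)
MAX_PUSH_FAILURE = timedelta(minutes=15)


def relative_change(previous: int, current: int) -> float:
    """Return the absolute fractional change between two weekly counts."""
    if previous == 0:
        # Avoid dividing by zero: any move away from zero is treated as unbounded.
        return 0.0 if current == 0 else float("inf")
    return abs(current - previous) / previous


def alert_reasons(
    previous: int,
    current: int,
    recorded_at: datetime,
    now: datetime,
    failing_since: Optional[datetime] = None,
) -> List[str]:
    """Return the list of alert conditions that currently hold (empty if none)."""
    reasons = []
    if relative_change(previous, current) > MAX_RELATIVE_CHANGE:
        reasons.append("count changed by more than 10% since last week")
    if now - recorded_at > MAX_AGE:
        reasons.append("count has not been recorded for over a week")
    if failing_since is not None and now - failing_since > MAX_PUSH_FAILURE:
        reasons.append("push gateway has been failing for over 15m")
    return reasons


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # The ptwiki incident: ~130k pages expected, ~2900 returned.
    print(alert_reasons(130_000, 2_900, now - timedelta(days=1), now))
```

With the ptwiki numbers from the description, the drop from ~130k to ~2900 is a change of roughly 98%, so the first condition fires on its own.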

Event Timeline

CBogen renamed this task from Image suggestions data pipeline monitoring to [L] Image suggestions data pipeline monitoring.Jul 27 2022, 4:20 PM
CBogen updated the task description.

Suggested things to monitor:

  • check that the latest relevant snapshot exists (date of the most recent Monday minus one week) for all image_suggestions_* tables in Hive analytics_platform_eng
  • number of rows in each image_suggestions_* Hive table is > X for the latest snapshot (X depends on the table)
  • select some rows from image_suggestions_suggestions and check that they are in Cassandra
  • make sure the number of rows in image_suggestions_search_index_full is larger than the number of rows in image_suggestions_search_index_delta for the latest snapshot
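
A rough sketch of the date and row-count checks suggested above, assuming the row counts have already been fetched from Hive by some other means. The function names are hypothetical, the per-table minimums ("X") are left to the caller because the task does not specify them, and treating a Monday as its own "most recent Monday" is an assumption.

```python
from datetime import date, timedelta
from typing import Dict, List

FULL = "image_suggestions_search_index_full"
DELTA = "image_suggestions_search_index_delta"


def expected_snapshot(today: date) -> date:
    """Return the most recent Monday minus one week.

    Assumption: if today is a Monday, it counts as the most recent Monday.
    """
    most_recent_monday = today - timedelta(days=today.weekday())
    return most_recent_monday - timedelta(weeks=1)


def row_count_problems(counts: Dict[str, int], minimums: Dict[str, int]) -> List[str]:
    """Compare per-table row counts for the latest snapshot against minimums.

    `counts` maps table name to row count; `minimums` maps table name to the
    threshold X for that table. Both are supplied by the caller.
    """
    problems = []
    for table, minimum in minimums.items():
        rows = counts.get(table)
        if rows is None:
            problems.append(f"{table}: no row count for latest snapshot")
        elif rows <= minimum:
            problems.append(f"{table}: {rows} rows, expected more than {minimum}")
    # The full search index should always be larger than the delta.
    if FULL in counts and DELTA in counts and counts[FULL] <= counts[DELTA]:
        problems.append(f"{FULL} is not larger than {DELTA}")
    return problems


if __name__ == "__main__":
    print(expected_snapshot(date.today()))
```

The Cassandra spot check is left out here because it depends on how the cluster is queried, which the task does not describe.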

Not a blocker, but the Hive-to-Cassandra tasks seem to have failed and auto-recovered a few times. Pasting the stack trace:

java.io.IOException: Failed to write statements to image_suggestions.suggestions. The
latest exception was
  Cassandra timeout during SIMPLE write query at consistency LOCAL_QUORUM (2 replica were required but only 1 acknowledged the write)

Please check the executor logs for more exceptions and information
             
	at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1$$anonfun$apply$3.apply(TableWriter.scala:238)
	at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1$$anonfun$apply$3.apply(TableWriter.scala:236)
	at scala.Option.map(Option.scala:146)
	at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:236)
	at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:198)
	at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:105)
	at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:104)
	at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:122)
	at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:104)
	at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:198)
	at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:185)
	at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:172)
	at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
	at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Perhaps the Cassandra server was busy?

Change 842418 had a related patch set uploaded (by Cparle; author: Cparle):

[operations/puppet@production] Added structured data team

https://gerrit.wikimedia.org/r/842418

Change 842420 had a related patch set uploaded (by Cparle; author: Cparle):

[operations/alerts@master] Alert for image suggestions pipeline

https://gerrit.wikimedia.org/r/842420

Change 842418 merged by Filippo Giunchedi:

[operations/puppet@production] Added structured data team

https://gerrit.wikimedia.org/r/842418

Change 843420 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: fix yaml for structured-data AM router

https://gerrit.wikimedia.org/r/843420

Change 843420 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: fix yaml for structured-data AM router

https://gerrit.wikimedia.org/r/843420

Change 843960 had a related patch set uploaded (by Cparle; author: Cparle):

[operations/alerts@master] Alerts for image suggestions pipeline

https://gerrit.wikimedia.org/r/843960

Change 843996 had a related patch set uploaded (by Cparle; author: Cparle):

[operations/alerts@master] Alerts for image suggestions pipeline

https://gerrit.wikimedia.org/r/843996

Change 843960 abandoned by Cparle:

[operations/alerts@master] Alerts for image suggestions pipeline

Reason:

https://gerrit.wikimedia.org/r/843960

Change 842420 abandoned by Cparle:

[operations/alerts@master] Alert for image suggestions pipeline

Reason:

Abandoning in favour of https://gerrit.wikimedia.org/r/c/operations/alerts/+/843996

https://gerrit.wikimedia.org/r/842420

Cparle updated the task description.

Change 843996 merged by jenkins-bot:

[operations/alerts@master] Alerts for image suggestions pipeline

https://gerrit.wikimedia.org/r/843996

FYI, there are still a few small issues with this; hoping the latest deployment will fix them.

OK all working now, hooray!