
Wikistats 2 Backend: Resiliency, Rollback and Deployment of Data
Closed, Resolved · Public · 0 Estimated Story Points

Description

What happens if indexing fails?

How do we make sure metric computations have not changed significantly between runs? How do we prevent serving bad data from a bad run?

Rollback strategy. Can we have 2 snapshots and only flip when one is good?

Event Timeline

I guess T155507 can help in identifying bad runs of mediawiki history reconstruction.

There are plenty of possible approaches here. Listing the two that make the most sense to me:

  • Oozie style: add steps to the oozie mediawiki-reduced job to:
    • Check for metrics validity while warming the cache for the new datasource (a pre-implemented set of requests run in python against the current and new datasources, checking for discrepancies higher than a threshold; a sketch follows this list).
    • If the script succeeds, run another script updating a value in cassandra saying which datasource is current (this means we update the AQS-wikistats module to read its druid datasource configuration from cassandra - very easily doable, and it would even be good if we could add an internal cache with regular checks against the table - to be confirmed / discussed with the services team).
    • pros/cons: Uses our loved/hated oozie (alarms, hue, logic/scheduling decoupled).
  • AQS style: Update AQS-wikistats to discover new datasources and check/warm them:
    • Regularly check for new datasources in druid.
    • When a new one shows up, make a bunch of requests and validate the results against the currently used source (this also warms the cache).
    • If the results match closely enough, swap the datasource used for prod, else report an error.
    • pros/cons: Everything in a single place, probably no need for cassandra, but there is a startup issue: which datasource to pick, and should it be tested?
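
To make the first option more concrete, here is a minimal python sketch of the "check validity while warming cache" step. The broker URL, datasource names, threshold and canned query are illustrative assumptions, not the real production configuration.

```python
# Hypothetical sketch of the "check validity while warming cache" step.
# Broker URL, datasource names, threshold and query list are assumptions.
import json
import urllib.request

DRUID_BROKER = "http://druid-broker.example.org:8082/druid/v2"  # assumed endpoint
THRESHOLD = 0.05  # maximum tolerated relative discrepancy (5%)

# A small, pre-implemented set of queries to replay against both datasources.
CANNED_QUERIES = [
    {
        "queryType": "timeseries",
        "granularity": "month",
        "intervals": ["2017-01-01/2018-01-01"],
        "aggregations": [{"type": "longSum", "name": "edits", "fieldName": "edits"}],
    },
]

def run_query(datasource, query):
    """Run one Druid query against a datasource; this also warms its cache."""
    body = dict(query, dataSource=datasource)
    req = urllib.request.Request(
        DRUID_BROKER,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def discrepancy(old, new):
    """Relative difference between the summed metric values of two result sets."""
    total_old = sum(r["result"]["edits"] for r in old) or 1  # avoid division by zero
    total_new = sum(r["result"]["edits"] for r in new)
    return abs(total_new - total_old) / total_old

def check(current_ds, new_ds):
    """Fail loudly if any canned query diverges beyond the threshold."""
    for q in CANNED_QUERIES:
        d = discrepancy(run_query(current_ds, q), run_query(new_ds, q))
        if d > THRESHOLD:
            raise ValueError("Discrepancy %.2f%% exceeds threshold" % (d * 100))

if __name__ == "__main__":
    check("mediawiki_history_reduced_2018_03", "mediawiki_history_reduced_2018_04")
```

Running the same requests against both datasources is what gives the cache-warming side effect mentioned later in this thread.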

+1 to @mforns comment

Let's talk about this at our next tasking meeting. I think the best option is the 1st one, so we test the validity of the data closest to its definition, thus at creation time. I think warming up the cache should happen afterwards, in the AQS deployment step of this data. So warming up the cache is an AQS operation, but data loading into druid is contingent on us having some quality score of how good it is.

> I think warming up the cache should happen afterwards, in the AQS deployment step of this data.

Given we probably want to use Druid as a query engine to check numbers between old and new, cache warming would actually be a side-effect of checking data consistency.

> Given we probably want to use Druid as a query engine to check numbers between old and new, cache warming would actually be a side-effect of checking data consistency.

mmm... wait, the data cannot be surfaced outside when we do not yet know whether it is any good. So are we talking about requests that are internal to aqs itself? They will warm up the druid cache, but in no case should they touch the web cache. Correct?

Let's move this task to tasking; I think there is quite a bit to talk about.

First round of discussion with the team:

  • Things we agree on:
    • Using multiple datasources in druid (snapshots) seems the way to go to facilitate rollbacks (the naming convention could follow our snapshots: YYYY-MM).
    • Data quality checks using the old/new datasources in druid also seem interesting, for both data quality and cache warming.
  • Thing still to be discussed: how do we swap from the old datasource to the new one in AQS when we think it's ready (or the other way around when we roll back)? Multiple ideas:
    • Use cassandra as a key/value config store (no deploy needed, the change can be pushed via API, but we'd be using a ''data'' store for config; a sketch follows this list).
    • Use a dedicated file with its own repo (deploy needed).
    • Use another conf system (etcd, zookeeper...) - again another tool...
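
As a strawman for the cassandra option, here is a minimal python sketch using the cassandra-driver package. The contact point, keyspace, table and key names are made up for illustration; the real AQS-wikistats module would do the equivalent from node, ideally with a periodic refresh of the value.

```python
# Sketch of "cassandra as a key/value config store" for the current datasource.
# Contact point, keyspace, table and key names are assumptions, not real config.
from cassandra.cluster import Cluster

cluster = Cluster(["aqs-cassandra.example.org"])  # assumed contact point
session = cluster.connect("wikistats_config")     # assumed keyspace

# Schema assumed to be something like:
#   CREATE TABLE config (key text PRIMARY KEY, value text);

def get_current_datasource():
    """Read which druid datasource AQS should currently query."""
    row = session.execute(
        "SELECT value FROM config WHERE key = %s", ("druid_datasource",)
    ).one()
    return row.value if row else None

def set_current_datasource(datasource):
    """Swap (or roll back) with a single write, no deploy needed."""
    session.execute(
        "UPDATE config SET value = %s WHERE key = %s",
        (datasource, "druid_datasource"),
    )

set_current_datasource("mediawiki_history_reduced_2018_04")
print(get_current_datasource())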

TBD !

For the record, I liked Joseph's idea of 3 data sources. One being served right now, one backup, and one being loaded next. When loaded_next is done, it is checked against served_right_now for accuracy and cache warming. When that passes, the backup is deleted, and served_right_now becomes backup, loaded_next becomes served_right_now. How to do this is still up for debate.
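
A minimal sketch of that rotation logic follows, with placeholder callables for the check and drop steps; how the served/backup/loaded_next state would actually be persisted is exactly the open question above.

```python
# Sketch of the three-datasource rotation: one datasource serving traffic, one kept
# as backup for rollback, one freshly loaded. passes_checks and drop_datasource are
# placeholder callables, not existing tools.

def rotate(served, backup, loaded_next, passes_checks, drop_datasource):
    """Return the new (served, backup) pair after trying to promote loaded_next."""
    if not passes_checks(loaded_next, served):
        # loaded_next is rejected: keep serving the current datasource and backup.
        return served, backup
    drop_datasource(backup)        # the old backup is deleted
    return loaded_next, served     # served becomes backup, loaded_next becomes served


# Example: promoting the 2018-04 snapshot over 2018-03, keeping 2018-02 as backup.
served, backup = rotate(
    "mediawiki_history_reduced_2018_03",
    "mediawiki_history_reduced_2018_02",
    "mediawiki_history_reduced_2018_04",
    passes_checks=lambda new, old: True,          # stand-in for the real comparison
    drop_datasource=lambda ds: print("drop", ds),
)
```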

The other thing to mention is that Druid in theory supports in-place updating of the kind of data we serve for Wikistats via its Lookup mechanism (http://druid.io/docs/latest/querying/lookups.html). We never looked into this in depth, and now with Druid 10 it might be a good idea.

JAllemandou renamed this task from Beta Release: Resiliency, Rollback and Deployment of Data to Wikistats 2 Backend: Resiliency, Rollback and Deployment of Data.Apr 18 2018, 4:55 PM
JAllemandou edited projects, added Analytics-Kanban; removed Analytics.
JAllemandou moved this task from In Progress to Parent Tasks on the Analytics-Kanban board.

Another round of discussion with the team:

  • Quality checks should happen before data gets loaded into druid.
  • Since T155507, we now have statistics over the data generated by the Mediawiki-history reconstruction job. The first layer of data-quality checking should happen there (subtask: T192481).
  • Another layer of data-quality checking should be done over the mediawiki-history-reduced dataset. This implies keeping the data instead of deleting it after druid indexation (subtask: T192482). A new job step would then check data similarity between the previous and current snapshots (subtask: T192483; a sketch follows this list).
  • With those checks satisfied, we are ok to index the data in druid; cache warming and the datasource swap should then happen (no task yet).
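
As a rough illustration of that snapshot-similarity step (T192483), here is a hedged PySpark sketch. The table name, snapshot values, grouping column and 5% threshold are assumptions for illustration only, not the actual job definition.

```python
# Hedged sketch of a previous-vs-current snapshot similarity check over
# mediawiki-history-reduced. Table name, snapshots, grouping column and
# threshold are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mediawiki-history-reduced-check").getOrCreate()

def snapshot_counts(snapshot):
    """Event counts per project for one snapshot of the reduced dataset."""
    return (
        spark.table("wmf.mediawiki_history_reduced")   # assumed table name
        .where(F.col("snapshot") == snapshot)
        .groupBy("project")
        .agg(F.count("*").alias("events"))
    )

previous = snapshot_counts("2018-03").withColumnRenamed("events", "events_prev")
current = snapshot_counts("2018-04").withColumnRenamed("events", "events_curr")

# Projects whose event count moved by more than 5% between snapshots are suspicious;
# projects with no previous-snapshot row are ignored here for simplicity.
suspicious = (
    previous.join(current, "project")
    .withColumn(
        "ratio",
        F.abs(F.col("events_curr") - F.col("events_prev")) / F.col("events_prev"),
    )
    .where(F.col("ratio") > 0.05)
)

if suspicious.count() > 0:
    raise RuntimeError("Snapshot similarity check failed for %d projects" % suspicious.count())
```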

Not discussed this time: How do we swap datasources????

mforns set the point value for this task to 0.May 7 2018, 4:05 PM

Ping @Nuria - Can we close this parent task?