
Compare a few tables per section between hosts and DC
Open, Normal, Public

Description

The s8 incident T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") was caused by a replication gap (which is still under investigation).
This gap is hard to detect if replication doesn't break (it should have broken, but so far we don't know why it didn't).

There are plans to automatically check tables for data differences as part of T104459: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves. If we had had a process that checks some tables for differences, we could have detected that some tables differed, and an investigation would probably have started.

The idea is to check some important (but not too big) tables that are likely to have many changes during the day, so we can detect differences.

We could start with the user table on one wiki per section.
This table sees enough concurrent activity, but is small enough that it can be compared in less than an hour (for enwiki, for example), and even if there are schema changes running on that table, they would not be big enough to take more than 24h to complete.

We should use compare.py with not-too-aggressive options so it can run automatically and without any risk.
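To keep such a comparison low-impact, the usual approach is to walk the table in small primary-key chunks and hash each chunk on both hosts, flagging only the ranges that differ. Below is a minimal sketch of that idea; this is not compare.py's actual code, and the fetch callbacks are hypothetical placeholders for per-host queries:

```python
import hashlib

def chunk_ranges(min_id, max_id, chunk_size):
    """Yield (low, high) primary-key ranges covering [min_id, max_id]."""
    low = min_id
    while low <= max_id:
        high = min(low + chunk_size - 1, max_id)
        yield (low, high)
        low = high + 1

def diff_chunks(fetch_a, fetch_b, min_id, max_id, chunk_size=10000):
    """Compare two hosts chunk by chunk. fetch_a/fetch_b are callables
    returning the rows for a pk range on each host; only the ranges
    whose hashed contents differ are reported."""
    differing = []
    for low, high in chunk_ranges(min_id, max_id, chunk_size):
        digest_a = hashlib.sha256(repr(fetch_a(low, high)).encode()).hexdigest()
        digest_b = hashlib.sha256(repr(fetch_b(low, high)).encode()).hexdigest()
        if digest_a != digest_b:
            differing.append((low, high))
    return differing
```

Chunking keeps each query short, so the check can run against production replicas without holding long locks, and a differing range can then be re-checked row by row.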

The first approach could be to compare one host per DC (maybe the candidate masters of each DC) once a day and, if there are differences, send an email (the same way we do with the data checks on labs).

Right now there is no inventory where we can automatically detect and select the candidate masters, so the hosts to check should be hardcoded manually (or selected by grepping on db-eqiad.php and db-codfw.php).

Event Timeline

Restricted Application added a subscriber: Aklapper. Oct 17 2018, 6:38 AM
Marostegui triaged this task as Normal priority. Oct 17 2018, 6:38 AM
Marostegui moved this task from Triage to Next on the DBA board.
Marostegui moved this task from To Triage to Follow-up/Actionables on the Wikimedia-Incident board.

So this has to be done (I will check in case there is a duplicate task already), not arguing against that.

My suggestion would be to add as an actionable, alternatively or in addition, an alert on revision table gaps for those wikis with enough active contributors (enwiki, commons, wikidatawiki, maybe others). That would also alert on non-DB issues, like edits not being possible or being very slow due to other infrastructure problems, or DB issues other than consistency (drops, read only, master down, ...).

A similar check could be done on the blobs for content servers. That would cover most of the content.

A third way would be to check consistency between content and metadata servers, which would also have detected the issue (content without corresponding revisions).

Marostegui added a comment (edited). Oct 23 2018, 1:17 PM

So this has to be done (I will check in case there is a duplicate task already), not arguing against that.
My suggestion would be to add as an actionable, alternatively or in addition, an alert on revision table gaps for those wikis with enough active contributors (enwiki, commons, wikidatawiki, maybe others). That would also alert on non-DB issues, like edits not being possible or being very slow due to other infrastructure problems, or DB issues other than consistency (drops, read only, master down, ...).

I thought it was an actionable already :-), but if you feel it isn't, edit the task to make it clearer.
My idea was to have this task select one (or a few) tables to use as a check on one wiki per section for now. That is why I proposed user for now, as that table is mostly append-only, so most wikis will probably have at least one new user created per day.
Fine if we also want to include revision :-)

A similar check could be done on the blobs for content servers. That would cover most of the content.

Agreed, but I would leave it for a second step, once we have a PoC per section running to make sure we are covered there.

A third way would be to check consistency between content and metadata servers, which would also have detected the issue (content without corresponding revisions).

I would also leave that one for a different phase.

that task is mostly append only

I am guessing you mean table. It is mostly updates: every time an edit is done, a counter is updated there. revision is probably the one with the most inserts (and even it gets updated, at least at the moment).

that task is mostly append only

I am guessing you mean table. It is mostly updates: every time an edit is done, a counter is updated there. revision is probably the one with the most inserts (and even it gets updated, at least at the moment).

Yep, I edited it but you were faster!
But every new user is a new row, which could help detect gaps too.

Marostegui assigned this task to Banyek. Oct 24 2018, 8:19 AM

As per our last chat, assigning it to Balazs, as he'll start with it

Marostegui moved this task from In progress to Next on the DBA board. Oct 24 2018, 8:19 AM
Banyek moved this task from Backlog to next on the User-Banyek board.
Banyek moved this task from next to In progress on the User-Banyek board.

In the very first iteration I will create a 'config' file containing a list of sections, DBs, tables, and hosts, plus a wrapper around it that will start compare.py with that data.

You might also want to run a first iteration over the hosts and tables you select, to make sure the tables are indeed identical from the first day and to avoid false positives once the check starts running automatically, so we know we are in a good state.

The first dataset to check will be:

s1 enwiki user user_id db1067 db2048
s2 bgwiki user user_id db1066 db2035
s3 bmwiki user user_id db1075 db2043
s3 jbwiki user user_id db1075 db2043
s3 liwiki user user_id db1075 db2043
s4 commonswiki user user_id db1068 db2051
s5 shwiki user user_id db1070 db2052
s6 frwiki user user_id db1061 db2039
s7 huwiki user user_id db1062 db2040
s8 wikidatawiki user user_id db1071 db2045

According to sizes, for s2 maybe you want to check:

enwiktionary
itwiki
ptwiki

s3:

mediawikiwiki
enwikibooks
frwikisource

s5:

dewiki
shwiki

s7:

arwiki
eswiki
metawiki

Ok, I'll use those!

Banyek added a comment (edited). Oct 24 2018, 10:30 AM

I think this could be a good format/data YAML to start with:

s1: 
   primary: db1067
   compare: db2048
   databases:
     enwiki:
       tables:
         user: user_id

s2:
  primary: db1066
  compare: db2035
  databases:
    enwiktionary:
      tables: 
        user: user_id
    itwiki:
      tables: 
        user: user_id
    ptwiki:
      tables: 
        user: user_id
  
s3:
  primary: db1075
  compare: db2043
  databases:
    mediawikiwiki:
      tables: 
        user: user_id
    enwikibooks:
      tables: 
        user: user_id
    frwikisource:  
      tables: 
        user: user_id
        
s4:
  primary: db1068
  compare: db2051
  databases:
    commonswiki:
      tables:
        user: user_id
        
s5:
  primary: db1070
  compare: db2052
  databases:
    dewiki:
      tables:
        user: user_id
    shwiki:
      tables:
        user: user_id
        
s6:
  primary: db1061
  compare: db2039
  databases:
    frwiki:
      tables:
        user: user_id
        
s7:
  primary: db1062
  compare: db2040
  databases:
    arwiki:
      tables:
        user: user_id
    eswiki:
      tables:
        user: user_id
    metawiki:  
      tables:
        user: user_id
        
s8:
  primary: db1071
  compare: db2045
  databases: 
    wikidatawiki:
      tables:
        user: user_id

@Banyek Hardcoding the masters in configurations seems to me like a bad idea: they are already defined redundantly 4 times, on mediawiki, on puppet, on tendril and on prometheus. We should reduce the redundancy, not increase it.

There is a tables_to_compare.txt file in the operations/software:dbtools repo; you may want to use or replace that, maybe with zarcillo rather than a yaml file we may forget about.

There used to be a multi-run functionality in WMFMariaDB.py, but I removed it. Maybe it can be added back in another file, but that would be an additional configuration place.

I wouldn't use tables_to_check.txt for now (not entirely, just a few of those tables, like revision or user), so the checks don't take too long and we can evaluate as we go, and then expand to more tables as we learn about this process, false positives and so forth.

I wouldn't use tables_to_check.txt for now

I wouldn't either, just wanted him to have a look at it and not duplicate the master definition.

@jcrespo You are right indeed, but the current data structure is good in the sense that there will be no real problem removing the 'primary' and maybe the 'compare' fields later. It would definitely be better to get that data from a single location (aka the single source of truth).
I will write down the progress as I move along with the prototype and later as I get closer to the desired state.

eg. :

s8:
  databases: 
    wikidatawiki:
      tables:
        user: user_id

First I'd like to solve calling 'compare.py' for all of those sections/databases/tables and parsing the output, and when it works with the hardcoded servers, move on to the problem of 'finding the correct hosts'.

jcrespo added a comment (edited). Oct 24 2018, 11:16 AM

To not just be a pain, this is how you can discover the master for a particular section automatically:

root@db1115.eqiad.wmnet[zarcillo]> SELECT instance FROM masters WHERE section='s1' and dc='eqiad';
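That lookup could be wrapped in a small helper for the wrapper script. This is a hedged sketch assuming a DB-API cursor already connected to the zarcillo database; the fetch_master name is made up:

```python
def fetch_master(cursor, section, dc):
    """Look up the current master instance for a section and datacenter
    in zarcillo's masters table (hypothetical helper; cursor is any
    DB-API cursor connected to the zarcillo database)."""
    cursor.execute(
        "SELECT instance FROM masters WHERE section = %s AND dc = %s",
        (section, dc))
    row = cursor.fetchone()
    return row[0] if row else None
```

Using parameterized queries keeps the section/dc values safely escaped, and returning None makes a missing section easy to detect.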

Note: this is work in progress.
The current yaml structure is the following:

Sections:
- s1:
    Primary: db1067
    Compare: db2048
    Databases:
     - enwiki:
         Tables:
         - user: user_id
     - bgwiki:
         Tables:
         - user: user_id

And running the script produces the following output:

compare.py enwiki user user_id db1067 db2048
compare.py bgwiki user user_id db1067 db2048

Which is actually the CLI invocation.
In the next iteration I want to get rid of the 'Primary' and 'Compare' fields, as those are mentioned in T207253#4691432
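The wrapper step described above (parsed YAML in, compare.py command lines out) could look roughly like this sketch; it assumes the list-of-single-key-mappings layout shown in the comment, and build_invocations is a hypothetical name:

```python
def build_invocations(config):
    """Walk the parsed YAML structure (Sections -> Databases -> Tables,
    each a list of single-key mappings) and emit one compare.py
    command line per table."""
    commands = []
    for section_entry in config["Sections"]:
        for section, cfg in section_entry.items():
            primary, compare = cfg["Primary"], cfg["Compare"]
            for db_entry in cfg["Databases"]:
                for db, db_cfg in db_entry.items():
                    for table_entry in db_cfg["Tables"]:
                        for table, pk in table_entry.items():
                            commands.append(
                                "compare.py %s %s %s %s %s"
                                % (db, table, pk, primary, compare))
    return commands
```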

my idea is to select the 'primary' and 'compare' hosts via a --datacenter command line option: the given datacenter's master host will be treated as the primary, and the other master will be the one to compare against
It should also be possible to override those hosts from CLI switches and/or from the yaml file itself.
The precedence should be:

automatic < yaml < cli
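That precedence rule can be sketched as a simple merge where later sources override earlier ones and unset (None) values are ignored; resolve_hosts is a hypothetical name:

```python
def resolve_hosts(automatic, yaml_cfg, cli):
    """Merge host settings with precedence automatic < yaml < cli:
    each later source overrides the earlier ones, and keys set to
    None are treated as unset and skipped."""
    merged = {}
    for source in (automatic, yaml_cfg, cli):
        for key, value in source.items():
            if value is not None:
                merged[key] = value
    return merged
```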

The config file should also be selectable via a CLI option.
A few more ideas:
a) if compare.py finds a difference in the tables, the script should recheck them n times (CLI option?) to avoid false positives
b) the *{table_name: primary_key}* mapping should be deprecated, as knowing the table we can look up its primary key as well (running SHOW CREATE TABLE?), but this functionality should maybe live in compare.py rather than in this tool
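Idea b) could be prototyped by querying information_schema instead of parsing SHOW CREATE TABLE output; a sketch assuming a DB-API cursor, with fetch_primary_key as a made-up helper:

```python
def fetch_primary_key(cursor, schema, table):
    """Discover a table's primary-key columns from
    information_schema.key_column_usage, so the config no longer
    needs a hardcoded {table: primary_key} mapping."""
    cursor.execute(
        "SELECT column_name FROM information_schema.key_column_usage "
        "WHERE table_schema = %s AND table_name = %s "
        "AND constraint_name = 'PRIMARY' ORDER BY ordinal_position",
        (schema, table))
    return [row[0] for row in cursor.fetchall()]
```

Returning a list (ordered by ordinal_position) also handles composite primary keys, which a single {table: column} mapping cannot express.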

Change 469889 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] mariadb: table checker for monitoring data drift

https://gerrit.wikimedia.org/r/469889

Anomie added a subscriber: Anomie. Nov 8 2018, 2:39 PM

The idea is to check some important (but not too big) tables that are likely to have many changes during the day, so we can detect differences.
We could start with the user table on one wiki per section.
This table sees enough concurrent activity, but is small enough that it can be compared in less than an hour (for enwiki, for example), and even if there are schema changes running on that table, they would not be big enough to take more than 24h to complete.

How about the recentchanges table? It's very active, since it gets a row added for every insert into revision and logging, but at the same time rows older than 30 days get deleted, so it shouldn't be too huge (currently on enwiki it has 8,161,120 rows, versus 34,907,431 for the user table).

Banyek moved this task from In progress to On hold on the User-Banyek board. Dec 3 2018, 11:15 AM
Banyek added a comment. Dec 6 2018, 4:59 PM

@Anomie it's worth a look, thanks

Marostegui moved this task from In progress to Next on the DBA board. Dec 12 2018, 6:11 AM
Marostegui removed Banyek as the assignee of this task. Jan 11 2019, 8:37 PM