
Labs database replica drift
Closed, Resolved · Public

Description

https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database/Replica_drift

2017-10-17: Replica drift was a recurring problem for the Wiki Replicas prior to the introduction of row-based replication (RBR) between the sanitarium server(s) and their upstream sources. The RBR replication used to populate the *.{analytics,web}.db.svc.eqiad.wmflabs servers will not allow arbitrary differences in data to be synchronized. If there is a replication failure, all replication from the master server halts, which in turn raises an alert so that the problem is noticed and corrected.
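
For context, "row-based replication" refers to the binary log format used by the replication stream; a minimal illustrative snippet of the relevant MariaDB setting follows (illustrative only, not the actual production configuration):

-- RBR means the binary log records row images rather than SQL statements.
-- Checking and (hypothetically) setting the format on a MariaDB server:
SHOW VARIABLES LIKE 'binlog_format';
SET GLOBAL binlog_format = 'ROW';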

Please report database drift/inconsistencies as a comment on this task.

Please mention:

  • the SQL query, preferably in a format that is easy to compare between production and labs (count(*), md5sum, ...); see the sketch after this list
  • the actual result
  • the expected result (e.g. based on the live wiki, or a rough indication of why the query is wrong)
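
A report in that count(*) style might look like the following sketch (the table and filter below are only illustrative, not taken from a real report):

-- run the identical aggregate on the Labs replica and on production (e.g. via Quarry or a known-good figure)
SELECT COUNT(*) FROM page WHERE page_namespace = 0;
-- then paste both numbers: the actual result (labs) and the expected result (production)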

Related tasks:
T74413: Different results with queries in labs versus production
T106470: Tool Labs enwiki_p replicated database missing rows
T115207: Lots of rows are missing from enwiki_p.`revision`
T115517: Data missing from June 11/12 on s3.labsdb
T118095: Missing rows in revision table of enwiki.labsdb (data integrity issue)
T126946: disk failure on labsdb1002
T129432: Lost database changes on s2 for 3 hours on labs replicas
T133469: Discrepancy between labsdb replicas of arwiki_p.user_groups
T133715: Missing data on labs replica database
T134203: enwiki_p replica on s1 is corrupted
T136618: Wrong page title in labs database replica enwiki page table
T138927: nlwiki Labs replica with problems
T153058: LabsDB infrastructure pending work

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

May I ask you to please copy the query here to check the drift? We keep the older subtasks only because they predate this one, but following a single format will help speed up the resolution of all issues (it would be almost impossible to handle one ticket per affected row). Thank you in advance; I am just trying to respond to all issues quickly!

MariaDB [enwiki_p]> select page_title from page join pagelinks on pl_from = page_id where page_namespace=0 and pl_namespace = 0 and pl_title="Will_Johnson";
+----------------------------------+
| page_title                       |
+----------------------------------+
| Denver_Broncos                   |
| List_of_current_AFC_team_rosters |
| Will_Johnson_(disambiguation)    |
| William_Johnson_(rugby_player)   |
| Will_Johnson_(American_football) |
| Johnson,_Will                    |
| 2016_Denver_Broncos_season       |
+----------------------------------+
7 rows in set (0.01 sec)

The comparable query to the enwiki API yields only 4 incoming links.

https://en.wikipedia.org/w/api.php?action=query&list=backlinks&bltitle=Will_Johnson&blnamespace=0&format=jsonfm

@russblau Thanks for the report; it is 5 as we speak, but it is indeed wrong. This, and hopefully all drift issues, should be fixed by the reimports onto the new labsdb servers (T147052), which I hope will be available soon.

https://quarry.wmflabs.org/query/5979

As of 30 November 2016, this is showing 45 rows that the query says are 'orphaned', but when checked on English Wikipedia they certainly aren't.

I was asked to mention this here, although I'm unsure if it's actually a 'replication' issue.

https://quarry.wmflabs.org/query/14916 reports 35 entries in the transcode table, all for files that were uploaded to Commons in 2013, and then renamed while still being transcoded. This created 'orphaned' entries in the transcode table.

AFAICT, the bug that caused this was fixed long ago, as the examples here are all from 2013. Before the recent cleanup of old failed transcodes, there were many other entries here, all related to files that had been deleted while being transcoded. These were easily fixed a couple of months ago by simply undeleting and re-deleting the file. Those entries were also all from years ago... I suspect they were relics of bugs that have since been fixed, left behind in the database.

There seems to be no way from the 'admin' end to fix these entries, as they are for redirects, not videos.... they could probably usefully simply be deleted from the DB, and then we can watch to see if it recurs.

The particular search is relevant, however, as on both recent occasions when the video scaler system was reset (last month, and earlier this month) all 'running' transcodes showed up in it... they were shown as 'completed' on the file pages, but there were reports (such as T154186) that supposedly successfully transcoded videos would not play. Requeueing the videos, so that they were transcoded again, seemed to fix the issue.

Resetting a transcode, while it is running, also results in it ending up in this search.

This also seems to be a bug, but a somewhat useful one.

@Revent: As the production database servers return the same result, this has nothing to do with labs replication.

@Revent then I misunderstood you; we should file a proper, standalone bug.

I have found six deleted eswiki articles which still appear in the page table in Labs, located using this trial-and-error query after one of them was detected in my Labs tool:

MariaDB [eswiki_p]> select page_title from page where page_is_new = 1 and page_touched like '2015061%' and page_namespace = 0 and page_is_redirect = 0 and page_touched = page_links_updated;
+---------------------------------------------------------------------+
| page_title                                                          |
+---------------------------------------------------------------------+
| Cristian_Manea                                                      |
| Morgawr                                                             |
| XFL_Airabonita                                                      |
| Lego_Battles:_Ninjago                                               |
| FIFA_16.                                                            |
| Yessica_Puín_nació_en_1999/07/16_En_la_Ciudad_de_Bogotá-Colombia    |
+---------------------------------------------------------------------+
6 rows in set (9.70 sec)

Fixed:

root@labsdb1003[eswiki_p]> select page_title from page where page_is_new = 1 and page_touched like '2015061%' and page_namespace = 0 and page_is_redirect = 0 and page_touched = page_links_updated;
Empty set (9.90 sec)

root@labsdb1001[eswiki_p]> select page_title from page where page_is_new = 1 and page_touched like '2015061%' and page_namespace = 0 and page_is_redirect = 0 and page_touched = page_links_updated;
Empty set (10.43 sec)

There seem to be several pages on English Wikipedia which have been deleted but still appear on the Labs and Analytics Store replicas in the page table. For example:

MariaDB [enwiki_p]> select * from page where page_namespace = 0 AND page_title LIKE 'BatissForever';
+----------+----------------+---------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+
| page_id  | page_namespace | page_title    | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random    | page_touched   | page_links_updated | page_latest | page_len | page_content_model |
+----------+----------------+---------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+
| 49995280 |              0 | BatissForever |                   |            0 |                0 |           0 | 0.638177166992 | 20160331025325 | 20160330185410     |   712718594 |      212 | wikitext           |
+----------+----------------+---------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+

"several" is too vague- I have fixed the one given:

root@neodymium:~$ ./sql.py -h labsdb1001.eqiad.wmnet enwiki -e "select * from page where page_namespace = 0 AND page_title LIKE 'BatissForever'" --no-dry-run

Results for labsdb1001.eqiad.wmnet:3306/enwiki:
0 rows in set (0.00 sec)
root@neodymium:~$ ./sql.py -h labsdb1003.eqiad.wmnet enwiki -e "select * from page where page_namespace = 0 AND page_title LIKE 'BatissForever'" --no-dry-run

Results for labsdb1003.eqiad.wmnet:3306/enwiki:
0 rows in set (0.00 sec)

Here are a few more such deleted pages that I happen to know of:

Labs:

MariaDB [enwiki_p]> select count(*) from page where page_namespace in (10,11) and ( page_title like 'Cite_doi%' OR page_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 1215
1 row in set (0.04 sec)

MariaDB [enwiki_p]> select count(*) from redirect where rd_namespace in (10,11) and ( rd_title like 'Cite_doi%' OR rd_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 149
1 row in set (0.00 sec)

MariaDB [enwiki_p]> select count(*) from pagelinks where pl_namespace in (10,11) and ( pl_title like 'Cite_doi%' OR pl_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 7509
1 row in set (0.96 sec)

Production:

mysql:wikiadmin@db1083 [enwiki]> select count(*) from page where page_namespace in (10,11) and ( page_title like 'Cite_doi%' OR page_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 15
1 row in set (0.01 sec)

mysql:wikiadmin@db1083 [enwiki]> select count(*) from redirect where rd_namespace in (10,11) and ( rd_title like 'Cite_doi%' OR rd_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 2
1 row in set (0.00 sec)

mysql:wikiadmin@db1083 [enwiki]> select count(*) from pagelinks where pl_namespace in (10,11) and ( pl_title like 'Cite_doi%' OR pl_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 2205
1 row in set (0.03 sec)

To clarify, the only reliable solution is to start using the new servers already (labsdb-web.eqiad.wmnet and labsdb-analytics.eqiad.wmnet). See:

root@labsdb1009[enwiki_p]> select count(*) from page where page_namespace in (10,11) and ( page_title like 'Cite_doi%' OR page_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 15
1 row in set (0.01 sec)

root@labsdb1009[enwiki_p]> select count(*) from redirect where rd_namespace in (10,11) and ( rd_title like 'Cite_doi%' OR rd_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 2
1 row in set (0.00 sec)

root@labsdb1009[enwiki_p]> select count(*) from pagelinks where pl_namespace in (10,11) and ( pl_title like 'Cite_doi%' OR pl_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 2205
1 row in set (0.04 sec)

The fixes I will be doing to reconcile the existing labsdb1001 and 1003 will only be poor hacks that will break again shortly.

Is there a task tracking the updating of the DNS entries such as enwiki.labsdb to point to the new servers?

No, we are not ready because not all wikis are available yet, but you can follow the meta ticket at T153058.

We will finish the reimports first, then announce it as an "opt-in" beta, and later change the DNS defaults.

Labs now:

root@labsdb1001[enwiki_p]> select count(*) from page where page_namespace in (10,11) and ( page_title like 'Cite_doi%' OR page_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 15
1 row in set (0.75 sec)

root@labsdb1001[enwiki_p]> select count(*) from redirect where rd_namespace in (10,11) and ( rd_title like 'Cite_doi%' OR rd_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 2
1 row in set (0.30 sec)

root@labsdb1001[enwiki_p]> select count(*) from pagelinks where pl_namespace in (10,11) and ( pl_title like 'Cite_doi%' OR pl_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 2205
1 row in set (4.30 sec)

Thank you for the reports; keep reporting important differences as comments here. Just remember the goal is to get rid of the unsafe current labs hosts so we do not have to keep doing this :-/.

@jcrespo: I don't have any way to identify all the extra pages on Tool Labs. All I can say is that there are 1032 extra there:

Tool Labs:

MariaDB [enwiki_p]> select count(*) from page;
+----------+
| count(*) |
+----------+
| 42266366 |
+----------+

Production:

mysql:wikiadmin@db1083 [enwiki]> select count(*) from page;
+----------+
| count(*) |
+----------+
| 42265334 |
+----------+

Is there not any way to sync them? This is causing problems for us, but we can work around it if we have to.
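
One way to narrow down where the extra rows live, short of listing every page, is to bucket the counts by page_id range and compare the buckets between Labs and production; this is only a sketch of the idea, not an agreed procedure:

-- per-million-id buckets; any bucket whose count differs between Labs and production
-- contains the extra (or missing) rows and can be drilled into further
SELECT page_id DIV 1000000 AS bucket, COUNT(*) AS pages
FROM page
GROUP BY bucket
ORDER BY bucket;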

Maybe what we can try is to stop sanitarium2 and sanitarium at the same replication position and import the page table (it is around 7.3G).

@Marostegui - you can do that if you want, but a) it will cause more complaints about missing pages and the lag, and b) it will break again in only a few days (all tables were reimported for enwiki in a long process that took 4 months, and by the end of it the data had already drifted again).

If Kaldari wants a reliable replica, he should start using labsdb-web.eqiad.wmnet or labsdb-analytics.eqiad.wmnet now. The current labsdb hosts have no fix because MediaWiki does not support labs correctly, and we have to work around that with the new replication method.

@jcrespo: Is there (or will there be) a way to access and create user databases on the new servers (as per https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#User_databases)?

@russblau No, the new servers are in read-only mode. We are unsure whether and how we can support that (because of T156869), so for now all new servers are fully read-only. You can create databases on toolsdb as usual.
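
For reference, creating a user database on toolsdb follows the usual credentialUser__dbname convention; a minimal sketch, assuming the tools.labsdb service name and a hypothetical credential user u1234:

$ mysql --defaults-file=$HOME/replica.my.cnf -h tools.labsdb
MariaDB [(none)]> CREATE DATABASE u1234__example;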

Cross-post from T166091: Labs database corruption:

SELECT pl_namespace, pl_title
FROM page
JOIN pagelinks ON pl_from = page_id
WHERE page_namespace=0 AND page_title="Ajinkya_Rahane"
AND pl_title LIKE "Ankit%";
+--------------+--------------+
| pl_namespace | pl_title     |
+--------------+--------------+
|            0 | Ankit_Sharma |
+--------------+--------------+

From a production database server:

mysql:wikiadmin@db1083 [enwiki]> SELECT pl_namespace, pl_title
    -> FROM page
    -> JOIN pagelinks ON pl_from = page_id
    -> WHERE page_namespace=0 AND page_title="Ajinkya_Rahane"
    -> AND pl_title LIKE "Ankit%";
+--------------+--------------------------+
| pl_namespace | pl_title                 |
+--------------+--------------------------+
|            0 | Ankit_Sharma_(cricketer) |
+--------------+--------------------------+
1 row in set (0.00 sec)

Copied from T167554: Missing plwikisource_p.revision and plwikisource_p.page data on labs replica database by @Ankry:

As this query shows (https://quarry.wmflabs.org/query/19383), revisions from 1147609 to 1147764 are missing from the plwikisource_p.revision table in the labs database.

They do not seem to have been deleted, so some Quarry queries report incomplete revision lists (or incorrect revision counts) for specific pages, e.g.:
https://quarry.wmflabs.org/query/19385
vs.
https://pl.wikisource.org/w/index.php?title=Strona:M._Arcta_s%C5%82ownik_ilustrowany_j%C4%99zyka_polskiego_-_Tom_1.djvu/732&action=history

Can it be fixed?

LabsDB
$ sql plwikisource_p
MariaDB [plwikisource_p]> SELECT rev_id FROM revision WHERE rev_id = 1147693;
Empty set (0.00 sec)

MariaDB [plwikisource_p]> SELECT MAX(rev_id) FROM revision WHERE rev_id < 1147693;
+-------------+
| MAX(rev_id) |
+-------------+
|     1147608 |
+-------------+
1 row in set (0.00 sec)

MariaDB [plwikisource_p]> SELECT MIN(rev_id) FROM revision WHERE rev_id > 1147693;
+-------------+
| MIN(rev_id) |
+-------------+
|     1147765 |
+-------------+
1 row in set (0.00 sec)
prod
$ sql plwikisource
(wikiadmin@db1077) [plwikisource]> SELECT rev_id FROM revision WHERE rev_id = 1147693;
+---------+
| rev_id  |
+---------+
| 1147693 |
+---------+
1 row in set (0.00 sec)

(wikiadmin@db1077) [plwikisource]> SELECT MAX(rev_id) FROM revision WHERE rev_id < 1147693;
+-------------+
| MAX(rev_id) |
+-------------+
|     1147692 |
+-------------+
1 row in set (0.00 sec)

(wikiadmin@db1077) [plwikisource]> SELECT MIN(rev_id) FROM revision WHERE rev_id > 1147693;
+-------------+
| MIN(rev_id) |
+-------------+
|     1147694 |
+-------------+
1 row in set (0.00 sec)

The revisions missing from plwikisource_p were created between 20160428113148 and 20160428122541. Page table entries created in the same period, e.g.
https://pl.wikisource.org/w/index.php?title=Specjalna:Wk%C5%82ad/AkBot&dir=prev&offset=20160428110138&limit=130&target=AkBot
are also missing. (I checked a few at random from the above list; none is found in labs.)
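
A count over the reported range makes the gap easy to compare in the count(*) format requested above (a sketch using the rev_id bounds quoted in the report):

SELECT COUNT(*) FROM revision WHERE rev_id BETWEEN 1147609 AND 1147764;
-- on the Labs replica this returns 0; on production it returns however many
-- revisions were actually created in that range, so the two are directly comparable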

bd808 triaged this task as Medium priority. Jun 13 2017, 7:25 PM
bd808 added a project: Epic.

Cross-post from T166091: Labs database corruption:

SELECT pl_namespace, pl_title
FROM page
JOIN pagelinks ON pl_from = page_id
WHERE page_namespace=0 AND page_title="Ajinkya_Rahane"
AND pl_title LIKE "Ankit%";
+--------------+--------------+
| pl_namespace | pl_title     |
+--------------+--------------+
|            0 | Ankit_Sharma |
+--------------+--------------+

This is still happening on the default and busted replicas:

mzmcbride@tools-bastion-03:~$ sql enwiki_p;
MariaDB [enwiki_p]> SELECT pl_namespace, pl_title FROM page JOIN pagelinks ON pl_from = page_id WHERE page_namespace=0 AND page_title="Ajinkya_Rahane" AND pl_title LIKE "Ankit%";
+--------------+--------------+
| pl_namespace | pl_title     |
+--------------+--------------+
|            0 | Ankit_Sharma |
+--------------+--------------+
1 row in set (0.01 sec)

But when connecting to labsdb-web.eqiad.wmnet, there's better data:

mzmcbride@tools-bastion-03:~$ mysql -hlabsdb-web.eqiad.wmnet enwiki_p;
MariaDB [enwiki_p]> SELECT pl_namespace, pl_title FROM page JOIN pagelinks ON pl_from = page_id WHERE page_namespace=0 AND page_title="Ajinkya_Rahane" AND pl_title LIKE "Ankit%";
+--------------+--------------------------+
| pl_namespace | pl_title                 |
+--------------+--------------------------+
|            0 | Ankit_Sharma_(cricketer) |
+--------------+--------------------------+
1 row in set (0.00 sec)

I updated this task's description a bit.

From arwiki: File:الله عز وجل.png. Deleted in Nov 2015.

$ sql arwiki_p -e 'SELECT 1 AS "Exists" FROM arwiki_p.image WHERE img_name="الله_عز_وجل.png";'
Exists
1
$ sql enwiki_p -e 'SELECT 1 AS "Exists" FROM arwiki_p.image WHERE img_name="الله_عز_وجل.png";'
$ # No results

SELECT COUNT(rev_id) FROM enwiki_p.revision WHERE rev_page = 48357647

returns 4 when it should be 0, as the page was deleted on May 4, 2017. I fixed this by just restoring it and deleting it again, but thought I'd still let you know. Apparently there are a number of such Draft pages that were deleted around May 4 that are still in the revision table.
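
A general way to look for more such leftovers is to search for revision rows whose page no longer exists; this is only a sketch (it is slow on enwiki and can also catch rows that are merely in flight):

-- revisions pointing at a page_id that is absent from the page table
SELECT rev_page, COUNT(*) AS orphaned_revs
FROM revision
LEFT JOIN page ON page_id = rev_page
WHERE page_id IS NULL
GROUP BY rev_page
LIMIT 20;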

From arwiki: File:الله عز وجل.png. Deleted in Nov 2015.

$ sql arwiki_p -e 'SELECT 1 AS "Exists" FROM arwiki_p.image WHERE img_name="الله_عز_وجل.png";'
Exists
1
$ sql enwiki_p -e 'SELECT 1 AS "Exists" FROM arwiki_p.image WHERE img_name="الله_عز_وجل.png";'
$ # No results

Using the new wiki replicas:

$ mysql --defaults-file=$HOME/replica.my.cnf -h wikireplica-web.eqiad.wmnet
(u3518@wikireplica-web.eqiad.wmnet) [(none)]> use arwiki_p;
Database changed

(u3518@wikireplica-web.eqiad.wmnet) [arwiki_p]> SELECT 1 AS "Exists" FROM arwiki_p.image WHERE img_name="الله_عز_وجل.png";
Empty set (0.00 sec)

(u3518@wikireplica-web.eqiad.wmnet) [arwiki_p]> use enwiki_p;
Database changed

(u3518@wikireplica-web.eqiad.wmnet) [enwiki_p]> SELECT 1 AS "Exists" FROM arwiki_p.image WHERE img_name="الله_عز_وجل.png";
Empty set (0.00 sec)

The new replica servers are up and running and fully populated with data for all of the wikis. The last thing that is really blocking their announcement to the Toolforge and Cloud VPS communities is T174860: Define naming scheme for connecting to new wiki replica cluster. wikireplica-web.eqiad.wmnet can be used for now as shown in my test queries, but that service name will not be maintained long term.

I have found an apparent replication issue on one of the new servers:

tools.dplbot@tools-bastion-03:~$ mysql -h enwiki.analytics.db.svc.eqiad.wmflabs enwiki_p
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 4455943
Server version: 10.1.28-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [enwiki_p]> select page_title, cl_to from page, categorylinks where page_id = cl_from and cl_from=335652;
+------------+--------------------------------+
| page_title | cl_to                          |
+------------+--------------------------------+
| CU         | All_disambiguation_pages       |
| CU         | Wikipedia_disambiguation_pages |
+------------+--------------------------------+
2 rows in set (0.01 sec)

By comparison, the live wiki page [[CU]] at https://en.wikipedia.org/wiki/CU shows links to three categories: "All_disambiguation_pages", "All_article_disambiguation_pages", and "Disambiguation_pages".

root@labsdb1009[enwiki]> select page_title, cl_to from page, categorylinks where page_id = cl_from and cl_from=335652;
+------------+--------------------------------+
| page_title | cl_to                          |
+------------+--------------------------------+
| CU         | All_disambiguation_pages       |
| CU         | Wikipedia_disambiguation_pages |
+------------+--------------------------------+
2 rows in set (0.00 sec)

(db1052 is the enwiki production master)

root@db1052[enwiki]> select page_title, cl_to from page, categorylinks where page_id = cl_from and cl_from=335652;
+------------+--------------------------------+
| page_title | cl_to                          |
+------------+--------------------------------+
| CU         | All_disambiguation_pages       |
| CU         | Wikipedia_disambiguation_pages |
+------------+--------------------------------+
2 rows in set (0.01 sec)

Either there is a cache issue (reload your browser to clear its cache), or there is a production problem, but it is not replica drift.

On a core/production server it has the same result:

root@PRODUCTION s1 slave[enwiki]>  select page_title, cl_to from page, categorylinks where page_id = cl_from and cl_from=335652;
+------------+--------------------------------+
| page_title | cl_to                          |
+------------+--------------------------------+
| CU         | All_disambiguation_pages       |
| CU         | Wikipedia_disambiguation_pages |
+------------+--------------------------------+
2 rows in set (0.00 sec)

And the same on the enwiki master:

root@PRODUCTION s1 master[enwiki]> select page_title, cl_to from page, categorylinks where page_id = cl_from and cl_from=335652;
+------------+--------------------------------+
| page_title | cl_to                          |
+------------+--------------------------------+
| CU         | All_disambiguation_pages       |
| CU         | Wikipedia_disambiguation_pages |
+------------+--------------------------------+
2 rows in set (0.00 sec)

So the labs replicas contain the same data as the core servers.

User error :-) please disregard.

We should update https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database/Replica_drift and close this ticket, as it is unlikely that there will be drift again (due to ROW + GTID), unless something very very bad happens. Drift on the old servers will not be corrected, as it was not possible to do the fixes in place.

bd808 assigned this task to jcrespo.

The main description here and https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database/Replica_drift have been updated to reflect our confidence that RBR replication and other changes made for the new Wiki Replica cluster will prevent drift from occurring (or at least from going unnoticed until reported by users).

I am closing this as resolved rather than declined because there was a large amount of work done by @jcrespo and @Marostegui to address the underlying problems rather than just continuing to make point-in-time fixes as new problems were reported. On behalf of all of the users of the Wiki Replica databases I would like to thank them yet again for the effort that they put into this project.