Labs database replica drift
Open, Normal, Public

Description

https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database/Replica_drift

Please report database drift/inconsistencies as a comment on this task.

Please mention:

  • the SQL query, preferably in a format that is easy to compare between production and labs (count(*), md5sum, ...)
  • the actual result
  • the expected result (e.g. based on the live wiki, or a rough indication of why the query is wrong)
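For example, a row count plus a crude checksum over an id range (a minimal sketch; the table and id range are illustrative) can be run unchanged on both production and labs and the results compared:

SELECT COUNT(*) AS row_count,
       SUM(CRC32(rev_id)) AS id_checksum
FROM revision
WHERE rev_id BETWEEN 1000000 AND 2000000;

If either value differs between the two hosts, the range contains drifted rows and can be bisected further.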

Related tasks:
T74413: Different results with queries in labs versus production
T106470: Tool Labs enwiki_p replicated database missing rows
T115207: Lots of rows are missing from enwiki_p.`revision`
T115517: Data missing from June 11/12 on s3.labsdb
T118095: Missing rows in revision table of enwiki.labsdb (data integrity issue)
T126946: disk failure on labsdb1002
T129432: Lost database changes on s2 for 3 hours on labs replicas
T133469: Discrepancy between labsdb replicas of arwiki_p.user_groups
T133715: Missing data on labs replica database
T134203: enwiki_p replica on s1 is corrupted
T136618: Wrong page title in labs database replica enwiki page table
T138927: nlwiki Labs replica with problems
T153058: LabsDB infrastructure pending work

NOTE: Status as of June 20, 2017:

Users can connect to labsdb-web.eqiad.wmnet or labsdb-analytics.eqiad.wmnet to test out better replicas.
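For example, from a Tools bastion (the same commands appear in the comments below):

$ sql enwiki_p                              # current default replicas
$ mysql -h labsdb-web.eqiad.wmnet enwiki_p  # new replicas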

Is there a task tracking the updating of the DNS entries such as enwiki.labsdb to point to the new servers?

No, we are not ready because not all wikis are available yet, but you can follow the meta ticket: T153058

We will finish the reimports first, then announce the new servers as an "opt-in" beta, and later change the DNS defaults.

Related Objects

Restricted Application added subscribers: Zppix, Aklapper. · Jun 29 2016, 5:22 PM
Anomie added a subscriber: Anomie. · Jun 29 2016, 6:23 PM

May I ask you to please copy the query here so we can check the drift? We keep the older subtasks only because they predate this one; following a single format will help speed up the resolution of all issues (it would be almost impossible to handle one ticket per drifted row). Thank you in advance - I am just trying to respond quickly to all issues!

russblau added a subscriber: russblau. · Edited · Nov 5 2016, 2:11 PM
MariaDB [enwiki_p]> select page_title from page join pagelinks on pl_from = page_id where page_namespace=0 and pl_namespace = 0 and pl_title="Will_Johnson";
+----------------------------------+
| page_title                       |
+----------------------------------+
| Denver_Broncos                   |
| List_of_current_AFC_team_rosters |
| Will_Johnson_(disambiguation)    |
| William_Johnson_(rugby_player)   |
| Will_Johnson_(American_football) |
| Johnson,_Will                    |
| 2016_Denver_Broncos_season       |
+----------------------------------+
7 rows in set (0.01 sec)

The comparable query against the enwiki API yields only 4 incoming links.

https://en.wikipedia.org/w/api.php?action=query&list=backlinks&bltitle=Will_Johnson&blnamespace=0&format=jsonfm

@russblau Thanks for the report - it is 5 as we speak, but it is indeed wrong. This, and hopefully all drift issues, is fixed by the reimports on the new labsdb servers (T147052), which I hope will be available soon.

jcrespo moved this task from Triage to Meta/Epic on the DBA board. · Nov 10 2016, 1:16 PM

https://quarry.wmflabs.org/query/5979

As of 30 November 2016, this is showing 45 rows that the query says are 'orphaned' but that, when checked on English Wikipedia, certainly aren't.

Revent added a subscriber: Revent. · Jan 25 2017, 1:54 AM

I was asked to mention this here, although I'm unsure if it's actually a 'replication' issue.

https://quarry.wmflabs.org/query/14916 reports 35 entries in the transcode table, all for files that were uploaded to Commons in 2013, and then renamed while still being transcoded. This created 'orphaned' entries in the transcode table.

AFAICT, the bug that caused this was fixed long ago, as the examples here are all from 2013. Before the recent cleanup of old failed transcodes, there were many other entries here, all related to files that had been deleted while being transcoded. A couple of months ago, these were easily fixed by simply undeleting and redeleting the files; those entries, too, were all from years ago. I suspect these database entries are relics of bugs that have since been fixed.

There seems to be no way to fix these entries from the 'admin' end, as they are for redirects, not videos. They could probably just be deleted from the DB, and then we can watch to see whether the problem recurs.

The particular search is relevant, however: both times the video scaler system was recently reset (last month, and earlier this month), all 'running' transcodes ended up in it. They were shown as 'completed' on the file pages, but there were reports (such as T154186) that supposedly successfully transcoded videos would not play. Requeueing the videos, so that they were transcoded again, seemed to fix the issue.

Resetting a transcode while it is running also results in it ending up in this search.

This also seems to be a bug, but a somewhat useful one.

@Revent: As the production database servers return the same result, this has nothing to do with labs replication.

@Revent then I misunderstood you; we should file a proper, standalone bug.

-jem- added a subscriber: -jem-. · May 15 2017, 1:27 PM

I have found six deleted eswiki articles which still appear in the page table in Labs, located using this trial-and-error query after one of them was detected by my Labs tool:

MariaDB [eswiki_p]> select page_title from page where page_is_new = 1 and page_touched like '2015061%' and page_namespace = 0 and page_is_redirect = 0 and page_touched = page_links_updated;
+---------------------------------------------------------------------+
| page_title                                                          |
+---------------------------------------------------------------------+
| Cristian_Manea                                                      |
| Morgawr                                                             |
| XFL_Airabonita                                                      |
| Lego_Battles:_Ninjago                                               |
| FIFA_16.                                                            |
| Yessica_Puín_nació_en_1999/07/16_En_la_Ciudad_de_Bogotá-Colombia    |
+---------------------------------------------------------------------+
6 rows in set (9.70 sec)

Fixed:

root@labsdb1003[eswiki_p]> select page_title from page where page_is_new = 1 and page_touched like '2015061%' and page_namespace = 0 and page_is_redirect = 0 and page_touched = page_links_updated;
Empty set (9.90 sec)

root@labsdb1001[eswiki_p]> select page_title from page where page_is_new = 1 and page_touched like '2015061%' and page_namespace = 0 and page_is_redirect = 0 and page_touched = page_links_updated;
Empty set (10.43 sec)

There seem to be several pages on English Wikipedia which have been deleted but still appear on the Labs and Analytics Store replicas in the page table. For example:

MariaDB [enwiki_p]> select * from page where page_namespace = 0 AND page_title LIKE 'BatissForever';
+----------+----------------+---------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+
| page_id  | page_namespace | page_title    | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random    | page_touched   | page_links_updated | page_latest | page_len | page_content_model |
+----------+----------------+---------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+
| 49995280 |              0 | BatissForever |                   |            0 |                0 |           0 | 0.638177166992 | 20160331025325 | 20160330185410     |   712718594 |      212 | wikitext           |
+----------+----------------+---------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+

"several" is too vague- I have fixed the one given:

root@neodymium:~$ ./sql.py -h labsdb1001.eqiad.wmnet enwiki -e "select * from page where page_namespace = 0 AND page_title LIKE 'BatissForever'" --no-dry-run

Results for labsdb1001.eqiad.wmnet:3306/enwiki:
0 rows in set (0.00 sec)
root@neodymium:~$ ./sql.py -h labsdb1003.eqiad.wmnet enwiki -e "select * from page where page_namespace = 0 AND page_title LIKE 'BatissForever'" --no-dry-run

Results for labsdb1003.eqiad.wmnet:3306/enwiki:
0 rows in set (0.00 sec)

Here are a few more such deleted pages that I happen to know of:

Labs:

MariaDB [enwiki_p]> select count(*) from page where page_namespace in (10,11) and ( page_title like 'Cite_doi%' OR page_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 1215
1 row in set (0.04 sec)

MariaDB [enwiki_p]> select count(*) from redirect where rd_namespace in (10,11) and ( rd_title like 'Cite_doi%' OR rd_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 149
1 row in set (0.00 sec)

MariaDB [enwiki_p]> select count(*) from pagelinks where pl_namespace in (10,11) and ( pl_title like 'Cite_doi%' OR pl_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 7509
1 row in set (0.96 sec)

Production:

mysql:wikiadmin@db1083 [enwiki]> select count(*) from page where page_namespace in (10,11) and ( page_title like 'Cite_doi%' OR page_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 15
1 row in set (0.01 sec)

mysql:wikiadmin@db1083 [enwiki]> select count(*) from redirect where rd_namespace in (10,11) and ( rd_title like 'Cite_doi%' OR rd_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 2
1 row in set (0.00 sec)

mysql:wikiadmin@db1083 [enwiki]> select count(*) from pagelinks where pl_namespace in (10,11) and ( pl_title like 'Cite_doi%' OR pl_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 2205
1 row in set (0.03 sec)

To clarify, the only reliable solution is to start using the new servers now (labsdb-web.eqiad.wmnet and labsdb-analytics.eqiad.wmnet). See:

root@labsdb1009[enwiki_p]> select count(*) from page where page_namespace in (10,11) and ( page_title like 'Cite_doi%' OR page_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 15
1 row in set (0.01 sec)

root@labsdb1009[enwiki_p]> select count(*) from redirect where rd_namespace in (10,11) and ( rd_title like 'Cite_doi%' OR rd_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 2
1 row in set (0.00 sec)

root@labsdb1009[enwiki_p]> select count(*) from pagelinks where pl_namespace in (10,11) and ( pl_title like 'Cite_doi%' OR pl_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 2205
1 row in set (0.04 sec)

The fixes I will be doing to reconcile the existing labsdb1001 and labsdb1003 will only be poor hacks that will break again shortly.

Is there a task tracking the updating of the DNS entries such as enwiki.labsdb to point to the new servers?

No, we are not ready because not all wikis are available yet, but you can follow the meta ticket: T153058

We will finish the reimports first, then announce the new servers as an "opt-in" beta, and later change the DNS defaults.

Labs now:

root@labsdb1001[enwiki_p]> select count(*) from page where page_namespace in (10,11) and ( page_title like 'Cite_doi%' OR page_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 15
1 row in set (0.75 sec)

root@labsdb1001[enwiki_p]> select count(*) from redirect where rd_namespace in (10,11) and ( rd_title like 'Cite_doi%' OR rd_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 2
1 row in set (0.30 sec)

root@labsdb1001[enwiki_p]> select count(*) from pagelinks where pl_namespace in (10,11) and ( pl_title like 'Cite_doi%' OR pl_title like 'Cite_pmid%' )\G
*************************** 1. row ***************************
count(*): 2205
1 row in set (4.30 sec)

Thank you for the reports; keep reporting important differences as comments here. Just remember that the goal is to get rid of the unsafe current labs hosts so we do not have to keep doing this :-/

@jcrespo: I don't have any way to identify all the extra pages on Tool Labs. All I can say is that there are 1032 extra there:

Tool Labs:

MariaDB [enwiki_p]> select count(*) from page;
+----------+
| count(*) |
+----------+
| 42266366 |
+----------+

Production:

mysql:wikiadmin@db1083 [enwiki]> select count(*) from page;
+----------+
| count(*) |
+----------+
| 42265334 |
+----------+

Is there not any way to sync them? This is causing problems for us, but we can work around it if we have to.
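One way to narrow down where the extra rows live (a sketch; the bucket size is arbitrary) would be to compare bucketed counts between the replica and production:

SELECT FLOOR(page_id / 1000000) AS bucket,
       COUNT(*) AS rows_in_bucket
FROM page
GROUP BY bucket
ORDER BY bucket;

Buckets whose counts differ between the two hosts localize the drifted page_id ranges, which can then be listed row by row.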

Gorthian removed a subscriber: Gorthian. · May 24 2017, 7:44 PM
Tbayer added a subscriber: Tbayer. · May 24 2017, 9:49 PM

Maybe what we can try is to stop sanitarium2 and sanitarium at the same replication position and import the page table (it is around 7.3G).
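A minimal sketch of that approach (the host name and binlog coordinates are illustrative), run on each sanitarium host:

STOP SLAVE;
-- pick a binlog position at or ahead of both hosts, then on each host:
START SLAVE UNTIL MASTER_LOG_FILE='db1083-bin.002123', MASTER_LOG_POS=4;
-- with both hosts stopped at the same consistent position, re-import the
-- page table, then START SLAVE; to resume normal replication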

@Marostegui - you can do that if you want, but a) it will cause more complaints about missing pages and about the lag, and b) it will break again in only a few days (all enwiki tables were reimported in a long process that took 4 months, and by the end of it the data had already drifted again).

If Kaldari wants a reliable replica, he should start using labsdb-web.eqiad.wmnet or labsdb-analytics.eqiad.wmnet now. The current labsdb hosts have no fix, because MediaWiki does not support labs correctly and we have to work around that with the new replication method.

@jcrespo: Is there (or will there be) a way to access and create user databases on the new servers (as per https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#User_databases)?

@russblau No, the new databases are in read-only mode. We are unsure how we can support user databases there (because of T156869), so for now all new servers are fully read-only. You can create databases on toolsdb as usual.

bd808 moved this task from Triage to Database on the Labs board. · Thu, Jun 8, 4:30 AM

Cross-post from T166091: Labs database corruption:

SELECT pl_namespace, pl_title
FROM page
JOIN pagelinks ON pl_from = page_id
WHERE page_namespace=0 AND page_title="Ajinkya_Rahane"
AND pl_title LIKE "Ankit%";
+--------------+--------------+
| pl_namespace | pl_title     |
+--------------+--------------+
|            0 | Ankit_Sharma |
+--------------+--------------+

From a production database server:

mysql:wikiadmin@db1083 [enwiki]> SELECT pl_namespace, pl_title
    -> FROM page
    -> JOIN pagelinks ON pl_from = page_id
    -> WHERE page_namespace=0 AND page_title="Ajinkya_Rahane"
    -> AND pl_title LIKE "Ankit%";
+--------------+--------------------------+
| pl_namespace | pl_title                 |
+--------------+--------------------------+
|            0 | Ankit_Sharma_(cricketer) |
+--------------+--------------------------+
1 row in set (0.00 sec)
bd808 added a subscriber: Ankry. · Edited · Fri, Jun 9, 11:17 PM

Copied from T167554: Missing plwikisource_p.revision and plwikisource_p.page data on labs replica database by @Ankry:

As this query shows (https://quarry.wmflabs.org/query/19383), revisions from 1147609 to 1147764 are missing from the plwikisource_p.revision table in the labs database.

They do not seem to be deleted, so some Quarry queries report incomplete revision lists (or an incorrect revision count) for specific pages, e.g.:
https://quarry.wmflabs.org/query/19385
vs.
https://pl.wikisource.org/w/index.php?title=Strona:M._Arcta_s%C5%82ownik_ilustrowany_j%C4%99zyka_polskiego_-_Tom_1.djvu/732&action=history

Can it be fixed?

LabsDB:
$ sql plwikisource_p
MariaDB [plwikisource_p]> SELECT rev_id FROM revision WHERE rev_id = 1147693;
Empty set (0.00 sec)

MariaDB [plwikisource_p]> SELECT MAX(rev_id) FROM revision WHERE rev_id < 1147693;
+-------------+
| MAX(rev_id) |
+-------------+
|     1147608 |
+-------------+
1 row in set (0.00 sec)

MariaDB [plwikisource_p]> SELECT MIN(rev_id) FROM revision WHERE rev_id > 1147693;
+-------------+
| MIN(rev_id) |
+-------------+
|     1147765 |
+-------------+
1 row in set (0.00 sec)
Production:
$ sql plwikisource
(wikiadmin@db1077) [plwikisource]> SELECT rev_id FROM revision WHERE rev_id = 1147693;
+---------+
| rev_id  |
+---------+
| 1147693 |
+---------+
1 row in set (0.00 sec)

(wikiadmin@db1077) [plwikisource]> SELECT MAX(rev_id) FROM revision WHERE rev_id < 1147693;
+-------------+
| MAX(rev_id) |
+-------------+
|     1147692 |
+-------------+
1 row in set (0.00 sec)

(wikiadmin@db1077) [plwikisource]> SELECT MIN(rev_id) FROM revision WHERE rev_id > 1147693;
+-------------+
| MIN(rev_id) |
+-------------+
|     1147694 |
+-------------+
1 row in set (0.00 sec)
Ankry added a comment. · Sat, Jun 10, 5:56 AM

The revisions missing from plwikisource_p were created between 20160428113148 and 20160428122541.
Page table entries created in the same period, e.g.
https://pl.wikisource.org/w/index.php?title=Specjalna:Wk%C5%82ad/AkBot&dir=prev&offset=20160428110138&limit=130&target=AkBot
are also missing. (I checked a few at random from the above list - none is found in labs.)
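A generic way to list such gaps (a sketch; workable on a small wiki like plwikisource but slow on large ones, and note that legitimately deleted revisions also leave rev_id gaps, so results must be cross-checked against production):

SELECT r1.rev_id + 1 AS gap_start
FROM revision r1
LEFT JOIN revision r2 ON r2.rev_id = r1.rev_id + 1
WHERE r2.rev_id IS NULL
  AND r1.rev_id < (SELECT MAX(rev_id) FROM revision);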

bd808 triaged this task as Normal priority. · Tue, Jun 13, 7:25 PM
bd808 added a project: Epic.
bd808 added a comment. · Tue, Jun 20, 3:26 AM

Linked T168349: enwiki_p logging vs logging_userindex returning dramatically different results as a child task. The report there is too long to paste into this task.

MZMcBride updated the task description. · Wed, Jun 21, 12:47 AM

Cross-post from T166091: Labs database corruption:

SELECT pl_namespace, pl_title
 FROM page
 JOIN pagelinks ON pl_from = page_id
 WHERE page_namespace=0 AND page_title="Ajinkya_Rahane"
 AND pl_title LIKE "Ankit%";
 +--------------+--------------+
 | pl_namespace | pl_title     |
 +--------------+--------------+
 |            0 | Ankit_Sharma |
 +--------------+--------------+

This is still happening on the default and busted replicas:

mzmcbride@tools-bastion-03:~$ sql enwiki_p;
MariaDB [enwiki_p]> SELECT pl_namespace, pl_title FROM page JOIN pagelinks ON pl_from = page_id WHERE page_namespace=0 AND page_title="Ajinkya_Rahane" AND pl_title LIKE "Ankit%";
+--------------+--------------+
| pl_namespace | pl_title     |
+--------------+--------------+
|            0 | Ankit_Sharma |
+--------------+--------------+
1 row in set (0.01 sec)

But when connecting to labsdb-web.eqiad.wmnet, there's better data:

mzmcbride@tools-bastion-03:~$ mysql -hlabsdb-web.eqiad.wmnet enwiki_p;
MariaDB [enwiki_p]> SELECT pl_namespace, pl_title FROM page JOIN pagelinks ON pl_from = page_id WHERE page_namespace=0 AND page_title="Ajinkya_Rahane" AND pl_title LIKE "Ankit%";
+--------------+--------------------------+
| pl_namespace | pl_title                 |
+--------------+--------------------------+
|            0 | Ankit_Sharma_(cricketer) |
+--------------+--------------------------+
1 row in set (0.00 sec)

I updated this task's description a bit.