enwiki_p replica on s1 is corrupted
Closed, Resolved · Public

Description

The enwiki_p replica on s1 appears to be corrupt. The pagelinks table contains links that no longer exist on the live wiki.

Example (just one of many I could demonstrate):

https://en.wikipedia.org/w/api.php?action=query&list=backlinks&bltitle=KIFM&blnamespace=0&format=jsonfm shows that there is only 1 article linking to the title "KIFM" as of now.

On Tool Labs, the database query copied below returns 220 rows, even though replication lag is currently around 1 second:

MariaDB [enwiki_p]> select page_title from page join pagelinks on pl_from = page_id where page_namespace=0 and pl_namespace = 0 and pl_title="KIFM";
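The comparison described above (API backlinks versus replica rows) could be scripted roughly as follows. This is only a sketch: the helper names `backlinks_url`, `normalize` and `compare_titles` are made up for illustration, fetching the API response (including paging with `blcontinue`) and running the SQL are left out, and it assumes that API titles use spaces where `page_title` uses underscores.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def backlinks_url(title, namespace=0, limit=500):
    """Build the list=backlinks API URL for a title (JSON output)."""
    params = {
        "action": "query",
        "list": "backlinks",
        "bltitle": title,
        "blnamespace": namespace,
        "bllimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def normalize(title):
    """Convert an API-style title to the underscore form stored in page_title."""
    return title.replace(" ", "_")


def compare_titles(api_titles, replica_titles):
    """Report titles present on only one side (hypothetical helper)."""
    api = {normalize(t) for t in api_titles}
    replica = {normalize(t) for t in replica_titles}
    return {
        "only_on_replica": sorted(replica - api),
        "only_in_api": sorted(api - replica),
    }
```

Titles that show up under `only_on_replica` would be candidates for stale pagelinks rows; titles under `only_in_api` would be rows the replica failed to insert.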

Event Timeline

russblau created this task. May 2 2016, 9:45 PM
Restricted Application added a project: Cloud-Services. May 2 2016, 9:45 PM
Restricted Application added subscribers: Zppix, Aklapper.
chasemp triaged this task as High priority. May 24 2016, 9:52 PM
chasemp added a project: DBA.

The replica is not corrupt; it has drifted from production, failing to delete and insert some records, for several reasons. The main ones are crashes while using non-transactional engines and the fact that the replica does not contain all rows (some have been intentionally omitted for privacy reasons, which can affect certain rows, such as revision rows on deletion/undeletion).

There is an ongoing reimport of all labs data from production that will fix those issues (T126946#2317370): the reimport of the revision table is in progress, and pagelinks is the last table to be reimported.

Again, I ask for patience: there are thousands of millions of records that must be not only imported one by one but also filtered, which takes time. We have already corrected 90% of the rows on enwiki; the other wikis will go next.

jcrespo claimed this task. May 25 2016, 7:25 AM
jcrespo moved this task from Triage to In progress on the DBA board.
Tb added a subscriber: Tb. May 25 2016, 8:55 AM
valhallasw moved this task from Triage to In Progress on the Toolforge board. May 27 2016, 1:17 PM
russblau added a comment. Edited Nov 5 2016, 2:04 PM

Is there any update on the status of the database? The erroneous entries reported back in May are still an issue. It has been nearly six months; is the reimport complete yet? If not, is there an ETA?

Note that the "KIFM" example in my original post is no longer valid because the page has since been moved. You can replace KIFM with "Will_Johnson" for a current example. On the live wiki, there are 4 incoming links; on Tool Labs, there are 7.

@russblau See my comment at T138967#2775777. Expanding on that, the "in place" import caused a lot of disruption (replication lag, which other users complained about). The decision taken was to buy newer, faster servers and load them in parallel with the existing service (which is why no change seems visible on the current servers anymore).

The ETA for making the new servers available is the end of the calendar year (hopefully before that). The reason it is taking so long is that this time we are also converting the tables to a format (InnoDB) that should make reimporting from production much easier if problems reappear, as well as trying to fix those problems in the first place. Converting the data is a very slow process, but it will be worth it, and I expect to have over 5x the current labs capacity.

This is a Labs goal, which means it is a #1 priority for both Labs and DBAs.

scfc added a subscriber: scfc. Dec 4 2016, 6:35 PM

@jcrespo: What is the Phabricator task of DBA that covers the work needed to resolve this task? T147052? If so, could you please add it as a blocking task to this task? That way, questions à la "Are we there yet?" answer themselves :-).

@scfc This is technically solved on the new servers; enwiki is fine there, but we most likely will not fix it on the current servers.

When run on enwiki production master I get:

225 rows in set (0.11 sec)

When run on the new labsdb servers, I get:

225 rows in set (0.01 sec)

Notice the query even returns 10 times faster, which seems about right.

I can see that the API call returns far fewer results. I do not know exactly what the API does, or whether the problem is in production, but I can now say with full confidence that the enwiki_p replica service is no longer corrupted on the new (and soon, only) servers.

I will wait for feedback to see whether we resolve this, declare it invalid, or whether you want to rename it (or open a new ticket) for your API issue. The problem did exist before, I do not deny that (I fixed that bit), but please note that many things in MediaWiki are updated asynchronously, so the database does not necessarily reflect the current state of the edits (parsing and updating things that are not pure wikitext can take seconds, minutes, or hours after the edit happens).

The new servers are not yet properly documented, but feel free to reach out to me if you need the correct results now and cannot wait for an official announcement and documentation.

jcrespo closed this task as Resolved. Jan 24 2017, 2:56 PM

Could you please share the configuration details of the new servers? Most of my tools, like this one (running on trwiki, c1 replica), are still showing outdated data. Thanks.

@Superyetkin I cannot guarantee it will not change in the future, but in the case of enwiki you can connect to the labsdb-web.eqiad.wmnet host for short-lived, web-like requests, and to labsdb-analytics.eqiad.wmnet for long-running, analytics-like requests. Use the same user and password that you currently use for the current replica service, from the replica.my.cnf file. Local writes are not yet allowed.

Please provide feedback, whether positive or negative (it would be highly valuable for discovering problems), although I would suggest doing so on a separate ticket for organizational reasons.
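For tool authors, the connection details given above (one host per workload, credentials taken from replica.my.cnf) might be assembled along these lines. This is a minimal sketch, not an official recipe: the function name `connection_params` is hypothetical, the host names are simply the ones quoted in this thread and may change, and it assumes replica.my.cnf has a `[client]` section with `user` and `password` keys whose values may be quoted.

```python
import configparser

# Host names as quoted in this thread; they are not guaranteed to stay valid.
HOSTS = {
    "web": "labsdb-web.eqiad.wmnet",              # short-lived, web-like requests
    "analytics": "labsdb-analytics.eqiad.wmnet",  # long-running, analytics-like requests
}


def connection_params(cnf_text, workload="web", database="enwiki_p"):
    """Turn the contents of replica.my.cnf into connection settings.

    Assumes a [client] section with 'user' and 'password' keys.
    """
    # interpolation=None so a '%' in the password is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cnf_text)
    client = parser["client"]
    return {
        "host": HOSTS[workload],
        "user": client["user"].strip("'\""),
        "password": client["password"].strip("'\""),
        "database": database,
    }
```

The resulting dictionary can then be handed to whichever MySQL client library the tool already uses; the choice of library is left open here.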

I am getting timeout errors on labsdb-web.eqiad.wmnet .

Can you give us some more details? From where, etc?

Here is a page where the connection method (mysql_connect) is being called.

Also, does this mean labsdb-analytics.eqiad.wmnet works for you?

That's now resolved, but I still have trouble logging in:

krenair@tools-bastion-03:~$ mysql -h labsdb1001.eqiad.wmnet
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 94024977
Server version: 10.0.15-MariaDB Source distribution

Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> quit
Bye
krenair@tools-bastion-03:~$ mysql -h labsdb-web.eqiad.wmnet
ERROR 1045 (28000): Access denied for user 'u2170'@'10.64.37.15' (using password: YES)

Woah, wait, what? I can get in as s52299 but not as u2170?

tools.alex@tools-bastion-03:~$ mysql -h labsdb-web.eqiad.wmnet
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 379663
Server version: 10.1.21-MariaDB MariaDB Server

Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]>