Data integrity issue with enwiki_p user_groups on Wikimedia Tool Labs (missing rows)
Closed, ResolvedPublic

Description

Possibly a duplicate of T133469: Discrepancy between labsdb replicas of arwiki_p.user_groups.

MariaDB [enwiki_p]> select ug_group from user_groups join user on user_id = ug_user and user_name = 'Train2104';
+----------------+
| ug_group       |
+----------------+
| filemover      |
| patroller      |
| reviewer       |
| rollbacker     |
| templateeditor |
+----------------+
5 rows in set (0.00 sec)

The result is missing "extendedconfirmed". Compare with https://en.wikipedia.org/wiki/Special:ListUsers/Train2104, which reads, in part:

Train2104 (talk | contribs)‏‎ (extended confirmed user, file mover, new page reviewer, pending changes reviewer, rollbacker, template editor)

The analytics slaves are apparently better:

[12:27] <MatmaRex> mysql:research@analytics-store.eqiad.wmnet [enwiki]> select ug_group from user_groups join user on user_id = ug_user and user_name = 'Train2104';
[12:27] <MatmaRex> +-------------------+
[12:27] <MatmaRex> | ug_group          |
[12:27] <MatmaRex> +-------------------+
[12:27] <MatmaRex> | extendedconfirmed |
[12:27] <MatmaRex> | filemover         |
[12:27] <MatmaRex> | patroller         |
[12:27] <MatmaRex> | reviewer          |
[12:27] <MatmaRex> | rollbacker        |
[12:27] <MatmaRex> | templateeditor    |
[12:27] <MatmaRex> +-------------------+
[12:27] <MatmaRex> 6 rows in set (0.02 sec)
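
To make the discrepancy explicit, the two result sets above can be compared mechanically. This is only a minimal sketch in Python: the helper name `missing_groups` is made up for illustration, and the lists are copied by hand from the query output rather than fetched from either server.

```python
def missing_groups(reference_rows, replica_rows):
    """Return groups present in the reference result but absent from the replica."""
    return sorted(set(reference_rows) - set(replica_rows))


# Result sets copied from the two queries above.
labs_replica = ["filemover", "patroller", "reviewer", "rollbacker", "templateeditor"]
analytics = [
    "extendedconfirmed",
    "filemover",
    "patroller",
    "reviewer",
    "rollbacker",
    "templateeditor",
]

print(missing_groups(analytics, labs_replica))  # ['extendedconfirmed']
```

The same comparison run in the other direction returns an empty list, so the Labs replica has lost a row rather than gained one.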

Reported here: https://en.wikipedia.org/w/index.php?title=User_talk:BernsteinBot&oldid=768156072#Train2104_not_shown_as_extended_confirmed.

I'll now patiently wait for @jcrespo to come along and tell me how row-based replication will one day make this better. When is that day, exactly?

@bd808: Since you're continuously curious about Labs frustrations, the replicas are definitely top-five.

Event Timeline

Hi,

Indeed, the new labs servers (running ROW-based replication) are fine, and that drift is probably coming from multiple crashes, rebuilds and who knows what on the old labsdb servers.
This is one of the new labs servers:

mysql:root@localhost [enwiki_p]> select ug_group from user_groups join user on user_id = ug_user and user_name = 'Train2104';
+-------------------+
| ug_group          |
+-------------------+
| extendedconfirmed |
| filemover         |
| patroller         |
| reviewer          |
| rollbacker        |
| templateeditor    |
+-------------------+
6 rows in set (0.00 sec)

ROW-based replication is now a reality on the new labsdb and new sanitarium servers, which currently hold s1, s3 and s4. As you said, it will prevent drifts like this in the future by ensuring data integrity (otherwise replication would break). We'll try to import the other shards soon, but it is a task that takes time. We are doing our best. You can follow the progress on which shards we import, and how, here: T153743
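
As a rough illustration of why that matters, here is a toy model in plain Python, not MariaDB's actual replication code. It assumes a replica that has already drifted (the "extendedconfirmed" row is missing) and then receives a delete for that row. All names here (`apply_statement_delete`, `apply_row_delete`, `ReplicationBroken`) are hypothetical.

```python
class ReplicationBroken(Exception):
    """Raised in this toy model when a row event cannot be applied."""


def apply_statement_delete(table, user, group):
    """Statement-style: re-run the DELETE on the replica and report rows affected."""
    before = len(table)
    table.discard((user, group))
    return before - len(table)  # 0 rows affected is not treated as an error


def apply_row_delete(table, row):
    """Row-style: the event names one exact row; a missing row stops replication."""
    if row not in table:
        raise ReplicationBroken(f"row {row!r} not found on replica")
    table.remove(row)


# A replica that has already drifted: the extendedconfirmed row is missing.
drifted = {("Train2104", "filemover"), ("Train2104", "rollbacker")}
row = ("Train2104", "extendedconfirmed")

affected = apply_statement_delete(set(drifted), *row)  # drift goes unnoticed

try:
    apply_row_delete(set(drifted), row)
    broke = False
except ReplicationBroken:
    broke = True  # drift is detected because the event cannot be applied
```

In the statement-style case the replica carries on with the wrong data; in the row-style case the mismatch surfaces immediately as a replication failure that someone has to fix.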

Cheers!

Since you're continuously curious about Labs frustrations, the replicas are definitely top-five.

Noted. We are really hoping that the new db servers and the row based replication strategy will help with many of the 'drift' issues. The proof of this of course will be the actual experience that people have after switching everything over.

All the drift issues that have been reported lately are happening only on the old labsdb servers; the new ones aren't suffering from this issue :-)

@MZMcBride , why wait when you can go NOW and test the new servers? As many told you, enwiki is there now and fixed. :-)

jcrespo claimed this task.
root@labsdb1001[enwiki_p]> select ug_group from user_groups join user on user_id = ug_user and user_name = 'Train2104';
+-------------------+
| ug_group          |
+-------------------+
| extendedconfirmed |
| filemover         |
| patroller         |
| reviewer          |
| rollbacker        |
| templateeditor    |
+-------------------+
6 rows in set (0.00 sec)

@MZMcBride , why wait when you can go NOW and test the new servers? As many told you, enwiki is there now and fixed. :-)

I always use sql enwiki_p to connect. Should I be using a different command?

Related question: in scripts, I connect to enwiki_p for the database name and enwiki.labsdb for the database host name. Should I update my scripts to point elsewhere?

We are going to maintain the enwiki_p database names, that is certain. We will most likely maintain enwiki.labsdb as the host, but we will not change it to point to the new service for now, until people say it works properly.

For a limited amount of time (because it is still considered beta, unannounced and undocumented, and can be changed at any time in the future) you will be able to connect to the new servers at "labsdb-web.eqiad.wmnet" (for fast, lag-less, concurrent requests) and "labsdb-analytics.eqiad.wmnet" (for slow but not concurrent queries). Those are not public, meaning that the DNS names may change or be disabled in the future, but they can be tested now and you can provide feedback if something is wrong. They use the same authentication as the current, old servers. They use ROW-based replication, and right now they are much more reliable and faster (I would say the data is 100% reliable compared to production, but that is yet to be confirmed). We are slowing down the fixes for the old ones because we are working instead on making sure the new ones never break in the first place. We have not loaded all shards: only enwiki, s3 and commons; wikidata is next.
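
Since those host names may change, a script that wants to try the new servers could keep the host configurable instead of hard-coding it. A minimal sketch, assuming Python: the `REPLICA_HOST` environment variable and both function names are invented for this example, and only the host and database names come from the comments above.

```python
import os

# Host names quoted from the comment above; the two new ones are beta and may change.
DEFAULT_HOSTS = {
    "legacy": "enwiki.labsdb",
    "web": "labsdb-web.eqiad.wmnet",
    "analytics": "labsdb-analytics.eqiad.wmnet",
}


def replica_host(kind="legacy", env=None):
    """Pick a replica host; a hypothetical REPLICA_HOST variable overrides the default."""
    env = os.environ if env is None else env
    override = env.get("REPLICA_HOST")
    if override:
        return override
    return DEFAULT_HOSTS[kind]


def connection_settings(kind="legacy", env=None):
    """Build host/database settings; the database name stays enwiki_p in every case."""
    return {"host": replica_host(kind, env), "database": "enwiki_p"}
```

For example, `connection_settings("web")` selects the new low-lag host, while the default keeps pointing at enwiki.labsdb until the switch is announced.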

You can follow the whole process at: T140788 and the load in particular at T153743.

When we are fully happy with how things are set up, we will write proper documentation (in general it should be "keep doing the same") and make an official announcement.