Page MenuHomePhabricator

Confusion about the relationship between recentchanges and revision history (was: Missing data in database replicas)
Closed, InvalidPublic

Description

Thousands of rows are missing, at least in recentchanges, projects eswiki, enwiki and fawiki (probably more), and date 2020-05-30, see:

https://quarry.wmflabs.org/query/45489 eswiki
https://quarry.wmflabs.org/query/45491 fawiki
https://quarry.wmflabs.org/query/45492 enwiki

Originally pasted in P11348; thanks to @AntiCompositeNumber for the checking and queries to get this problem "completed"

Event Timeline

-jem- triaged this task as High priority.Jun 1 2020, 8:41 PM
AntiCompositeNumber raised the priority of this task from High to Needs Triage.Jun 1 2020, 8:44 PM
AntiCompositeNumber edited projects, added Data-Services; removed DBA.
bd808 added a subscriber: bd808.

Tagging with DBA to get some ideas of what might be happening here. It may be somehow related to the ongoing work on T249188: Reimage labsdb1011 to Buster and MariaDB 10.4, but I do not know how to verify or disprove that.

The issue seems to be roughly constant per day over the past month (https://quarry.wmflabs.org/query/45496). Doesn't really help figure out when it started, other than knowing it was at least a month ago.

The issue seems to be roughly constant per day over the past month (https://quarry.wmflabs.org/query/45496). Doesn't really help figure out when it started, other than knowing it was at least a month ago.

I get the same results from a production host, so maybe the flaw here is in the test itself?

wikiadmin@10.64.48.153(eswiki)> SELECT LEFT(rev_timestamp, 8), COUNT(*)
    -> FROM revision
    -> LEFT JOIN recentchanges ON rev_id = rc_cur_id
    -> WHERE
    -> rev_timestamp > 20200502000000
    ->     AND rc_cur_id IS NULL
    -> GROUP BY LEFT(rev_timestamp, 8);
+------------------------+----------+
| LEFT(rev_timestamp, 8) | COUNT(*) |
+------------------------+----------+
| 20200502               |    26658 |
| 20200503               |    27095 |
| 20200504               |    28330 |
| 20200505               |    27667 |
| 20200506               |    27344 |
| 20200507               |    29112 |
| 20200508               |    28239 |
| 20200509               |    26335 |
| 20200510               |    25620 |
| 20200511               |    28317 |
| 20200512               |    28252 |
| 20200513               |    26782 |
| 20200514               |    28960 |
| 20200515               |    26959 |
| 20200516               |    23789 |
| 20200517               |    25595 |
| 20200518               |    28124 |
| 20200519               |    32133 |
| 20200520               |    27707 |
| 20200521               |    27984 |
| 20200522               |    26246 |
| 20200523               |    27008 |
| 20200524               |    28035 |
| 20200525               |    30325 |
| 20200526               |    30597 |
| 20200527               |    29350 |
| 20200528               |    29615 |
| 20200529               |    24893 |
| 20200530               |    23566 |
| 20200531               |    26182 |
| 20200601               |    28105 |
| 20200602               |     1416 |
+------------------------+----------+
32 rows in set (1 min 50.72 sec)

Doh! That should be rc_this_oldid not rc_cur_id. rc_cur_id is a key to page_id, not rev_id.
https://quarry.wmflabs.org/query/45496 and the other queries should be correct now. The numbers are a lot lower (which is good) but still extend through the whole month.

Focusing on the Nocem_Collado page from P11348, I see this on the wiki replicas:

analytics cluster
MariaDB [eswiki_p]> select rc_timestamp, rc_type, comment_text from recentchanges left join actor on rc_actor = actor_id left join comment on rc_comment_id = comment_id where rc_timestamp > '20200530000000' and actor_name = 'Ytha67' and rc_title = 'Nocem_Collado' order by rc_timestamp;
+----------------+---------+-----------------------------------------------------------------------------------------------------------+
| rc_timestamp   | rc_type | comment_text                                                                                              |
+----------------+---------+-----------------------------------------------------------------------------------------------------------+
| 20200530172228 |       0 | Añadiendo la [[Categoría:Directoras de cine de Espa
ña]] mediante [[Wikipedia:HotCat|HotCat]]              |
| 20200530172521 |       5 | Un elemento Wikidata se ha vinculado a esta página.                                                       |
| 20200530172701 |       1 | interés wikiproyecto mujeres                                                                              |
| 20200530172731 |       0 | Añadiendo la [[Categoría:Fotógrafas de España]] mediante [[Wikipedia:HotCat|HotCat]]                      |
| 20200530172853 |       0 | Añadiendo la [[Categoría:Directores de cine documental]] mediante [[Wikipedia:HotCat|HotCat]]             |
| 20200530172916 |       0 | Añadiendo la [[Categoría:Realizadores de documentales de España]] mediante [[Wikipedia:HotCat|HotCat]]    |
| 20200530172940 |       0 | Añadiendo la [[Categoría:Granadinos]] mediante [[Wikipedia:HotCat|HotCat]]                                |
| 20200530173531 |       0 | correcciones                                                                                              |
| 20200530173933 |       0 | ajustes plantilla NF (nacida en siglo XX)                                                                 |
| 20200530174124 |       0 | Añadiendo la [[Categoría:Españolas del siglo XXI]] mediante [[Wikipedia:HotCat|HotCat]]                   |
| 20200530174454 |       5 | /* wbsetclaim-create:2||1 */ [[Property:P569]]: 20. century                                               |
+----------------+---------+-----------------------------------------------------------------------------------------------------------+
11 rows in set (11.47 sec)
web cluster
MariaDB [eswiki_p]> select rc_timestamp, rc_type, comment_text from recentchanges left join actor on rc_actor = actor_id left join comment on rc_comment_id = comment_id where rc_timestamp > '20200530000000' and actor_name = 'Ytha67' and rc_title = 'Nocem_Collado' order by rc_timestamp;
+----------------+---------+-----------------------------------------------------------------------------------------------------------+
| rc_timestamp   | rc_type | comment_text                                                                                              |
+----------------+---------+-----------------------------------------------------------------------------------------------------------+
| 20200530172228 |       0 | Añadiendo la [[Categoría:Directoras de cine de Espa
ña]] mediante [[Wikipedia:HotCat|HotCat]]              |
| 20200530172521 |       5 | Un elemento Wikidata se ha vinculado a esta página.                                                       |
| 20200530172701 |       1 | interés wikiproyecto mujeres                                                                              |
| 20200530172731 |       0 | Añadiendo la [[Categoría:Fotógrafas de España]] mediante [[Wikipedia:HotCat|HotCat]]                      |
| 20200530172853 |       0 | Añadiendo la [[Categoría:Directores de cine documental]] mediante [[Wikipedia:HotCat|HotCat]]             |
| 20200530172916 |       0 | Añadiendo la [[Categoría:Realizadores de documentales de España]] mediante [[Wikipedia:HotCat|HotCat]]    |
| 20200530172940 |       0 | Añadiendo la [[Categoría:Granadinos]] mediante [[Wikipedia:HotCat|HotCat]]                                |
| 20200530173531 |       0 | correcciones                                                                                              |
| 20200530173933 |       0 | ajustes plantilla NF (nacida en siglo XX)                                                                 |
| 20200530174124 |       0 | Añadiendo la [[Categoría:Españolas del siglo XXI]] mediante [[Wikipedia:HotCat|HotCat]]                   |
| 20200530174454 |       5 | /* wbsetclaim-create:2||1 */ [[Property:P569]]: 20. century                                               |
+----------------+---------+-----------------------------------------------------------------------------------------------------------+
11 rows in set (10.16 sec)

And in production:

prod replica
wikiadmin@10.64.32.198(eswiki)> select rc_timestamp, rc_type, comment_text from recentchanges left join actor on rc_actor = actor_id left join comment on rc_comment_id = comment_id where rc_timestamp > '20200530000000' and actor_name = 'Ytha67' and rc_title = 'Nocem_Collado' order by rc_timestamp;
+----------------+---------+-----------------------------------------------------------------------------------------------------------+
| rc_timestamp   | rc_type | comment_text                                                                                              |
+----------------+---------+-----------------------------------------------------------------------------------------------------------+
| 20200530172228 |       0 | Añadiendo la [[Categoría:Directoras de cine de Espa
ña]] mediante [[Wikipedia:HotCat|HotCat]]              |
| 20200530172521 |       5 | Un elemento Wikidata se ha vinculado a esta página.                                                       |
| 20200530172701 |       1 | interés wikiproyecto mujeres                                                                              |
| 20200530172731 |       0 | Añadiendo la [[Categoría:Fotógrafas de España]] mediante [[Wikipedia:HotCat|HotCat]]                      |
| 20200530172853 |       0 | Añadiendo la [[Categoría:Directores de cine documental]] mediante [[Wikipedia:HotCat|HotCat]]             |
| 20200530172916 |       0 | Añadiendo la [[Categoría:Realizadores de documentales de España]] mediante [[Wikipedia:HotCat|HotCat]]    |
| 20200530172940 |       0 | Añadiendo la [[Categoría:Granadinos]] mediante [[Wikipedia:HotCat|HotCat]]                                |
| 20200530173531 |       0 | correcciones                                                                                              |
| 20200530173933 |       0 | ajustes plantilla NF (nacida en siglo XX)                                                                 |
| 20200530174124 |       0 | Añadiendo la [[Categoría:Españolas del siglo XXI]] mediante [[Wikipedia:HotCat|HotCat]]                   |
| 20200530174454 |       5 | /* wbsetclaim-create:2||1 */ [[Property:P569]]: 20. century                                               |
+----------------+---------+-----------------------------------------------------------------------------------------------------------+
11 rows in set (0.00 sec)
prod primary
wikiadmin@10.64.16.104(eswiki)> select rc_timestamp, rc_type, comment_text from recentchanges left join actor on rc_actor = actor_id left join comment on rc_comment_id = comment_id where rc_timestamp > '20200530000000' and actor_name = 'Ytha67' and rc_title = 'Nocem_Collado' order by rc_timestamp;
+----------------+---------+-----------------------------------------------------------------------------------------------------------+
| rc_timestamp   | rc_type | comment_text                                                                                              |
+----------------+---------+-----------------------------------------------------------------------------------------------------------+
| 20200530172228 |       0 | Añadiendo la [[Categoría:Directoras de cine de Espa
ña]] mediante [[Wikipedia:HotCat|HotCat]]              |
| 20200530172521 |       5 | Un elemento Wikidata se ha vinculado a esta página.                                                       |
| 20200530172701 |       1 | interés wikiproyecto mujeres                                                                              |
| 20200530172731 |       0 | Añadiendo la [[Categoría:Fotógrafas de España]] mediante [[Wikipedia:HotCat|HotCat]]                      |
| 20200530172853 |       0 | Añadiendo la [[Categoría:Directores de cine documental]] mediante [[Wikipedia:HotCat|HotCat]]             |
| 20200530172916 |       0 | Añadiendo la [[Categoría:Realizadores de documentales de España]] mediante [[Wikipedia:HotCat|HotCat]]    |
| 20200530172940 |       0 | Añadiendo la [[Categoría:Granadinos]] mediante [[Wikipedia:HotCat|HotCat]]                                |
| 20200530173531 |       0 | correcciones                                                                                              |
| 20200530173933 |       0 | ajustes plantilla NF (nacida en siglo XX)                                                                 |
| 20200530174124 |       0 | Añadiendo la [[Categoría:Españolas del siglo XXI]] mediante [[Wikipedia:HotCat|HotCat]]                   |
| 20200530174454 |       5 | /* wbsetclaim-create:2||1 */ [[Property:P569]]: 20. century                                               |
+----------------+---------+-----------------------------------------------------------------------------------------------------------+
11 rows in set (0.00 sec)

The database results all match each other, but they don't match the page history output. This makes me think that there is something not right about the question being asked.

I just did a quick check on eswiki and fawiki and the number of rows is exactly the same in production than in sanitarium (master for labsdbhosts) and on labsdb1012

root@cumin1001:/home/marostegui# for i in db1079 db1125:3317 labsdb1012; do echo $i; mysql.py -h$i -e "select count(*) from fawiki.recentchanges; select count(*) from eswiki.recentchanges"; done
db1079
+----------+
| count(*) |
+----------+
|   680762 |
+----------+
+----------+
| count(*) |
+----------+
|  1852326 |
+----------+
db1125:3317
+----------+
| count(*) |
+----------+
|   680762 |
+----------+
+----------+
| count(*) |
+----------+
|  1852326 |
+----------+
labsdb1012
+----------+
| count(*) |
+----------+
|   680762 |
+----------+
+----------+
| count(*) |
+----------+
|  1852326 |
+----------+

That along with @bd808 results make me doubt if this is really a data (in)consistency issue.

I made some deeper research and this is a misunderstanding of how recentchanges and revision tables work. While at first one could think that recentchanges is just a smaller version of the revision table, there is no 1:1 relationship between them, both ways:

  • Recentchanges contains data that is not on revision: category updates, external edits from wikidata, logs (among others) are sent to recentchanges and they don't have an equivalent on revision, as technically, for those, no local, user-driven edits were created
  • Revision contains data that is not on recentchanges: When a page is moved or deleted, old records on recentchanges can become orphan, or be deleted. While recentchanges can be used in some cases, a clean reconstruction of what happened is better done through revision and log tables only, as recentchanges only care about instant actions for recent review, and not historical consistency.

For example, revision contains data of https://es.wikipedia.org/w/index.php?title=Wikipedia:Una_mujer&action=history but because that page was moved away from https://es.wikipedia.org/w/index.php?title=Usuaria:Jaluj/Taller/Una_mujer&action=edit&redlink=1 and then deleted, all rcs pointing to it where also deleted.

I would close this as invalid- database replicas are *not missing data*, they have the same as production. I checked and this only affects moved/deleted pages.

IF recentchanges for a page should not be deleted on page deletion is something that I cannot say if intentional or not, as the revisions to that page could have been moved elsewhere. I suggest that if you believe that to be bug, to file a ticket for this behaviour to change on production aka "mediawiki core", but in terms of wikireplica service we can only offer the same data as in production. I would agree that this behaviour is not necessarily cleanly documented or easy to follow except by reading the source code or checking the database binlogs. :-)

I believe the analytics team is working on tools for cleaner history reconstruction, I recommend you to ask on the Wikimedia analytics mailing list to ask how they do that as an alternative.

bd808 renamed this task from Missing data in database replicas to Confusion about the relationship between recentchanges and revision history (was: Missing data in database replicas).Jun 2 2020, 3:20 PM
bd808 closed this task as Invalid.

To summarize what happens:

  1. Edits happen on page P
  2. Page P is renamed to page Q, a redirect is left behind. All revisions keep existing, but they now belong to Q
  3. Page Q is deleted, it gets archived/non visible
  4. As a consequence of 2 or 3, probably the latter, recentchanges entries from P are deleted

In the general case, edits of deleted pages should be deleted when a page is deleted and only leaving behind a rc reference to the deletion log, but I wonder if they should be kept if the revisions are still visible? Someone with better MW knowledge should judge if that is intended (because it is not easy to match those revisions) or if rc could be updated to point to thew new page. What if a partial restore happens? A bit of an edge case that I see developers not wanting to support.

Thanks for your complete explanation, @jcrespo. If I have understood, edits previous to a page move can't be tracked with recentchanges even if they are "recent" and viewable in the history, so I will have to fix accordingly my statistics bot for edit-a-thons... this had never come up after several years (I guess nobody moved sandboxes to the main namespace, or nobody noticed the problem and warned about it), so I thought in a bug and they didn't "stopped me" in irc; sorry about that.

@-jem-: I think using revision + log would be the best way for that kind of analysis. In any case, there is always a way to alter history after the fact (deletes, merges, restores), which is both necessary but problematic for historical analysis.

sorry about that.

Don't be sorry. As I said above, I am not sure it is not a bug or not, what I meant is the ticket as reported "missing data in database replicas" is non-actionable by our teams (persistence and cloud)- data is equivalent to production and coded with that behavior. I encourage you to create a ticket with T254193#6185442 to the developers (MediaWiki-Recent-changes ). This was only closed for the scope of the initial report (DBA s/Data-Services) as ourselves don't have anything to recover/change.