
Entries in revision table where rev_len is NULL
Open, LowPublic

Description

There are entries in the revision table on enwiki where rev_len is NULL. This is rather silly, and it currently causes the {{PAGESIZE:}} parser function to return unexpected results for affected pages (T20998). We should probably fix them. (Originally reported at https://en.wikipedia.org/wiki/Template_talk:Ifexist_not_redirect#Bug:_Expression_error:_Unexpected_.3E_operator.)

I did some investigating and it looks like this was caused by a bug in 429ae88d (part of the ContentHandler branch, merged into master in 329d5b35) and resolved in 4fb4ad91. Thus affected revisions in production should be limited to the window between 1.21wmf2 and 1.25wmf2 (2012-10-15 to 2014-10-10), and the correct rev_len value for them all is simply 0.
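For concreteness, a minimal sketch of the targeted fix this analysis implies, assuming every affected revision really falls in that window and really should get rev_len = 0. The one-shot UPDATE is illustrative only and is not how it would actually be run against production (see below); rev_timestamp uses MediaWiki's usual binary(14) YYYYMMDDHHMMSS format.

  -- Illustrative only: an unbatched UPDATE like this would not be run
  -- against production; the real fix goes through the maintenance
  -- script in batches.
  UPDATE revision
     SET rev_len = 0
   WHERE rev_len IS NULL
     AND rev_timestamp BETWEEN '20121015000000' AND '20141010235959';
  -- With no index on rev_len, even this scans the whole table.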

This would be trivial to fix with a maintenance script (we even have one, might need tweaks though: populateRevisionLength.php), but since there are no useful indexes, this would require a full table scan and probably take days.

Details

Related Gerrit Patches:
populateRevisionLength: Read from slaves when possible (https://gerrit.wikimedia.org/r/290142)
Event Timeline

matmarex created this task. May 16 2016, 6:03 PM
Restricted Application added subscribers: Zppix, Aklapper. May 16 2016, 6:03 PM

> This would be trivial to fix with a maintenance script (we even have one, might need tweaks though: populateRevisionLength.php), but since there are no useful indexes, this would require a full table scan and probably take days.

I don't think it would take days. If there's only a (relatively) small number of revisions, you only have to do the rather expensive filesorting query a couple of times. So using a query like SELECT * FROM revision WHERE rev_len IS NULL ORDER BY rev_id LIMIT 1000 to get batches would be fine, particularly if it was done against the "vslow" slave. As long as there aren't like a million such entries, this wouldn't take all that long to do.
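To make the shape of that batching concrete, here is a sketch of the query (the column list is illustrative). Note it needs no cursor: assuming the returned rows have their rev_len populated before the next run, the same query naturally advances, because fixed rows no longer match.

  SELECT rev_id, rev_page
    FROM revision
   WHERE rev_len IS NULL
   ORDER BY rev_id
   LIMIT 1000;
  -- No filesort: MySQL can walk the rev_id primary key in order. But it
  -- still has to examine every row until it finds 1000 with rev_len IS
  -- NULL, so a batch can be slow when matching rows are sparse.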

>> This would be trivial to fix with a maintenance script (we even have one, might need tweaks though: populateRevisionLength.php), but since there are no useful indexes, this would require a full table scan and probably take days.
>
> I don't think it would take days. If there's only a (relatively) small number of revisions, you only have to do the rather expensive filesorting query a couple of times. So using a query like SELECT * FROM revision WHERE rev_len IS NULL ORDER BY rev_id LIMIT 1000 to get batches would be fine, particularly if it was done against the "vslow" slave. As long as there aren't like a million such entries, this wouldn't take all that long to do.

There isn't actually a filesort involved. There are just very few rows with this bug (a couple per day in the timespan known to be affected). A query like that could take a couple hours to return the first result. I don't really know if that would be okay on slaves, but my gut reaction is "oh god no". But if it is okay, that'd be great.

The script will currently do something more akin to SELECT * FROM revision WHERE rev_len IS NULL AND rev_id >= 0 AND rev_id < 1000 (increasing the rev_id range until it has gone through the whole table), which takes a couple of seconds each. Assuming each query for 1000 rows takes 4 seconds (that looks ±50% correct from my tests on Analytics slaves), times 721549703 rev_ids on enwiki, gives us about a month.
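Spelling that estimate out (the query shape is what the script generates; the timing is from the tests mentioned above):

  -- One range batch; the script then advances to rev_id 1000-2000, etc.
  SELECT rev_id, rev_page
    FROM revision
   WHERE rev_len IS NULL
     AND rev_id >= 0 AND rev_id < 1000;
  -- 721549703 rev_ids / 1000 ids per batch ≈ 721550 queries
  -- 721550 queries × 4 s ≈ 2886200 s ≈ 33 days, i.e. about a month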

In the meantime I'll change the script to do these queries on slaves…

Change 290142 had a related patch set uploaded (by Bartosz Dziewoński):
populateRevisionLength: Read from slaves when possible

https://gerrit.wikimedia.org/r/290142

> There isn't actually a filesort involved. There are just very few rows with this bug (a couple per day in the timespan known to be affected). A query like that could take a couple hours to return the first result. I don't really know if that would be okay on slaves, but my gut reaction is "oh god no". But if it is okay, that'd be great.

Right, so it's hitting a significant number of rows. It's probably not OK on a normal slave. But if the slave is a "vslow" slave, that will probably be fine (certainly this is not anywhere near as bad as the stuff that the query pages do).

Change 290142 merged by jenkins-bot:
populateRevisionLength: Read from slaves when possible

https://gerrit.wikimedia.org/r/290142

@Volans @jcrespo How do you feel about running populateRevisionLength.php --batch-size=10000 --force across all wikis some time? It might take a month or so. Alternatively, how do you feel about @Bawolff's idea above? That might take less time overall, but individual queries will be very long.

I recommend reading T135278, which is very different from this, but they had similar needs: a maintenance script that reads the whole revision table and gets a few results.

I have not fully checked the script, but it should (a sketch of one batch follows this list):
a) use your proposed rev_id >= 0 AND rev_id < 1000 restriction for proper indexing (doing ORDER ... LIMIT would do a full table scan every time)
b) use the vslow slave for all reads and only the master for writes
c) regularly re-check the database configuration: the vslow node could change many times over the course of a month, for many reasons, so every run with the same config should take no more than a few hours
d) wait for replication lag after every run
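A rough sketch of one batch combining (a)-(d); :batch_start and :batch_end are hypothetical placeholders, and steps (c) and (d) happen in the maintenance script between batches rather than in SQL:

  -- (b) read on the vslow slave:
  SELECT rev_id
    FROM revision
   WHERE rev_len IS NULL
     AND rev_id >= :batch_start AND rev_id < :batch_end;  -- (a) primary-key range
  -- ...then issue the corresponding UPDATEs on the master only,
  -- (d) wait for replication lag to clear after the writes, and
  -- (c) re-check which host is currently vslow before the next batch.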

Do not underestimate doing a full scan of the revision table: for enwiki it is a 200 GB table, and once it starts reading from disk, it will be very slow.

matmarex claimed this task. Jul 18 2016, 8:14 PM
matmarex triaged this task as Low priority.
matmarex removed matmarex as the assignee of this task. Jan 10 2017, 11:14 PM

I can't seem to find the time to look into this myself.

1978Gage2001 moved this task from Triage to In progress on the DBA board. Dec 11 2017, 9:46 AM
Marostegui moved this task from In progress to Triage on the DBA board. Dec 11 2017, 10:57 AM
Marostegui edited projects, added Wikimedia-Rdbms; removed DBA. Oct 1 2018, 7:30 AM
Restricted Application added a subscriber: Liuxinyu970226. Jul 18 2019, 9:43 PM