
Maintenance.php function purgeRedundantText() not able to deal with big data set
Open, Medium, Public

Description

deleteOldRevisions.php calls Maintenance.php function purgeRedundantText().

Running it on a big DB (more than 200,000 pages), the script dies as follows:

The database returned the following error: "1153: Got a packet bigger than
'max_allowed_packet' bytes (localhost)".

I think the SQL requests that include "NOT IN ($set)" or "IN ($set)" are responsible for that: the query text grows with every ID in the list, so it can only hold a few hundred or thousand IDs before exceeding max_allowed_packet.

Please confirm if I'm right.


Version: 1.16.x
Severity: normal
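
For a sense of scale, the query text grows linearly with the number of IDs in the list, so on a large wiki it dwarfs MySQL's max_allowed_packet. A back-of-envelope sketch (the row count and ID width below are made-up, purely illustrative):

```php
<?php
// Back-of-envelope estimate of the SQL text size for a giant IN/NOT IN list.
// The numbers are hypothetical, chosen only to illustrate the scale.
$rowCount   = 10000000; // assume ~10 million text rows
$bytesPerId = 9;        // e.g. "12345678," -> 8 digits plus a comma
$listBytes  = $rowCount * $bytesPerId;
printf( "IN-list size: ~%.0f MB\n", $listBytes / ( 1024 * 1024 ) );
// Prints roughly 86 MB, far beyond the few-MB max_allowed_packet defaults
// typical of MySQL installs of that era.
```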

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:50 PM
bzimport set Reference to bz20651.
bzimport added a subscriber: Unknown Object (MLST).

Chad confirmed this. Basically, the function needs to be refactored so that it handles the data in smaller chunks.
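
A minimal sketch of such a chunked approach, assuming an IDatabase handle ($dbw) like the one maintenance scripts get from getDB(), and the pre-MCR schema (revision.rev_text_id / archive.ar_text_id pointing at text.old_id) that purgeRedundantText() worked against at the time. The function name and batching details are illustrative, not the actual patch:

```php
<?php
// Illustrative only: a chunked variant of the purge, not the code from the
// Gerrit change. Assumes $dbw is an IDatabase handle and the pre-MCR schema
// where revision.rev_text_id / archive.ar_text_id reference text.old_id.
function purgeRedundantTextChunked( $dbw, $batchSize = 1000, $delete = true ) {
	$start = 0;
	do {
		// Walk the text table in primary-key order, one small batch at a time.
		$textIds = $dbw->selectFieldValues(
			'text',
			'old_id',
			[ 'old_id > ' . intval( $start ) ],
			__METHOD__,
			[ 'ORDER BY' => 'old_id', 'LIMIT' => $batchSize ]
		);
		if ( !$textIds ) {
			break;
		}
		$start = (int)end( $textIds );

		// Each IN list is at most $batchSize entries, so no query ever
		// approaches max_allowed_packet.
		$usedByRevisions = $dbw->selectFieldValues(
			'revision', 'rev_text_id', [ 'rev_text_id' => $textIds ], __METHOD__
		);
		$usedByArchive = $dbw->selectFieldValues(
			'archive', 'ar_text_id', [ 'ar_text_id' => $textIds ], __METHOD__
		);
		$unused = array_diff( $textIds, $usedByRevisions, $usedByArchive );

		if ( $unused && $delete ) {
			$dbw->delete( 'text', [ 'old_id' => array_values( $unused ) ], __METHOD__ );
		}
	} while ( count( $textIds ) === $batchSize );
}
```

The point is just that both the candidate IDs and the reference checks are bounded by $batchSize, so query size and memory use stay flat no matter how big the wiki is.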

Ciencia_Al_Poder subscribed.

Ugh, just found this problem. It wasn't breaking, but nukePage was taking a long time to delete a single page, so I looked at the source code to see why.

Jesus Christ. What I found is horrible. This is Row By Agonizing Row programming, except that it copies everything into memory and then constructs an old_id NOT IN ( <insert several million comma-separated integers here> ) clause, expecting the server not to choke on such a big query.

I'd like to see what happens if someone runs nukePage.php on the English Wikipedia database... This is still current as of MediaWiki 1.31: https://phabricator.wikimedia.org/source/mediawiki/browse/REL1_31/maintenance/Maintenance.php$1268
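
For context, the pattern being criticised looks roughly like this (a paraphrase of the idea, not the literal REL1_31 code; $dbw stands for the script's database handle):

```php
<?php
// Paraphrase of the problematic pattern, not the literal REL1_31 code.
// 1) Pull every referenced text ID into a PHP array...
$res = $dbw->select( 'revision', 'rev_text_id', [], __METHOD__ );
$cur = [];
foreach ( $res as $row ) {
	$cur[] = $row->rev_text_id; // millions of IDs accumulate in memory
}
// 2) ...then glue them all into a single NOT IN (...) clause, producing
// one query whose text grows with the whole wiki's history.
$set = implode( ',', $cur );
$old = $dbw->select( 'text', 'old_id', [ "old_id NOT IN ( $set )" ], __METHOD__ );
```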

Change 649123 had a related patch set uploaded (by Ammarpad; owner: Ammarpad):
[mediawiki/core@master] Make Maintenance::purgeRedundantText() resilient to large data set

https://gerrit.wikimedia.org/r/649123

Change 649123 abandoned by Ammarpad:

[mediawiki/core@master] Make Maintenance::purgeRedundantText() resilient to large data set

Reason:

old patch

https://gerrit.wikimedia.org/r/649123