make "refreshLinks.php --dfn-only" faster (to work on en.wiki)
Closed, Resolved, Public

Description

Split from T18112#201167 as this is a problem in the script itself.

(Reedy said in T18112#201188)

The original queries take an age, and isn't going to attempt to load it all.

mysql> explain select DISTINCT pl_from from pagelinks LEFT JOIN page ON pl_from=page_id;
+----+-------------+-----------+--------+---------------+---------+---------+--------------------------+-----------+------------------------------+
| id | select_type | table     | type   | possible_keys | key     | key_len | ref                      | rows      | Extra                        |
+----+-------------+-----------+--------+---------------+---------+---------+--------------------------+-----------+------------------------------+
|  1 | SIMPLE      | pagelinks | index  | NULL          | pl_from | 265     | NULL                     | 624327870 | Using index; Using temporary |
|  1 | SIMPLE      | page      | eq_ref | PRIMARY       | PRIMARY | 4       | enwiki.pagelinks.pl_from |         1 | Using index; Distinct        |
+----+-------------+-----------+--------+---------------+---------+---------+--------------------------+-----------+------------------------------+
2 rows in set (0.01 sec)

Removing the DISTINCT would make things simpler. If we kept a client-side count
and removed the DISTINCT, would this work for us?


Version: unspecified
Severity: normal

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:10 AM
bzimport set Reference to bz42180.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to comment #1)

See also bug 36195

Bug 36195 seems like an exact duplicate of this bug. Am I missing something?

(In reply to comment #2)

(In reply to comment #1)

See also bug 36195

Bug 36195 seems like an exact duplicate of this bug. Am I missing something?

Looks to be, one way or another

(In reply to comment #2)

(In reply to comment #1)

See also bug 36195

Bug 36195 seems like an exact duplicate of this bug. Am I missing something?

I never understood. :)
Assuming RobLa actually meant to file his bug under the Wikimedia product, his request is about running the script in some different way on en.wiki, while this bug is about making the script efficient enough that such care is not needed.
If bug 36195 is fixed, in practice nobody will bother to fix this one, which should then be closed; fixing this one would also fix bug 36195.

PleaseStand subscribed.

Uploaded https://gerrit.wikimedia.org/r/#/c/193182/ for review, which should get rid of the "Using temporary".

Change 193182 merged by Aaron Schulz:
refreshLinks.php: Get IDs in batches in deleteLinksFromNonexistent()

https://gerrit.wikimedia.org/r/193182
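
For readers unfamiliar with the batching idea named in the change title, here is a minimal sketch of one way to fetch IDs in batches so that no single query has to de-duplicate the whole pagelinks table. It is not the merged patch: it uses Python with an in-memory SQLite database rather than PHP and MySQL. The function name, the `batch_size` parameter, the `pl_from_idx` index name, the simplified two-column schema, and the `page_id IS NULL` filter are all assumptions made for illustration.

```python
import sqlite3


def make_demo_db():
    """Build a tiny stand-in for the page and pagelinks tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE pagelinks (pl_from INTEGER, pl_title TEXT)")
    conn.execute("CREATE INDEX pl_from_idx ON pagelinks (pl_from)")
    # Pages 1, 2 and 4 exist; links from 3, 5 and 7 are orphaned.
    conn.executemany("INSERT INTO page VALUES (?)", [(1,), (2,), (4,)])
    conn.executemany(
        "INSERT INTO pagelinks VALUES (?, ?)",
        [(1, "A"), (1, "B"), (2, "A"), (3, "A"), (3, "C"), (5, "B"), (7, "A")],
    )
    return conn


def delete_links_from_nonexistent(conn, batch_size=100):
    """Delete pagelinks rows whose pl_from has no page row, one batch of IDs at a time."""
    deleted_total = 0
    start = 0  # highest pl_from already handled; page IDs are assumed positive
    while True:
        # Each query only looks at IDs above `start` and stops after
        # `batch_size` distinct values, so the de-duplication stays small.
        rows = conn.execute(
            "SELECT DISTINCT pl_from FROM pagelinks "
            "LEFT JOIN page ON pl_from = page_id "
            "WHERE pl_from > ? AND page_id IS NULL "
            "ORDER BY pl_from LIMIT ?",
            (start, batch_size),
        ).fetchall()
        if not rows:
            break  # no orphaned IDs left above `start`
        ids = [row[0] for row in rows]
        placeholders = ",".join("?" * len(ids))
        cur = conn.execute(
            f"DELETE FROM pagelinks WHERE pl_from IN ({placeholders})", ids
        )
        deleted_total += cur.rowcount
        start = ids[-1]  # resume after the last ID of this batch
    return deleted_total


if __name__ == "__main__":
    demo = make_demo_db()
    print(delete_links_from_nonexistent(demo, batch_size=2))
```

The design point is that the range condition on `pl_from` together with `ORDER BY ... LIMIT` lets each batch walk the index in order. Whether this avoids "Using temporary" on a given MySQL version would still need to be confirmed with EXPLAIN on the real schema.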

Change 193564 had a related patch set uploaded (by PleaseStand):
refreshLinks.php: Tweak exit condition in deleteLinksFromNonexistent()

https://gerrit.wikimedia.org/r/193564

Change 193564 merged by jenkins-bot:
refreshLinks.php: Tweak exit condition in deleteLinksFromNonexistent()

https://gerrit.wikimedia.org/r/193564

T38195 was closed; now a shell user needs to run the script on en.wiki and see how long it takes. Then T18112 can be fixed for real by adding s1 to misc::maintenance::refreshlinks in operations/puppet.

He7d3r set Security to None.

T38195 was closed; now a shell user needs to run the script on en.wiki and see how long it takes.

About two and a half hours:
real 157m10.770s

Then it should be ok to add to crons.

Change 225569 had a related patch set uploaded (by Alex Monk):
Resume running refreshLinks cron on enwiki

https://gerrit.wikimedia.org/r/225569

Change 225569 merged by Ori.livneh:
Resume running refreshLinks cron on enwiki

https://gerrit.wikimedia.org/r/225569

Script needs to be rerun after T107632 is fixed.