
Cyberbot should scan external links on the wikireplicas in a much more efficient manner
Closed, ResolvedPublic

Description

What Cyberbot does seems to be a legitimate query, but it could issue the same queries in a slightly different pattern, probably running 100x faster and consuming far fewer resources.

I can see it executing:

SELECT * FROM externallinks LIMIT 107115000,15000

which takes minutes to execute, as it has to read over 100 million rows before reaching the requested offset.

Large OFFSET values are a well-known anti-pattern for paging in MySQL: the server still scans and discards every skipped row, so the query costs as much as scanning the whole table while returning only 15K records.

If the aim is to do a full externallinks scan, the el_id index can be iterated instead (keyset pagination), e.g.:

SELECT * FROM externallinks WHERE el_id > 107115000 ORDER BY el_id LIMIT 15000;

(then use the last returned el_id as the lower bound for the next query, repeating until no rows are returned).
Each query takes only 0.05 sec. (roughly 6000x faster), and you can scan the full table in a few minutes.
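The loop described above can be sketched as follows. This is a minimal illustration of keyset pagination, not Cyberbot's actual code; the driver, column names beyond el_id, and batch size are assumptions:

```python
import sqlite3  # stand-in for a MySQL driver, same pattern applies


def scan_external_links(conn, batch_size=15000):
    """Iterate over every row of externallinks using keyset pagination:
    seek past the last seen el_id instead of using a growing OFFSET."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT el_id, el_to FROM externallinks "
            "WHERE el_id > ? ORDER BY el_id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break  # no rows returned: the scan is complete
        for row in rows:
            yield row
        last_id = rows[-1][0]  # resume after the last returned id
```

Each batch is an index seek plus a short range read, so the cost per batch stays constant no matter how deep into the table the scan is.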

Please show me your code and I will fix it for you, and you will never have to wait again; plus, you will make more resources available for other users.

Related Objects

Event Timeline

If you want to do it yourself instead, here are some links with more information on efficient paging:
https://www.xarg.org/2011/10/optimized-pagination-using-mysql/
https://stackoverflow.com/a/3799293/342196
https://www.percona.com/files/presentations/ppc2009/PPC2009_mysql_pagination.pdf

If you are more familiar with MediaWiki: it implements this mechanism for both API queries and browser results for things like categories, edit history, contributions, etc.

jcrespo triaged this task as High priority.Mar 31 2018, 7:38 PM

Because not even an acknowledgment has been received in 7 days, and this is likely the cause of a service crash, the account has been temporarily banned from accessing the wikireplicas.

I got no emails about this ticket, and only became aware of it when things broke on my end.

I've disabled the offending task. Please unban the bot as it's affecting a number of services.

You were doing over 6000 full table scans; I don't think it is wise to unblock the task until the code is fixed. Please use https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-externallinks.sql.gz instead. What do you mean by "I've disabled the offending task"? What do you mean by "the code is quite old"?
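Working from the dump keeps a full-table scan off the shared replicas entirely. A rough sketch of that workflow; the local database name and user here are placeholders, not anything from Cyberbot's setup:

```shell
# Download the latest externallinks dump (URL from the comment above)
curl -s -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-externallinks.sql.gz

# Load it into a local MySQL database and run the scan there,
# so the heavy reads never touch the wikireplicas.
zcat enwiki-latest-externallinks.sql.gz | mysql -u bot_user -p local_enwiki
```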

To prevent further disruption, you should use a separate tool (with its own database credentials) for each independent task; alternatively, the Cloud team could probably create a separate user for separate tasks within a single tool, if requested, to prevent situations like this.

In any case, the wikireplicas are right now in a highly degraded state, so probably no further changes will be made until T191149 is fixed, or the replicas will go down entirely for all users.

So, I shut down the script that’s responsible for this and I’m still not allowed to use the replicas?

So now I'm confused. The task that was making these heavy queries has now been disabled, so why are my tool, and consequently Cyberbot, still not allowed to access the replicas?

Please be patient; I will take care of this as soon as the (non-trivial, ongoing) fires on the wikireplicas are resolved, for *all users*. Yours is not the only tool being affected.

Now that all servers are back online, I can manage the users again:

The account has been re-enabled, but it will be disabled or limited if bad patterns are detected again.

Cyberpower678 claimed this task.