
preload_sites.py script is too slow
Closed, Resolved · Public · BUG REPORT

Description

Steps to replicate the issue:

run pwb preload_sites -worker:25

What happens?:
The script needs up to 2 minutes to complete and collect all sites.

What should have happened instead?:
The script should terminate within 20 seconds. The reason for this slowness is that userinfo, like [1], is requested for each site, and this API call is not cached.
[1] https://en.wikipedia.org/w/api.php?action=query&meta=userinfo&uiprop=blockinfo%7Cgroups%7Chasmsg%7Cratelimits%7Crights&formatversion=2&maxlag=5&format=json
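For scale, a rough back-of-the-envelope sketch shows why one uncached userinfo call per site dominates the runtime. The site count and latency below are illustrative assumptions, not measured values:

```python
# Rough cost model of one uncached userinfo API call per site.
# N_SITES and LATENCY_S are illustrative assumptions, not measurements.
N_SITES = 800          # assumed number of sites preloaded by preload_sites.py
LATENCY_S = 0.15       # assumed round-trip time of one userinfo API call
WORKERS = 25           # as in the reported command: pwb preload_sites -worker:25

sequential = N_SITES * LATENCY_S            # every call waits for the previous one
parallel = N_SITES * LATENCY_S / WORKERS    # ideal speedup with 25 workers

print(f'sequential: {sequential:.0f}s, with {WORKERS} workers: {parallel:.1f}s')
```

Even under the ideal parallel estimate, the per-site call adds seconds of pure latency that caching would remove entirely on subsequent runs.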

Software version:
Pywikibot 8.5.0 - 8.0.4

This issue was introduced with rPWBC891a720

Event Timeline

Xqt triaged this task as High priority. Oct 15 2023, 1:23 PM

Hi @Xqt what tests would you recommend to ensure the change is not causing regression?

I have the code change: https://gerrit.wikimedia.org/r/c/pywikibot/core/+/1033715 where userinfo is being cached and read based on family name and username matches.

While testing, I noticed cookies are added to http.cookie_jar when the api call is actually made to retrieve userinfo but I am unsure if this is the cookie that is meant to be loaded in preload_sites script.

Hi @ericpien: I think the preload_sites script is a good measurement. After collecting the siteinfo (and, since Pywikibot 8.1, the userinfo), the second call should be much faster. Thank you for your patch, but I would suggest using the already implemented CachedRequest to cache the userinfo; it is already used for siteinfo. The advantage is that parallel running tasks will also benefit from it. In the userinfo method, the cached request can be implemented like:

if not hasattr(self, '_userinfo'):
    # Passing expiry turns this into a CachedRequest, so the
    # userinfo response is cached on disk and reused across runs
    # (and by parallel tasks) until it expires.
    uirequest = self._request(
        expiry=1,
        parameters=dict(
            action='query',
            meta='userinfo',
            uiprop='blockinfo|hasmsg|groups|rights|ratelimits',
            formatversion=2,
        ),
    )
    uidata = uirequest.submit()

The 2nd run needs only ~30 seconds with this patch, compared to 8 minutes without it. Maybe we could add a runtime test like this to our test matrix.
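A runtime check along those lines could be sketched as follows. The helper name and the 30-second budget are assumptions; a real test would time the actual preload step instead of the dummy workload:

```python
import time

def assert_runtime(func, budget_s: float) -> float:
    """Run func and fail if it exceeds the wall-clock budget."""
    start = time.monotonic()
    func()
    elapsed = time.monotonic() - start
    assert elapsed <= budget_s, f'{elapsed:.1f}s exceeds {budget_s}s budget'
    return elapsed

# Illustration with a dummy workload standing in for the cached preload run:
assert_runtime(lambda: time.sleep(0.01), budget_s=30)
```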

Change #1034059 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] [IMPR] use CachedRequest for userinfo requests

https://gerrit.wikimedia.org/r/1034059

The regression was introduced with Pywikibot 8.1 (see above). Previous runs were 30 times faster on the first run and 300 times faster on the second. The patch above speeds up the second call, but the API calls seem to run sequentially instead of simultaneously, which would be expected with concurrent.futures.

Pretty sure we need many more workers here. With 5000 workers I get all sites within 30 seconds.
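The effect of the worker count on I/O-bound work can be demonstrated with a small concurrent.futures sketch. The 10 ms sleep is an assumed stand-in for one API round trip:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_api_call(site: int) -> int:
    """Stand-in for one blocking userinfo request (~10 ms latency)."""
    time.sleep(0.01)
    return site

def preload(n_sites: int, workers: int) -> float:
    """Return the wall-clock time to 'preload' n_sites with a thread pool."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_api_call, range(n_sites)))
    return time.monotonic() - start

# With sleep-bound work, more workers cut wall time almost linearly:
print(preload(100, workers=1), preload(100, workers=50))
```

Because the workers mostly wait on network latency, raising the pool size well beyond the CPU count keeps helping until the server or connection pool becomes the bottleneck.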

Change #1034059 merged by jenkins-bot:

[pywikibot/core@master] [IMPR] use CachedRequest for userinfo requests

https://gerrit.wikimedia.org/r/1034059

Change #1035864 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] Revert "[IMPR] use CachedRequest for userinfo requests"

https://gerrit.wikimedia.org/r/1035864

Change #1035864 merged by jenkins-bot:

[pywikibot/core@master] Revert "[IMPR] use CachedRequest for userinfo requests"

https://gerrit.wikimedia.org/r/1035864

Reopened after patch revert due to T365942

Change #1074562 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] pywikibot.scripts: Remove preload_sites.py

https://gerrit.wikimedia.org/r/1074562

Change #1074562 merged by jenkins-bot:

[pywikibot/core@master] pywikibot.scripts: Remove preload_sites.py

https://gerrit.wikimedia.org/r/1074562

Change #1175541 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] cleanup: preload_sites script was removed

https://gerrit.wikimedia.org/r/1175541

Change #1175541 merged by jenkins-bot:

[pywikibot/core@master] cleanup: preload_sites script was removed

https://gerrit.wikimedia.org/r/1175541