
APIError: too-many-titles with -start: page generator
Closed, Resolved, Public

Description

Found while looking at T209094 (weblinkchecker.py: TypeError: 'unicode' object is not callable), where only the API warning for this issue was raised.
Logged in as a regular (non-bot) user.

$ ./pwb.py weblinkchecker -start:! -lang:en
Retrieving 240 pages from wikipedia:en.
WARNING: API error too-many-titles: Too many values supplied for parameter "titles". The limit is 50.

0 pages read
0 pages written
Execution time: 2 seconds
Script terminated by exception:

ERROR: APIError: too-many-titles: Too many values supplied for parameter "titles". The limit is 50. [help:See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.]
Saving history...
Traceback (most recent call last):
  File "./pwb.py", line 257, in <module>
    if not main():
  File "./pwb.py", line 250, in main
    run_python_file(filename, [filename] + args, argvu, file_package)
  File "./pwb.py", line 119, in run_python_file
    main_mod.__dict__)
  File "./scripts/weblinkchecker.py", line 1055, in <module>
    main()
  File "./scripts/weblinkchecker.py", line 1017, in main
    bot.run()
  File "pywikibot/bot.py", line 1477, in run
    for item in self.generator:
  File "pywikibot/pagegenerators.py", line 1747, in RedirectFilterPageGenerator
    for page in generator or []:
  File "pywikibot/pagegenerators.py", line 2192, in PreloadingGenerator
    for i in site.preloadpages(group, groupsize):
  File "pywikibot/site.py", line 3406, in preloadpages
    for pagedata in rvgen:
  File "pywikibot/data/api.py", line 3148, in __iter__
    for result in super(PropertyGenerator, self).__iter__():
  File "pywikibot/data/api.py", line 2972, in __iter__
    self.data = self.request.submit()
  File "pywikibot/data/api.py", line 2273, in submit
    raise APIError(**result['error'])
pywikibot.data.api.APIError: too-many-titles: Too many values supplied for parameter "titles". The limit is 50. [help:See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.]
<class 'pywikibot.data.api.APIError'>
CRITICAL: Closing network session.

Latest git version, Python 2.7.15rc1

Event Timeline

Same error when logged in as a bot user:

tools.framabot@tools-bastion-02:~$ ./pwb.py version
Pywikibot: [https] r-pywikibot-core.git (7ea6fba, g1, 2018/11/05, 17:45:43, ok)
Release version: 3.1.dev0
requests version: 2.2.1
  cacerts: /etc/ssl/certs/ca-certificates.crt
    certificate test: ok
Python: 2.7.6 (default, Nov 23 2017, 15:49:48) 
[GCC 4.8.4]
Toolforge hostname: tools-bastion-02
PYWIKIBOT_DIR: Not set
PYWIKIBOT_DIR_PWB: .
PYWIKIBOT_NO_USER_CONFIG: Not set
Config base dir: /data/project/framabot/.pywikibot
Usernames for family "wikinews":
	*: Framabot (no sysop configured)
Usernames for family "wikiquote":
	*: Framabot (no sysop configured)
Usernames for family "wikipedia":
	*: Framabot (no sysop configured)
Usernames for family "meta":
	*: Framabot (no sysop configured)
Usernames for family "wikidata":
	*: Framabot (no sysop configured)
Usernames for family "wikisource":
	*: Framabot (no sysop configured)
Usernames for family "wiktionary":
	*: Framabot (no sysop configured)
Usernames for family "commons":
	*: Framabot (no sysop configured)
Usernames for family "wikivoyage":
	*: Framabot (no sysop configured)
Usernames for family "wikiversity":
	*: Framabot (no sysop configured)
Usernames for family "wikibooks":
	*: Framabot (no sysop configured)
Xqt triaged this task as High priority. (Nov 9 2018, 4:06 PM)

Cannot reproduce it with either Python 3 or Python 2:

C:\pwb\GIT\core>py -3 pwb.py weblinkchecker -start:! -simulate -user:xqt -lang:en
Retrieving 240 pages from wikipedia:en.


>>> !! <<<


>>> !!! <<<
C:\pwb\GIT\core>py -3 pwb.py version
Pywikibot: [ssh] pywikibot-core (69d0032, g10396, 2018/11/13, 17:21:12, ok)
Release version: 3.1.dev0
requests version: 2.19.1
  cacerts: C:\python37\lib\site-packages\certifi\cacert.pem
    certificate test: ok
Python: 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)]
C:\pwb\GIT\core>py -2 pwb.py version
Pywikibot: [ssh] pywikibot-core (69d0032, g10396, 2018/11/13, 17:21:12, ok)
Release version: 3.1.dev0
requests version: 2.9.1
  cacerts: C:\Python27\lib\site-packages\certifi\cacert.pem
    certificate test: ok
Python: 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)]

The problem is that in site.preloadpages(), max_ids is computed only after the page list has already been split into chunks of 240, so the chunk size never takes max_ids into account.

for sublist in itergroup(pagelist, groupsize):  # <-- groupsize = 240
    # Do not use p.pageid property as it will force page loading.
    pageids = [str(p._pageid) for p in sublist
               if hasattr(p, '_pageid') and p._pageid > 0]
    cache = {}
    # In case of duplicates, return the first entry.
    for priority, page in enumerate(sublist):
        try:
            cache.setdefault(page.title(with_section=False),
                             (priority, page))
        except pywikibot.InvalidTitle:
            pywikibot.exception()

    prio_queue = []
    next_prio = 0
    rvgen = api.PropertyGenerator(props, site=self)
    rvgen.set_maximum_items(-1)  # suppress use of "rvlimit" parameter

    parameter = self._paraminfo.parameter('query+info', 'prop')
    if self.logged_in() and self.has_right('apihighlimits'):  # <-- False
        max_ids = int(parameter['highlimit'])
    else:
        max_ids = int(parameter['limit'])  # T78333, T161783  <-- max_ids = 50

    if len(pageids) == len(sublist) and len(set(pageids)) <= max_ids:  # <-- False
        # only use pageids if all pages have them
        rvgen.request['pageids'] = set(pageids)
    else:
        rvgen.request['titles'] = list(cache.keys())  # <-- len(cache) = 240: PROBLEM!
    rvgen.request['rvprop'] = rvprop
    pywikibot.output('Retrieving %s pages from %s.'
                     % (len(cache), self))
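
To make the annotated values concrete, here is a minimal standalone sketch of the batching arithmetic. It does not call Pywikibot: `chunked` is a hypothetical stand-in for `itergroup`, `oversized_batches` is an illustrative helper, and the limit of 50 is taken from the error message in the description.

```python
from itertools import islice

# Limit reported in the API error for users without 'apihighlimits'.
API_TITLES_LIMIT = 50


def chunked(iterable, size):
    """Yield lists of up to `size` items (hypothetical stand-in for itergroup)."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def oversized_batches(titles, groupsize, limit=API_TITLES_LIMIT):
    """Return the sizes of all batches that would exceed the API limit."""
    return [len(batch) for batch in chunked(titles, groupsize)
            if len(batch) > limit]


titles = ['Page %d' % i for i in range(300)]
# With groupsize=240, both the full batch and the remainder exceed 50 titles.
print(oversized_batches(titles, groupsize=240))  # [240, 60]
# With groupsize equal to the limit, no batch is too large.
print(oversized_batches(titles, groupsize=50))   # []
```

Any batch reported by `oversized_batches` corresponds to a request that would trigger the too-many-titles error when the titles branch is taken.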

Change 473940 had a related patch set uploaded (by Mpaa; owner: Mpaa):
[pywikibot/core@master] [FIX] site.preloadpages: split pagelist in max_ids maximum

https://gerrit.wikimedia.org/r/473940

Change 473940 merged by jenkins-bot:
[pywikibot/core@master] [FIX] site.preloadpages: split pagelist in at most max_ids elements

https://gerrit.wikimedia.org/r/473940
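
The merged change is described here only by its commit title, so the following is a rough sketch of that idea rather than the actual patch. It assumes the group size is simply capped at max_ids before the page list is split; `capped_groups` is a hypothetical helper, not part of Pywikibot.

```python
def capped_groups(pagelist, groupsize, max_ids):
    """Yield sublists of at most min(groupsize, max_ids) elements.

    Hypothetical helper: caps the chunk size before splitting, so no
    single request can carry more items than the API allows.
    """
    size = min(groupsize, max_ids)
    if size <= 0:
        raise ValueError('chunk size must be positive')
    for start in range(0, len(pagelist), size):
        yield pagelist[start:start + size]


pages = ['Page %d' % i for i in range(240)]
sizes = [len(group) for group in capped_groups(pages, groupsize=240, max_ids=50)]
print(sizes)  # [50, 50, 50, 50, 40]
```

With the values from this task (groupsize = 240, max_ids = 50), every request carries at most 50 titles, at the cost of five requests instead of one.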