PAWS gives API-errors on some cases
Open, Needs Triage · Public · BUG REPORT

Description

I learned on Telegram that PAWS was updated today. Great service! :-) It had some quirks this afternoon, but I cannot work around the error below.
It works with regular categories, but it has trouble with categories outside the main space...

Steps to Reproduce:

import pywikibot

print('ok')
site=pywikibot.Site('nl','wikipedia')
repo=site.data_repository()

catp=pywikibot.Page(site,'Categorie:Wikipedia:Doorverwijspagina')
wdc=catp.data_item()
wdc.get()
print('done')

Actual Results:
WARNING: API error mwoauth-invalid-authorization-invalid-user: The authorization headers in your request are for a user that does not exist here

---------------------------------------------------------------------------
NoUsername                                Traceback (most recent call last)
<ipython-input-6-ba6d8c3874f3> in <module>
      6 
      7 catp=pywikibot.Page(site,'Categorie:Wikipedia:Doorverwijspagina')
----> 8 wdc=catp.data_item()
      9 wdc.get()
     10 print('done')

Expected Results:
This worked till yesterday...

Event Timeline

My best guess is that this is similar to T168222.
Not PAWS related but pywikibot hitting other wikis from the wikidata item.

Restricted Application added a subscriber: pywikibot-bugs-list.

@RhinosF1 precisely, but the code should only hit wikidata and nl wikipedia.

I haven’t investigated yet, but I guess it hits all sites that appear in the Wikidata sitelinks.

This worked till yesterday...

There weren’t any changes in the Pywikibot stable release. Two wikis were added.

Edoderoo wrote:
I learned on Telegram that PAWS was updated today

What was updated there?

I have noticed two things.

  1. If you visit the page via a browser, from that point on the wiki becomes visible in https://meta.wikimedia.org/wiki/Special:CentralAuth?target=Chicocvenancio
  2. In api.py, it makes a difference whether a GET or a POST request is made:
def _use_get(self):
    """Verify whether 'get' is to be used."""
    if (not config.enable_GET_without_SSL
            and self.site.protocol() != 'https'
            or self.site.is_oauth_token_available()):  # T108182 workaround
        use_get = False
    elif self.use_get is None:
        if self.action == 'query':
            # for queries check the query module
            modules = set()
            for mod_type_name in ('list', 'prop', 'generator'):
                modules.update(self._params.get(mod_type_name, []))
        else:
            modules = {self.action}
        if modules:
            self.site._paraminfo.fetch(modules)
            use_get = all('mustbeposted' not in self.site._paraminfo[mod]
                          for mod in modules)
        else:
            # If modules is empty, just 'meta' was given, which doesn't
            # require POSTs, and is required for ParamInfo
            use_get = True
    else:
        use_get = self.use_get
    return use_get

self.site.is_oauth_token_available() is always True for all sites because PAWS sets in user-config.py:

# If OAuth integration is available, take it
if 'CLIENT_ID' in os.environ:
    authenticate['*'] = (
        os.environ['CLIENT_ID'],
        os.environ['CLIENT_SECRET'],
        os.environ['ACCESS_KEY'],
        os.environ['ACCESS_SECRET']
    )

So the api request is done with use_get = False, e.g.

API request to wikiquote:az (uses get: False):
Headers: {'Content-Type': 'application/x-www-form-urlencoded'}
URI: '/w/api.php'
Body: 'action=query&meta=siteinfo%7Cuserinfo&siprop=namespaces%7Cnamespacealiases%7Cgeneral&continue=&uiprop=blockinfo%7Chasmsg&maxlag=5&format=json'
API response received from wikiquote:az:
{"error":{"code":"mwoauth-invalid-authorization-invalid-user","info":"The authorization headers in your request are for a user that does not exist here", ...

If I change PAWS user-config.py and remove oAuth, it will use use_get = True and the script will not fail, e.g.

API request to wikiquote:az (uses get: True):
Headers: {'Content-Type': 'application/x-www-form-urlencoded'}
URI: '/w/api.php?action=query&meta=siteinfo%7Cuserinfo&siprop=namespaces%7Cnamespacealiases%7Cgeneral&continue=&uiprop=blockinfo%7Chasmsg&maxlag=5&format=json'
Body: None
API response received from wikiquote:az:
{"batchcomplete":"","query":{"namespaces": ...

Why GET/POST make a difference with OAuth is explained here: T108182
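
To make the branch order in _use_get() easier to see, here is a minimal standalone model of it. It is only a sketch: decide_use_get and its parameters are hypothetical names, and the paraminfo lookup is reduced to a single must_be_posted flag. It is not Pywikibot code.

```python
def decide_use_get(enable_get_without_ssl, protocol, oauth_available,
                   explicit_use_get=None, must_be_posted=False):
    """Simplified model of the branch order in the quoted _use_get()."""
    # `not A and B or C` parses as `(not A and B) or C`, so an available
    # OAuth token forces POST regardless of the other settings.
    if (not enable_get_without_ssl and protocol != 'https') or oauth_available:
        return False
    if explicit_use_get is None:
        # Stand-in for the paraminfo lookup: GET unless a module must be posted.
        return not must_be_posted
    return explicit_use_get


# PAWS-like setup: https, OAuth token reported for every site.
print(decide_use_get(True, 'https', oauth_available=True))   # False
# Standalone client with password login: no OAuth token.
print(decide_use_get(True, 'https', oauth_available=False))  # True
```

With a wildcard OAuth entry in user-config.py, the first branch is taken for every site, so the later branches are never reached.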

Point 2) might explain why the behavior differs between PAWS and a local client (probably different "authenticate" dict settings).

So the issue is probably how to register on a site for the first time via Pywikibot and reproduce the CentralAuth flow?

At the end of the script I have added the following credential status reporting:

for site in sorted(pywikibot._sites.values()):
    print(site, site.username(), site.is_oauth_token_available(), site.logged_in())

Result for standalone Pywikibot client:

commons:commons GeertivpBot False False
wikidata:wikidata GeertivpBot False True

Result for PAWS Pywikibot script:
commons:commons GeertivpBot True False
wikidata:wikidata GeertivpBot True False

There is clearly a difference in credential status between a standalone Pywikibot and the PAWS Pywikibot. This can explain the difference in behaviour when (implicitly) accessing a P18 on Wikimedia Commons. The script is not intentionally accessing the P18 image.

The previous reply would suggest that PAWS should use GET instead of POST. But it is not clear to me how to enforce that. It is suggesting a special OAuth setup in PAWS user-config.py.

This is an issue in pywikibot.
The same error can be reproduced in local client, configuring as appropriate in user-config.py.

The chain of events should be:

  • if the user does not exist on a given site (which is what triggers the error today) then
  • the user should be created on that site (e.g. with site.login(autocreate=True); this works when using a password as the authentication method)
  • the site will then appear under "Global account information" linked above
  • at that point even a POST request would work

Now it does not work as user does not exist and POST request (which is enforced by OAuth) fails.
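
As a rough way to see in advance where the account is missing, the list of attached wikis can be compared with the wikis a script will touch. The sketch below assumes the parsed JSON of an action=query&meta=globaluserinfo&guiprop=merged API response (the CentralAuth module behind "Global account information"). The function name and the sample data are hypothetical and hand-written for illustration.

```python
def wikis_without_account(globaluserinfo, needed_dbnames):
    """Return the needed wikis where the global account is not attached.

    `globaluserinfo` is assumed to be the parsed JSON of an
    action=query&meta=globaluserinfo&guiprop=merged API response.
    """
    merged = globaluserinfo['query']['globaluserinfo'].get('merged', [])
    attached = {entry['wiki'] for entry in merged}
    return sorted(set(needed_dbnames) - attached)


# Hypothetical, hand-written response for illustration only.
sample = {'query': {'globaluserinfo': {
    'name': 'ExampleBot',
    'merged': [{'wiki': 'commonswiki'}, {'wiki': 'wikidatawiki'}],
}}}
print(wikis_without_account(sample, ['commonswiki', 'frwikisource']))
# ['frwikisource']
```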

But the GeertivpBot user appears on commons.wikimedia.org => Already created at 16 jul 2020 09:08
See https://www.wikidata.org/wiki/Special:CentralAuth/GeertivpBot
I still do not understand what happens?

Hhmm ... could you try this with no user-config.py in your base dir in PAWS?

import pywikibot
site = pywikibot.Site('commons', 'commons')
site.login()
site.logged_in()

I deleted the PAWS base dir user-config.py. Executing the above code.
The Commons login works:

True

for site in sorted(pywikibot._sites.values()):
    print(site, site.username(), site.is_oauth_token_available(), site.logged_in())

commons:commons GeertivpBot True True

Still the same problem when accessing a Wikidata item that has a P18 image registered -> image is not explicitly accessed... already a failure on the item.get() instruction

import pywikibot
from pywikibot import pagegenerators as pg
querytxt = 'SELECT DISTINCT ?item WHERE { ?Joseph_von_Hammer_Purgstall wdt:P921 wd:Q89546, ?item. }'
wikidata_site = pywikibot.Site("wikidata", "wikidata")
generator = pg.WikidataSPARQLPageGenerator(querytxt, site=wikidata_site)
for item in generator:
    item.get()
WARNING: API error mwoauth-invalid-authorization-invalid-user: The authorization headers in your request are for a user that does not exist here

https://public.paws.wmcloud.org/63056561/Untitled5.ipynb

Right at the end there:

Failed OAuth authentication for wikisource:fr

This query is somehow hitting French Wikisource, and the bot probably doesn't have an account there.

What I did:

Now the item.get() above works... thanks for your clarification, Chicocvenancio
So the problem does not have to do with P18 referring to Commons... (because I did already have an account there)

The following problems are still not explained:

  1. The item Q89546 has also an entry for de.wikisource where I do not have an account and it still works in PAWS...?
  2. For a standalone Pywikibot client the item.get() still does always work, even if the user accounts don't exist on the implicitly referenced wikis
  3. Why does PAWS require the manual creation of all the "foreign" accounts?
  4. Why is this not required for de.wikisource?

Change 629842 had a related patch set uploaded (by Mpaa; owner: Mpaa):
[pywikibot/core@master] OAuth: minimize use of POST with OAuth

https://gerrit.wikimedia.org/r/629842

@Geertivp Why does PAWS require the accounts?

PAWS itself does not require the accounts; Pywikibot does when it is authenticated with OAuth (it is a bug). PAWS authenticates with OAuth for security reasons: files and even notebook cell outputs are public by default. Even if you hide them with permissions, they would still be accessible to a wide number of users with administrative permissions in PAWS or Cloud Services as a whole, not to mention the possibility of a vulnerability.

The following problems are still not explained:

  1. The item Q89546 has also an entry for de.wikisource where I do not have an account and it still works in PAWS...?
  2. For a standalone Pywikibot client the item.get() still does always work, even if the user accounts don't exist on the implicitly referenced wikis
  3. Why does PAWS require the manual creation of all the "foreign" accounts?
  4. Why is this not required for de.wikisource?
  1. Only SiteLinks with namespaces will generate a Site object (do not ask me the logic behind it in Wikibase handling), in this case:
('arwiki', pywikibot.page.SiteLink('arwiki', 'جوزيف فون هامر-برجشتال', []))
('arzwiki', pywikibot.page.SiteLink('arzwiki', 'چوزيف فون هامر برجشتال', []))
('cawiki', pywikibot.page.SiteLink('cawiki', 'Joseph von Hammer-Purgstall', []))
('commonswiki', pywikibot.page.SiteLink('commonswiki', 'Category:Joseph von Hammer-Purgstall', []))  <-- *****
('cswiki', pywikibot.page.SiteLink('cswiki', 'Joseph von Hammer-Purgstall', []))
('dawiki', pywikibot.page.SiteLink('dawiki', 'Joseph von Hammer-Purgstall', []))
('dewiki', pywikibot.page.SiteLink('dewiki', 'Joseph von Hammer-Purgstall', []))
('dewikisource', pywikibot.page.SiteLink('dewikisource', 'Joseph von Hammer-Purgstall', []))
('enwiki', pywikibot.page.SiteLink('enwiki', 'Joseph von Hammer-Purgstall', []))
('eowiki', pywikibot.page.SiteLink('eowiki', 'Joseph von Hammer-Purgstall', []))
('eswiki', pywikibot.page.SiteLink('eswiki', 'Joseph von Hammer-Purgstall', []))
('fawiki', pywikibot.page.SiteLink('fawiki', 'یوزف فون هامر', []))
('fawikiquote', pywikibot.page.SiteLink('fawikiquote', 'جوزف فون هامر', []))
('frwiki', pywikibot.page.SiteLink('frwiki', 'Joseph von Hammer-Purgstall', []))
('frwikisource', pywikibot.page.SiteLink('frwikisource', 'Auteur:Joseph von Hammer-Purgstall', []))  <-- *****
('huwiki', pywikibot.page.SiteLink('huwiki', 'Joseph von Hammer-Purgstall', []))
('itwiki', pywikibot.page.SiteLink('itwiki', 'Joseph von Hammer-Purgstall', []))
('itwikisource', pywikibot.page.SiteLink('itwikisource', 'Autore:Joseph von Hammer-Purgstall', []))  <-- *****
('jawiki', pywikibot.page.SiteLink('jawiki', 'ジョセフ・フォン・ハンマー・プルグスタル', []))
('kawiki', pywikibot.page.SiteLink('kawiki', 'იოზეფ ფონ ჰამერ-პურგშტალი', []))
('kowiki', pywikibot.page.SiteLink('kowiki', '요제프 폰 하머푸르크시탈', []))
('ruwiki', pywikibot.page.SiteLink('ruwiki', 'Хаммер-Пургшталь, Йозеф фон', []))
('srwiki', pywikibot.page.SiteLink('srwiki', 'Јозеф фон Хамер-Пургштал', []))
('svwiki', pywikibot.page.SiteLink('svwiki', 'Joseph von Hammer-Purgstall', []))
('trwiki', pywikibot.page.SiteLink('trwiki', 'Joseph von Hammer-Purgstall', []))
('trwikiquote', pywikibot.page.SiteLink('trwikiquote', 'Joseph von Hammer-Purgstall', []))
('ukwiki', pywikibot.page.SiteLink('ukwiki', 'Йозеф фон Гаммер-Пургшталь', []))
  2. You are probably using password-based authentication.
  3. Because it is OAuth-based and, as explained above, Pywikibot has a bug/limitation.
  4. See 1.
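
The observation in answer 1 can be approximated without Pywikibot by looking for a "Prefix:" part in each sitelink title. This is only a heuristic sketch with a hypothetical helper name. A colon does not prove that a title is in a namespace, and the real decision is made inside Pywikibot's Wikibase handling. The titles are taken from the sitelink dump above.

```python
def sitelinks_with_prefix(sitelinks):
    """Return dbnames whose title contains a 'Prefix:' part.

    A colon is only a rough heuristic for "has a namespace"; a real check
    would compare the prefix against each site's namespace names.
    """
    return [dbname for dbname, title in sitelinks.items()
            if ':' in title and not title.startswith(':')]


sitelinks = {
    'commonswiki': 'Category:Joseph von Hammer-Purgstall',
    'dewikisource': 'Joseph von Hammer-Purgstall',
    'enwiki': 'Joseph von Hammer-Purgstall',
    'frwikisource': 'Auteur:Joseph von Hammer-Purgstall',
    'itwikisource': 'Autore:Joseph von Hammer-Purgstall',
}
print(sitelinks_with_prefix(sitelinks))
# ['commonswiki', 'frwikisource', 'itwikisource']
```

This matches the three entries marked with ***** and leaves out dewikisource, which is consistent with de.wikisource not needing an account in this case.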

Change 629842 abandoned by Mpaa:
[pywikibot/core@master] OAuth: minimize use of POST with OAuth

Reason:

https://gerrit.wikimedia.org/r/629842

GET or POST is not the real issue, I think.
At this point, I think that with OAuth the authorization is always sent with a request, and if the user does not exist on that wiki, the request fails.

The problem only seems to occur for sitelinks to special namespaces.

To solve this problem:

  • The problem does not seem to exist for sitelinks to the main namespace?
  • Why is there only a problem with sitelinks to special namespaces?
  • Couldn't a "delayed authentication" be implemented for item.get() until sitelinks are effectively referenced?
  • Or better: couldn't a user account be created automatically at first reference, just as happens when accessing a project explicitly using e.g. https://en.wikisource.org/wiki/Special:UserLogin

This way we could avoid most implicit login failures when using Pywikibot on PAWS?

As a first step, it would help if the error mentioned the wiki (and the item?) that caused the failed read. Right now you only get an error saying "You do not exist on some wiki", so I have no idea on which wiki to add my credentials.
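
Until the error message itself improves, a caller-side stopgap could at least attach the item ID to the failure. The sketch below is hypothetical: get_with_context is not a Pywikibot function, it only reports the item and not the wiki, and it is demonstrated with a stand-in fetch function so it runs without Pywikibot.

```python
def get_with_context(item_id, fetch):
    """Call fetch() and re-raise any failure with the item ID attached."""
    try:
        return fetch()
    except Exception as error:
        raise RuntimeError(f'{item_id}: {error}') from error


# Stand-in for a failing item.get(); with Pywikibot one might instead call
# get_with_context(item.getID(), item.get) inside the generator loop.
def failing_fetch():
    raise ValueError('user does not exist here')


try:
    get_with_context('Q89546', failing_fetch)
except RuntimeError as error:
    print(error)  # Q89546: user does not exist here
```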