Page MenuHomePhabricator

reflinks.py doesn't save changes to articles
Closed, ResolvedPublic

Description

It is quite a strange problem that I observe for a long time and I am not sure how to reproduce it...

I am running the latest reflinks.py from core on Labs:
https://ru.wikipedia.org/w/index.php?limit=50&tagfilter=&title=Служебная%3AВклад&contribs=user&target=Rubinbot&namespace=&tagfilter=&year=2015&month=-1

As you can see from the bot's contributions, the bot starts working well but then stop editing while logs on the labs show that the script is still running and the bot even saves pages...
For example, I see in the recent output the following:

@@ -57 +57 @@
- Na sed'moj den' s"лmok v ofise odnoj iz kinokompanij, osushchestvlyayushchikh proizvodstvo fil'ma, prorvalo trubu, i ofis byl zatoplen. Stoit otmetit', chto syuzhet i sobytiya fil'ma tesno svyazany s vodnoj sredoj. Odnako na иtom strannye sluchai ne konchilis': vo vremya s"лmok rekvizitnyj furgon «okkupiroval» roj pchлl, v svyazi s иtim ves' rekvizit prishlos' spasat'. Pchлly ischezli tak zhe vnezapno, kak i poyavilis'. Takzhe v uzhe upomyanutom ofise opyat' proizoshlo zatoplenie - na иtot raz vinovnikom иtogo stal razlivshijsya bol'shoj kontejner s vodoj. Strannyj sluchaj proizoshлl i s kostyumershej fil'ma - vyjdya s parkovki, nakhodyashchejsya na territorii studii ''[[Universal Pictures]]'', ona uvidela begushchego k nej po asfal'tu olenya. V иtom sluchae takzhe proslezhivaetsya svyaz' s sobytiyami fil'ma, gde mozhno videt' pokhozhuyu scenu.<ref name="amr">[http://www.kinopoisk.ru/level/1/film/61305/view_info/ok/ Informaciya o fil'me na kinopoisk.ru]</ref> Suevernyj Khidиo Nakata posle podobnykh proisshestvij pribegal k ritualu izgnaniya zlykh dukhov. ***
+ Na sed'moj den' s"лmok v ofise odnoj iz kinokompanij, osushchestvlyayushchikh proizvodstvo fil'ma, prorvalo trubu, i ofis byl zatoplen. Stoit otmetit', chto syuzhet i sobytiya fil'ma tesno svyazany s vodnoj sredoj. Odnako na иtom strannye sluchai ne konchilis': vo vremya s"лmok rekvizitnyj furgon «okkupiroval» roj pchлl, v svyazi s иtim ves' rekvizit prishlos' spasat'. Pchлly ischezli tak zhe vnezapno, kak i poyavilis'. Takzhe v uzhe upomyanutom ofise opyat' proizoshlo zatoplenie - na иtot raz vinovnikom иtogo stal razlivshijsya bol'shoj kontejner s vodoj. Strannyj sluchaj proizoshлl i s kostyumershej fil'ma - vyjdya s parkovki, nakhodyashchejsya na territorii studii ''[[Universal Pictures]]'', ona uvidela begushchego k nej po asfal'tu olenya. V иtom sluchae takzhe proslezhivaetsya svyaz' s sobytiyami fil'ma, gde mozhno videt' pokhozhuyu scenu.<ref name=autogenerated1>[http://www.kinopoisk.ru/level/1/film/61305/view_info/ok/ Informaciya o fil'me na kinopoisk.ru]</ref> Suevernyj Khidиo Nakata posle podobnykh proisshestvij pribegal k ritualu izgnaniya zlykh dukhov. ***

Comment: bot: dobavlenie zagolovkov v snoski; ispravlenie dvojnykh snosok, sm. [[mw:Manual:Pywikibot/refLinks|ChaVo]] ***

That's the edit in the following article:
https://ru.wikipedia.org/w/index.php?title=%D0%97%D0%B2%D0%BE%D0%BD%D0%BE%D0%BA_2_(%D1%84%D0%B8%D0%BB%D1%8C%D0%BC,_2005)&action=history

Though the bot says that the page was edited, there was no actual changes in the history.

I am gonna clean up the logs and start the bot again but if you have advice how to catch the problems and the issue, I will be glad to use your instructions.

For information: I run the bot by the following command:

jsub -N reflinks -once -continuous -mem 2048M sh refs.sh

Details

Related Gerrit Patches:

Event Timeline

Rubin16 created this task.Apr 1 2015, 12:50 PM
Rubin16 raised the priority of this task from to Normal.
Rubin16 updated the task description. (Show Details)
Rubin16 added a project: Pywikibot.
Rubin16 added subscribers: Unknown Object (MLST), Rubin16.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 1 2015, 12:50 PM

The same issue happened today: I ran the bot with extended memory limit up to 4096M as I thought it could be issue.

The bot saved changes till the end of April 1:
https://ru.wikipedia.org/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%92%D0%BA%D0%BB%D0%B0%D0%B4/Rubinbot

Now I see in logs that bot said that it had saved changes to a number of articles but there is nothing similar in reality:

https://ru.wikipedia.org/w/index.php?title=%D0%97%D1%83%D1%81%D0%BC%D0%B0%D0%BD,_%D0%90%D0%BB%D0%BB%D0%B0_%D0%92%D0%B5%D0%BD%D0%B8%D0%B0%D0%BC%D0%B8%D0%BD%D0%BE%D0%B2%D0%BD%D0%B0&action=history

@@ -12 +12 @@
- Naibol'shij nauchnyj rezonans<ref>[https://scholar.google.com/scholar?cites=3475199132505430838&as_sdt=5,33&sciodt=0,33&hl=en Spisok nauchnykh rabot, citiruyushchikh monografiyu «Directed Evolution»]</ref> poluchila razrabotannaya Borisom Zlotinym i Alloj Zusman «sistema napravlennoj иvolyucii», izlozhennaya imi v knige «Directed Evolution: Philosophy, Theory and Practice» (2001, 2004). Sredi drugikh razrabotok - «diversionnyj analiz» (metod vyyavleniya i prognozirovaniya avarijnykh situacij i nezhelatel'nykh yavlenij)<ref>[http://vikent.ru/enc/852/ Diversionnyj analiz]</ref><ref>[http://www.rusnor.org/pubs/articles/11541.htm Oleg Figovskij «Tekhnologiya innovacij»]</ref>. ***
+ Naibol'shij nauchnyj rezonans<ref name=autogenerated1>[https://scholar.google.com/scholar?cites=3475199132505430838&as_sdt=5,33&sciodt=0,33&hl=en Spisok nauchnykh rabot, citiruyushchikh monografiyu «Directed Evolution»]</ref> poluchila razrabotannaya Borisom Zlotinym i Alloj Zusman «sistema napravlennoj иvolyucii», izlozhennaya imi v knige «Directed Evolution: Philosophy, Theory and Practice» (2001, 2004). Sredi drugikh razrabotok - «diversionnyj analiz» (metod vyyavleniya i prognozirovaniya avarijnykh situacij i nezhelatel'nykh yavlenij)<ref>[http://vikent.ru/enc/852/ Diversionnyj analiz]</ref><ref>[http://www.rusnor.org/pubs/articles/11541.htm Oleg Figovskij «Tekhnologiya innovacij»]</ref>. ***

@@ -39 +39 @@
- * Alla Zusman, Boris Zlotin. Directed Evolution: Philosophy, Theory and Practice. Ideation International Inc., 2001, 2004.<ref>[https://scholar.google.com/scholar?cites=3475199132505430838&as_sdt=5,33&sciodt=0,33&hl=en Spisok nauchnykh rabot, citiruyushchikh monografiyu «Directed Evolution»]</ref> ***
+ * Alla Zusman, Boris Zlotin. Directed Evolution: Philosophy, Theory and Practice. Ideation International Inc., 2001, 2004.<ref name=autogenerated1 /> ***
 ***
Comment: bot: dobavlenie zagolovkov v snoski; ispravlenie dvojnykh snosok, sm. [[mw:Manual:Pywikibot/refLinks|ChaVo]] ***

I don't know how to cope with this problem...

Rubin16 raised the priority of this task from Normal to High.Apr 3 2015, 7:32 AM
Rubin16 set Security to None.

I started the bot with -debug option, the same output is shown but one warning message is added now:

WARNING: /data/project/rubinbot2/core/pywikibot/tools/ip.py:53: DeprecationWarning: ipaddress backport not available.

Couldn't it be the reason?

XZise added a subscriber: XZise.Apr 3 2015, 8:27 PM

But this warning is shown at all times isn't it (or on startup)? So that wouldn't really explain why it's suddenly stops working and I don't know how this could interfere without showing more errors.

But the missing errors are what I'm wondering. I mean if it would fail about an hour after it did the first edit then it could be related to T61678: Implement badtoken detection and recovery but that would cause an error. But otherwise I don't see why it would silently not do edits. When you log it (afaik does -debug also activates logging it) it should show you the requests it does to the API and also the responses. Do you see requests happening after the bot starts working?

Another problem (related to the logs) I have is that I can only see edits which work :/

Yes, the error about ipadress backport is shown every edit...

I started a version from compat now and it works correctly.

There are no other errors/warnings shown while using -debug feature, I can start it once more and upload full traceback somewhere if it would be helpful.

jayvdb added a subscriber: jayvdb.Apr 4 2015, 7:34 AM
XZise added a comment.Apr 4 2015, 8:39 AM

When you actually mean traceback then yeah please post it because that is the most helpful information for bug reports. In fact you always should include it in any bug report (not only with pywikibot but for all software which shows you a traceback/stack trace).

If you just mean the log, well it could be also helpful as I have no idea what could cause that and maybe that would clear things up. It's very hard to fix something when the description of the bug is very vague and it's not a common bug. Just check through the log that it doesn't accidentally contain sensitive information (like passwords).

When you actually mean traceback then yeah please post it because that is the most helpful information for bug reports. In fact you always should include it in any bug report (not only with pywikibot but for all software which shows you a traceback/stack trace).
If you just mean the log, well it could be also helpful as I have no idea what could cause that and maybe that would clear things up. It's very hard to fix something when the description of the bug is very vague and it's not a common bug. Just check through the log that it doesn't accidentally contain sensitive information (like passwords).

I am gonna run the bot with the -debug parameter and then post the whole output...
Is it possible to gather more information with other parameters?

XZise added a comment.Apr 4 2015, 2:14 PM

Not sure if -v -v helps. It certainly produces a more verbose output ;)

The output is quite huge, so I am posting parts of it:

  1. from the beginning to the first article saved (saved per output and per actual edits, too):

http://pastebin.com/rb2tTxGP

  1. the second edit that was actually saved

http://pastebin.com/PwUbTsCW

After those 2 edits no pages were saved - I looked for "@@" in output and pasted such saves that weren't saved, the bot has some API problems as I see:

  1. http://pastebin.com/eRRkvDwP
  2. http://pastebin.com/Re5Jxxit

All other edits show the same error.

XZise added a comment.Apr 5 2015, 2:10 PM

Okay interesting. It seems to be T61678: Implement badtoken detection and recovery. The last two log entries were very helpful even though a bit confusing, but the actual error is:

Error saving page [[Al'-Dzhazira]] (badtoken: Invalid token) ***
 
Traceback (most recent call last):
  File "/data/project/rubinbot2/core/pywikibot/page.py", line 1047, in _save
    watch=watchval, bot=botflag, **kwargs)
  File "/data/project/rubinbot2/core/pywikibot/site.py", line 1057, in callee
    return fn(self, *args, **kwargs)
  File "/data/project/rubinbot2/core/pywikibot/site.py", line 4140, in editpage
    result = req.submit()
  File "/data/project/rubinbot2/core/pywikibot/data/api.py", line 1698, in submit
    raise APIError(code, info, **result["error"])
APIError: badtoken: Invalid token

If you could open scripts/reflinks.py and add the 3rd line of the following snippet (line 520):

editedpages = 0                                                         
for page in self.generator:                                             
    page.site.tokens._tokens.clear()                                    
    try:                                                                
        # Load the page's text from the wiki

That is the work around I was talking about and clears the token cache. In that case it shouldn't suddenly stop working and if it works then for you I think there should be really a fix for T61678 happening soon.

XZise added a comment.Apr 5 2015, 3:20 PM

Okay the logs have been a bit confusing at first because the responses didn't contain any information about the error itself so I tried to determine why that is and the result is in https://gerrit.wikimedia.org/r/#/c/201954/ . It won't fix your problem but in the logs it should now show the complete error message looking something like that:

{u'servedby': u'mw1233', u'error': {'help': u'See https://ru.wikipedia.org/w/api.php for API usage', 'info': 'Invalid token', 'code': 'badtoken'}}

And that would tell it immediately. But I'm still confused why you had to activate the log to show the error. It should've shown at least something.

I updated the bot to the latest revision (but not changed anything in reflinks.py in your post about badtoken recovery), the crash reports are the following.

Not it seems that it doesn't save not only because of API problems, but something more happens.

No changes were needed on [[Ziggy Stardust]]
No changes were needed on [[Zhuzhou Electric Locomotive]]
No changes were needed on [[Zimmer (avtomobil'naya marka)]] ***
Traceback (most recent call last):
  File "core/pwb.py", line 227, in <module>
    run_python_file(filename, argv, argvu, file_package)
  File "core/pwb.py", line 85, in run_python_file
    main_mod.__dict__)
  File "core/scripts/reflinks.py", line 846, in <module>
    main()
  File "core/scripts/reflinks.py", line 843, in main
    bot.run()
  File "core/scripts/reflinks.py", line 647, in run
    linkedpagetext = self.NON_HTML.sub('', linkedpagetext)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb4 in position 17717: ordinal not in range(128)
Found 1 wikipedia:mk processes running, including this one.
Dropped throttle(s).
CRITICAL: Waiting for 1 network thread(s) to finish. Press ctrl-c to abort
All threads finished.
=== Pywikibot framework v2.0 -- Logging header ===
COMMAND: ['reflinks.py', '-start:Z', '-always', '-ignorepdf', '-debug', '-log:refs.txt', '-v', '-v']
DATE: 2015-04-18 15:53:55.071461 UTC
VERSION: [https] r-pywikibot-core.git (c6ad0b3, g5374, 2015/04/17, 22:16:31, n/a)
SYSTEM: ('Linux', 'tools-exec-10', '3.2.0-59-virtual', '#90-Ubuntu SMP Tue Jan 7 23:02:33 UTC 2014', 'x86_64')
CONFIG FILE DIR: /data/project/rubinbot2/core
PACKAGES:
  _elementtree (/usr/lib/python2.7/lib-dynload/_elementtree.so) = 1.0.6
  distutils (/usr/lib/python2.7/distutils/) = 2.7.3
  email (/usr/lib/python2.7/email/) = 4.0.3
  json (/usr/lib/python2.7/json/) = 2.0.9
  logging (/usr/lib/python2.7/logging/) = 0.5.1.2
  mwparserfromhell (/usr/lib/python2.7/dist-packages/mwparserfromhell/) = 0.3.3
  pickle (/usr/lib/python2.7/pickle.pyc) = $Revision: 72223 $
  pyexpat (/usr/lib/python2.7/lib-dynload/pyexpat.so) = 2.7.3
  pywikibot ([path unknown]) = ??
  re (/usr/lib/python2.7/re.pyc) = 2.2.1
  urllib (/usr/lib/python2.7/urllib.pyc) = 1.17
  urllib2 (/usr/lib/python2.7/urllib2.pyc) = 2.7
MODULES:
  /data/project/rubinbot2/core/pywikibot/comms/http.py 54549f1 2015-04-18 15:48:23.664873
  /data/project/rubinbot2/core/pywikibot/data/api.py 41d6c82 2015-04-18 15:48:23.744875
  /data/project/rubinbot2/core/pywikibot/userinterfaces/__init__.py 43eceeb 2015-04-18 15:48:24.104884
  /data/project/rubinbot2/core/pywikibot/textlib.py 410c100 2015-04-18 15:48:24.068883
  /data/project/rubinbot2/core/pywikibot/i18n.py 2fff519 2015-04-18 15:48:23.964880
  /data/project/rubinbot2/core/pywikibot/comms/threadedhttp.py a388c0f 2015-04-18 15:48:23.684874
  /data/project/rubinbot2/core/pywikibot/date.py 976edef 2015-04-18 15:48:23.768876
  /data/project/rubinbot2/core/pywikibot/data/__init__.py 44183c7 2015-04-18 15:48:23.732875
  /data/project/rubinbot2/core/pywikibot/exceptions.py 117165a 2015-04-18 15:48:23.788876
  /data/project/rubinbot2/core/pywikibot/site.py 5848169 2015-04-18 15:48:24.056883
  /data/project/rubinbot2/core/pywikibot/bot.py 446e8e1 2015-04-18 15:48:23.648873
  /data/project/rubinbot2/core/pywikibot/__init__.py db96621 2015-04-18 15:48:23.628872
  /data/project/rubinbot2/core/pywikibot/throttle.py 4157254 2015-04-18 15:48:24.080883
  /data/project/rubinbot2/core/pywikibot/page.py a2eb04c 2015-04-18 15:48:24.016882
  /data/project/rubinbot2/core/pywikibot/family.py f0d0d8f 2015-04-18 15:48:23.948880
  /data/project/rubinbot2/core/pywikibot/xmlreader.py 2b77234 2015-04-18 15:48:24.192886
  /data/project/rubinbot2/core/pywikibot/plural.py c9edb6b 2015-04-18 15:48:24.032882
  /data/project/rubinbot2/core/scripts/noreferences.py 3fa562a 2015-04-18 15:48:24.648897
  /data/project/rubinbot2/core/pywikibot/version.py 85f37a9 2015-04-18 15:48:24.184886
  /data/project/rubinbot2/core/pywikibot/userinterfaces/terminal_interface.py 9a5fbf1 2015-04-18 15:48:24.124884
  /data/project/rubinbot2/core/pywikibot/config2.py 971b19d 2015-04-18 15:48:23.728874
  /data/project/rubinbot2/core/pywikibot/tools/ip.py 808c0cc 2015-04-18 15:48:24.100884
  /data/project/rubinbot2/core/pywikibot/userinterfaces/terminal_interface_base.py 968a14b 2015-04-18 15:48:24.136885
  /data/project/rubinbot2/core/pywikibot/pagegenerators.py e07bf44 2015-04-18 15:48:24.028882
  /data/project/rubinbot2/core/pywikibot/userinterfaces/terminal_interface_unix.py 60d8cb2 2015-04-18 15:48:24.140885
  /data/project/rubinbot2/core/pywikibot/tools/__init__.py 22d6c4d 2015-04-18 15:48:24.096884
  /data/project/rubinbot2/core/pywikibot/diff.py 6e0e38b 2015-04-18 15:48:23.772876
  /data/project/rubinbot2/core/pywikibot/login.py c0cc049 2015-04-18 15:48:23.992881
  /data/project/rubinbot2/core/pywikibot/comms/__init__.py 747d0a7 2015-04-18 15:48:23.660873
  /data/project/rubinbot2/core/pywikibot/userinterfaces/transliteration.py efd4103 2015-04-18 15:48:24.160885
=== === === === === === === === === === === === === === 
Pywikibot rdb9662181f969ff16548641a4fb937164b4647bb
Python 2.7.3 (default, Feb 27 2014, 19:58:35) 
[GCC 4.6.3]
LOADING SITE wikipedia:ru VERSION: 1.26wmf1
LOADING SITE wikipedia:ru VERSION: 1.26wmf1
Found 1 wikipedia:ru processes running, including this one.

After that the bot restarts, and crashes again and again with the same message.

Some more output about the unicode crash:

The base directory is /data/project/rubinbot2/core
HTTP error (401) for http://books.google.ru/books?id=dLsthWsAHk0C&pg=PT1163&dq=ZKRD&hl=ru&sa=X&ei=tpATU_HvL-XR4QSYgYGADg&ved=0CE8Q6AEwBQ on [[ZKRD]]
HTTP error (401) for http://books.google.ru/books?id=j2tmrfsX4tAC&pg=RA18-PT1192&dq=ZKRD&hl=ru&sa=X&ei=tpATU_HvL-XR4QSYgYGADg&ved=0CCoQ6AEwAA on [[ZKRD]]
HTTP error (401) for http://books.google.ru/books?id=7DiB3z2fBpAC&pg=PT464&dq=%22zero-player+game%22 on [[Zero Player Game]]
<type 'exceptions.UnicodeDecodeError'>

Okay as far as I can see, it is querying the server and gets bytes back but then it tries to apply a regex and to do that it apparently needs to decode the bytes and it's using the ASCII codec for that. I'm not 100 % sure why it tries to decode it here and why it had worked before (I guess T89589: Usage of unicode_literals from __future__ package plays a role). I need to take a closer look on how to fix that properly.

Okay I'm confused. It works fine for me:

Python 2.7.8 (default, Aug 28 2014, 16:47:22) 
[GCC 4.8.3 20140624 (Red Hat 4.8.3-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.match('A', u'Ö'.encode('latin1'))
>>> re.match(u'A', u'Ö'.encode('latin1'))
>>> re.sub(u'A', '', u'Ö'.encode('latin1'))
'\xd6'
>>> re.sub(u'(?is)A', '', u'Ö'.encode('latin1'))
'\xd6'
>>> 
xzise@localhost:~/Programms/pywikibot/core$ python
Python 2.7.8 (default, Aug 28 2014, 16:47:22) 
[GCC 4.8.3 20140624 (Red Hat 4.8.3-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from __future__ import unicode_literals
>>> import re
>>> re.sub(u'(?is)A', '', u'Ö'.encode('latin1'))
'\xd6'
>>> re.sub('(?is)A', '', u'Ö'.encode('latin1'))
'\xd6'
>>> R = re.compile('(?is)A')
>>> R.sub('', u'Ö'.encode('latin1'))
'\xd6'

@XZise if you want, I can temporary give you autopatrolled or bot rights in ru.wiki and you can try to run the reflinks.py there... I tried to run the bot from another set of articles - the same decoding error appears in other articles, too.

XZise added a comment.Apr 19 2015, 3:41 PM

Ah I think no need for that, because I think I know what is happening:

>>> from __future__ import unicode_literals
>>> import re
>>> re.sub('(?is)A', '', 'Ö'.encode('latin1'))
'\xd6'
>>> re.sub('(?is)A', '', 'ÖA'.encode('latin1'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xzise/.pyenv/versions/2.7.8/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd6 in position 0: ordinal not in range(128)

The error only appears when it actually replaces anything. In my previous examples it didn't replaced anything and it worked. But when it replaces something it tries to put the unicode into the bytes which doesn't work. You could test and verify that when you edit the line where the error happens (from your previous errors that is "core/scripts/reflinks.py" in line 647). Currently it looks like this:

linkedpagetext = self.NON_HTML.sub('', linkedpagetext)

But it should work when it looks like this:

linkedpagetext = self.NON_HTML.sub(str(''), linkedpagetext)

I need to figure out if linkedpagetext is also bytes in Python 3 but that fix will work at least in Python 2.

Change 205095 had a related patch set uploaded (by XZise):
[FIX] reflinks: Expect that urlopen return bytes

https://gerrit.wikimedia.org/r/205095

XZise added a comment.Apr 19 2015, 4:13 PM

Okay if you want you could try that patch. It seemed that this script wasn't able to run at all with Python 3.

Change 205095 merged by jenkins-bot:
[FIX] reflinks: Expect that urlopen return bytes

https://gerrit.wikimedia.org/r/205095

jayvdb closed this task as Resolved.Apr 20 2015, 1:13 PM
jayvdb claimed this task.