In T173385#5809505, @Ruthven wrote:@Mpaa Well, the same happens without the option ql=4.
Jan 16 2020
Jan 15 2020
I am not sure, but in this case you are creating a new page on a different wiki, using an existing page with ql=4.
It is probably trying to set ql=4 on the new page as well, violating the rule about the status change Not Proofread -> Proofread/Validated.
@Tpt, any opinion?
Maybe pywikibot could be smarter here and check?
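The check suggested above could be sketched like this. The helper name and the numeric levels are assumptions for illustration (in ProofreadPage, 1 is "Not Proofread" and 4 is "Validated"); this is not actual pywikibot code:

```python
# Hypothetical guard, not actual pywikibot code: a brand-new Page:
# page may not start above "Not Proofread", so a quality level copied
# from an existing source page is downgraded before saving.
NOT_PROOFREAD = 1  # ProofreadPage quality levels run 0..4, 4 = Validated

def target_quality(source_ql, target_exists):
    """Return a quality level that is safe to save on the target wiki."""
    if not target_exists and source_ql > NOT_PROOFREAD:
        return NOT_PROOFREAD
    return source_ql
```

With such a guard, copying a ql=4 page to a wiki where the page does not exist yet would save it as Not Proofread instead of triggering the status-change violation.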
Mpaa added a comment to T242425: Wikidata shows different image (file) sizes, than Commons reports for the current version.
Went back to both integers.
To me this is solved.
Mpaa closed T242517: Thumbnails for PDF files not found when width is not an Integer in URL link as Resolved.
Seems OK to me.
Mpaa added a comment to T242795: File deleted at Commons is no longer available even after page restore.
All pages:
Jan 14 2020
To understand the wanted behaviour: what is the expected Site for a redirected 'code'?
Site('aa', 'wikisource') = ?
E.g. APISite("mul", "wikisource")?
Jan 12 2020
Mpaa added a comment to T242517: Thumbnails for PDF files not found when width is not an Integer in URL link.
In T242517#5795298, @Umherirrender wrote:Image width and height are always integer, but it seems that some files are now uploaded with float values
I think width and height are evaluated at run time (possibly cached), so I assume that once this is deployed, things will get back to normal.
Jan 11 2020
Mpaa added a comment to T239510: No preview thumbnail generated for PDF on Commons: "Error: 429, Too Many Requests".
Duplicate of T188885?
Mpaa added a comment to T242425: Wikidata shows different image (file) sizes, than Commons reports for the current version.
Dimensions are floats in both places now.
Mpaa updated the task description for T242517: Thumbnails for PDF files not found when width is not an Integer in URL link.
Mpaa updated subscribers of T242517: Thumbnails for PDF files not found when width is not an Integer in URL link.
@Legoktm, @Umherirrender could you please take a look at this commit?
Is it intentional that the computation has been changed?
https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/PdfHandler/+/560127/2/includes/PdfImage.php
Jan 9 2020
It's impossible to get image description from Commons via pywikibot API.
That is because it is not exposed by the MediaWiki API, as far as I know.
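For what it's worth, the MediaWiki API's prop=imageinfo with iiprop=extmetadata may carry the description in its ImageDescription field, which could serve as a workaround outside pywikibot. A minimal sketch that only builds the request parameters (the actual HTTP call against https://commons.wikimedia.org/w/api.php is left to the caller):

```python
def commons_imageinfo_params(file_title):
    """Build MediaWiki API query parameters requesting a file's
    extended metadata (which includes an ImageDescription field)."""
    return {
        'action': 'query',
        'format': 'json',
        'titles': file_title,
        'prop': 'imageinfo',
        'iiprop': 'extmetadata',
    }

# Usage sketch (network call not executed here):
# requests.get('https://commons.wikimedia.org/w/api.php',
#              params=commons_imageinfo_params('File:Example.jpg'))
```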
In T242169#5788350, @Jan.Kamenicek wrote:But when I open the PDF document in my computer and copypaste the text into a word processor, it looks thus:
...passing safely through a country occupied by Sigismund's
troops, they arrived near Kralov6 Hradec. They called to...
This means that the original text layer of the PDF is good, only Mediawiki extracts it badly.
This is not indicative; it can depend on a lot of things (OS, browser, PDF plugin used, etc.). I tried once on Linux and twice on Windows (once with Acrobat Reader, once reading the PDF in Edge), and I got three different results.
Jan 8 2020
@Jan.Kamenicek, I didn't mean that the text layer got improved by the djvu conversion process.
The flags used with the pdftotext command matter, and they are set in MediaWiki; see my comparison above.
@Aklapper, I re-added MediaWiki-extensions-PdfHandler as it seems relevant to me.
This might be useful to understand pdftotext options.
https://github.com/EmpowermentZone/EdSharp/blob/master/Convert/Xpdf/pdftotext.txt
Mpaa added a project to T242169: Bad text layer extraction from PDFs: MediaWiki-extensions-PdfHandler.
Playing with "pdftotext" options, output can be similar to djvu text layer.
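As a reference for such experiments, a small sketch that builds a pdftotext command line from Python. The flags are documented xpdf/poppler options (-layout preserves the physical layout, -nopgbrk drops form feeds between pages); which combination best approximates the djvu text layer is a guess, not something established here:

```python
def pdftotext_command(src_pdf, dst_txt, layout=True, nopgbrk=True):
    """Build a pdftotext command line; run it with subprocess.run()."""
    cmd = ['pdftotext']
    if layout:
        cmd.append('-layout')   # maintain the original physical layout
    if nopgbrk:
        cmd.append('-nopgbrk')  # do not insert form feeds between pages
    cmd += [src_pdf, dst_txt]
    return cmd
```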
Nov 16 2019
Have you considered getting the data from dumps, if possible?
In T238448#5668653, @Bugreporter wrote:not all (special) projects are Wikidata clients.
In T235500#5663181, @DD063520 wrote:@Xqt : hi, would you consider again this patch, currently I make this modification locally to have some code running
Nov 15 2019
In T238404#5668434, @DD063520 wrote:Mhmmm .... on github I cannot find the "ban" language in the wikipedia_family:
So would the newest version solve my problem?
I think it was added here: 2c565d01c381a858c4d02cf1f4c3372b589d5422
Works for me, ban is already in wikipedia_family.py.
It is quite slow, though.
Nov 14 2019
Make _flush aware of _putthread ongoing tasks
Nov 9 2019
Oct 27 2019
Mpaa added a comment to T236614: Page.title(as_filename=True) don't remove "\"" (quotes) forbidden character.
This is OS dependent, on Linux it is acceptable.
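To illustrate the OS dependency with a hypothetical helper (this is not the actual Page.title(as_filename=True) implementation): Windows rejects the characters \ / : * ? " < > | in filenames, while on Linux only '/' (and NUL) is reserved, so a double quote is a legal filename character there.

```python
import re

# Characters Windows rejects in filenames; Linux only reserves '/'.
WINDOWS_FORBIDDEN = '\\/:*?"<>|'

def title_as_filename(title, windows=True):
    """Hypothetical sanitizer: replace forbidden characters with '_'."""
    forbidden = WINDOWS_FORBIDDEN if windows else '/'
    return re.sub('[' + re.escape(forbidden) + ']', '_', title)
```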
Sep 22 2019
site_tests.py: add test for site.assert_valid_iter_params
site_tests.py: fix test_preload_templates_and_langlinks
Aug 16 2019
[bugfix] Fix the comparison in archivebot
proofreadpage.py: fix footer detection
Aug 15 2019
If you are going to submit a patch with the fix (plus possibly tests), it will be appreciated.
You can also use https://tools.wmflabs.org/gerrit-patch-uploader/ in case you do not want to go through the standard process (git/gerrit).
Aug 13 2019
flake8: fix error C412-Unnecessary list comprehension
Aug 11 2019
Aug 6 2019
It also broke archiving on en.wikisource, which uses "User:Wikisource-bot/config".
Jul 18 2019
archivebot.py: don't reorder template parameters
Jul 16 2019
Wouldn't loading config from self.page.raw_extracted_templates instead of self.page.templatesWithParams() keep the right order?
Jul 6 2019
proofreadpage_tests.py: Fix variable name
djvu.py: fix name of variable for filename
Jun 18 2019
It looks like the googleOCR answer is not deterministic.
An option could be to check that at least x% of the characters are equal, instead of requiring full equality.
The purpose is to check that the query to googleOCR is successful, not to test Google's algorithm.
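The x% idea could be sketched with the standard library's difflib; the threshold value here is an arbitrary assumption:

```python
import difflib

def ocr_roughly_equal(expected, actual, threshold=0.9):
    """True when at least ~threshold of the characters match, so a
    non-deterministic OCR answer still counts as a successful query."""
    ratio = difflib.SequenceMatcher(None, expected, actual).ratio()
    return ratio >= threshold
```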
Apr 9 2019
Mpaa added a comment to T219376: retrieveMetaData() in DjVuImage.php creates knock-on error when a page has invalid text layer.
See T214729
Apr 1 2019
Mpaa added a comment to T219281: Move -except from add_text.py and -excepttext from replace.py to global page generator filters.
I would suggest -grep and -grepnot, similar to -titleregex and -titleregexnot.
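A sketch of what such a global filter might do; the option names -grep/-grepnot are the proposal here, not an existing pagegenerators feature, and the helper below is hypothetical:

```python
import re

def grep_filter(generator, grep=None, grepnot=None):
    """Yield only pages whose text matches `grep` (if given) and does
    not match `grepnot` (if given)."""
    for page in generator:
        text = page.text
        if grep is not None and not re.search(grep, text):
            continue
        if grepnot is not None and re.search(grepnot, text):
            continue
        yield page
```

Implemented at the pagegenerators level, the same filter would apply uniformly to every script instead of being duplicated as -except/-excepttext.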
Feb 16 2019
Mpaa added a comment to T212076: proofreadpage_tests.TestPageOCR.test_ocr_googleocr sometimes fails with ValueError.
In T212076#4959217, @Xqt wrote:Now we have a json.decoder.JSONDecodeError:
https://ci.appveyor.com/project/ladsgroup/pywikibot-g4xqx/build/job/dhuv540dipdiw9si
Feb 14 2019
Wikisource is needing to touch 000000s of files across multiple languages.
That should probably have its own bug report... extensions are expected to work correctly without constant bot maintenance.
Feb 11 2019
Dvorapa awarded T198452: Always enable namespace filtering in QueryGenerator a Like token.
Feb 3 2019
Mpaa added a comment to T214729: Error during parsing of djvu text layer to produce metadata leads to page offset in ProofreadPage extension.
Note:
I fixed the file on Commons.
The buggy file is https://commons.wikimedia.org/w/index.php?title=File:Philosophical_Transactions_-_Volume_053.djvu&oldid=336270264
Feb 2 2019
Dvorapa awarded T214234: add -querypage parameter to pagegenerators a Manufacturing Defect? token.
Jan 30 2019
Kizule awarded T214234: add -querypage parameter to pagegenerators a Love token.
I do not agree.
There is a patch proposing to introduce -querypage, which will be valid for all special pages, in order to avoid a proliferation of arguments.
See T214234.
Jan 26 2019
Mpaa added a comment to T214729: Error during parsing of djvu text layer to produce metadata leads to page offset in ProofreadPage extension.
Something like this.
...
$txt = preg_replace_callback( $reg, [ $this, 'pageTextCallback' ], $txt );
$reg_failed = '/(?m)^failed$/';
$txt = preg_replace_callback( $reg_failed, [ $this, 'pageTextCallbackFailed' ], $txt );
$txt = "<DjVuTxt>\n<HEAD></HEAD>\n<BODY>\n" . $txt . "</BODY>\n</DjVuTxt>\n";
Jan 25 2019
Mpaa added a comment to T214729: Error during parsing of djvu text layer to produce metadata leads to page offset in ProofreadPage extension.
The improvement is to be done in DjVuImage.php: function retrieveMetaData()
https://doc.wikimedia.org/mediawiki-core/master/php/DjVuImage_8php_source.html#l00246
Mpaa added a comment to T214729: Error during parsing of djvu text layer to produce metadata leads to page offset in ProofreadPage extension.
It is djvutxt that fails for page 51:
Mpaa added a comment to T214729: Error during parsing of djvu text layer to produce metadata leads to page offset in ProofreadPage extension.
I guess the problem might lie somewhere here:
https://doc.wikimedia.org/mediawiki-core/master/php/DjVuImage_8php_source.html#l00308
Jan 19 2019
Jan 16 2019
Jan 5 2019
Mpaa closed T205190: code cleanup: remove deprecation warning for -dry option in basic.py as Resolved.
Jan 4 2019
Mpaa added a comment to T212741: classes derived from object should always call super in Initializer.
I am not sure whether this is a bug or a must; it is more about how you want the classes to cooperate with each other.
The MRO chain is not broken in the first example; the design is such that the chain of calls does not propagate all the way up (it might also be a deliberate design choice, see e.g. https://rhettinger.wordpress.com/2011/05/26/super-considered-super/#comment-86).
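A minimal illustration of the point: with cooperative super() calls the chain follows the MRO, and a class that omits the call simply stops the chain there.

```python
class Root:
    def __init__(self):
        self.calls = getattr(self, 'calls', [])
        self.calls.append('Root')

class Stopper(Root):
    def __init__(self):
        self.calls = getattr(self, 'calls', [])
        self.calls.append('Stopper')
        # No super().__init__() here: classes after Stopper in the
        # MRO (here, Root) are never initialized.

class Cooperative(Root):
    def __init__(self):
        self.calls = getattr(self, 'calls', [])
        self.calls.append('Cooperative')
        super().__init__()  # continue along the MRO

class Mixed(Cooperative, Stopper):
    pass

# MRO: Mixed -> Cooperative -> Stopper -> Root -> object
```

Mixed().calls ends up as ['Cooperative', 'Stopper']: Root.__init__ is skipped because Stopper broke the chain, which may be exactly the intended design rather than a bug.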
Dec 29 2018
qpoffset is used; the issue is the same as T173293.
Dec 19 2018
Mpaa added a comment to T212076: proofreadpage_tests.TestPageOCR.test_ocr_googleocr sometimes fails with ValueError.
I copy it here for convenience.
Interesting, it looks like googleOCR answer is not deterministic or some bytes are lost somewhere.
Dec 16 2018
Mpaa removed a project from T211813: SSL CERTIFICATE_VERIFY_FAILED on generating family file: Pywikibot.
Mpaa added a comment to T212076: proofreadpage_tests.TestPageOCR.test_ocr_googleocr sometimes fails with ValueError.
I think it has been a temporary unavailability of the googleOCR service.
Dec 12 2018
I don't think it is a pywikibot issue.
Nov 25 2018
OK, we worked at the same time.
Added more checks in proofreadpage.py.
PYSETUP_TEST_EXTRAS=1 installs bs4
Nov 24 2018
It fails on:
"env": "LANGUAGE=en FAMILY=wikipedia PYWIKIBOT_TEST_PROD_ONLY=1",
and passes on:
"env": "LANGUAGE=zh FAMILY=wikisource PYSETUP_TEST_EXTRAS=1 PYWIKIBOT_TEST_PROD_ONLY=1 PYWIKIBOT_TEST_NO_RC=1",
so I guess it is related to the family.
Nov 23 2018
Mpaa added a comment to T205223: Update TestProofreadPageValidSite.test_json_format to not use the deprecated `rvcontentformat` parameter.
This is not correct.
proofreadpage.py uses "contentformat" in the edit action, which is not deprecated. See https://en.wikisource.org/w/api.php?action=help&modules=edit
Nov 19 2018
proofreadpage_tests are still waiting for upstream fixes.
Nov 15 2018
The problem is that in site.preloadpages(), max_ids is computed after the page list is split into chunks of 240.
for sublist in itergroup(pagelist, groupsize):  # <-- groupsize = 240
    # Do not use p.pageid property as it will force page loading.
    pageids = [str(p._pageid) for p in sublist
               if hasattr(p, '_pageid') and p._pageid > 0]
    cache = {}
    # In case of duplicates, return the first entry.
    for priority, page in enumerate(sublist):
        try:
            cache.setdefault(page.title(with_section=False),
                             (priority, page))
        except pywikibot.InvalidTitle:
            pywikibot.exception()
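For context, a minimal sketch of the chunking helper (pywikibot's actual itergroup lives in pywikibot.tools; this version only reproduces the behaviour). Because the loop sees at most `groupsize` pages at a time, anything computed inside it, such as max_ids, only reflects the current chunk:

```python
def itergroup(iterable, size):
    """Yield successive lists of at most `size` items."""
    group = []
    for item in iterable:
        group.append(item)
        if len(group) == size:
            yield group
            group = []
    if group:
        yield group
```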
Oct 19 2018
Thank you.
Oct 18 2018
@JJMC89, could you please elaborate on it? Then I might try to look into it. Thanks.
Aug 25 2018
In T113450#4517887, @zhuyifei1999 wrote:In T113450#4513357, @JAnD wrote:I am afraid, this is not because bot is broken, but because database is broken.
File a bug against MediaWiki-libs-Rdbms. Pywikibot itself has nothing to do with this.
Aug 20 2018
BTW, I could not find a way to edit a page directly in VE, so I need to switch.
Hi. Also here I switched from wikitext to VE. And then just save, no other actions.
In T202197#4516595, @matmarex wrote:I don't think this needed to be marked as "Unbreak Now!". The issue has existed for a while, and ProofreadPage is able to deal with missing or invalid usernames given in the "user" field (it must be – existing usernames already saved in existing content can become invalid, e.g. when the user is renamed).
But, since I already investigated it, there's not a lot of work to actually make the patch…
Mpaa raised the priority of T202197: Visual Editor removes user from ProofreadPage header from Medium to Needs Triage.
Yes, that is how I reproduced it. To clarify: 1) I went to edit mode (wikitext by default), then, without saving, 2) I switched to VE, and 3) I saved.
Aug 18 2018
The user seems to be blanked after a second edit to a Proofread page.
See https://en.wikisource.org/w/api.php?action=query&prop=revisions&titles=Page:From%20Kulja,%20across%20the%20Tian%20Shan%20to%20Lob-Nor%20(1879).djvu/273&rvlimit=10&rvprop=tags%7Ctimestamp%7Cuser%7Ccomment%7Ccontent%7Cinfo
Mpaa triaged T202197: Visual Editor removes user from ProofreadPage header as Unbreak Now! priority.
Aug 14 2018
Mpaa updated subscribers of T201904: Add "pagequality" right to User rights when logged in via OAuth.
Just for the record, I think the issue is whether the account is logged in via OAuth or not.
"Mpaa" also behaves differently depending on how I am logged in.
Aug 13 2018
Aug 12 2018
I found out that MpaaBot has no 'pagequality' rights.
Mpaa works as expected instead.
I think this is not an issue with ProofreadPage but with the permissions of users on enwikisource.
@Tpt, could you please look into this? Thanks.