English Wikisource has more good pages than French Wikisource, breaking the WikiStats tests for the largest wikisource
Closed, Resolved (Public)

Description

According to https://meta.wikimedia.org/wiki/Wikisource, French Wikisource still has more pages in total, but English Wikisource has more 'good' pages. We need to check which algorithm change broke this test:

FAIL: test_sort (tests.wikistats_tests.WikiStatsTestCase)
Test sorted results.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/wikistats_tests.py", line 35, in test_sort
    self.assertEqual(ws.languages_by_size('wikisource')[0], 'fr')
AssertionError: u'en' != u'fr'
- en
+ fr

Event Timeline

As far as I remember, WikiStats has always sorted by "good" articles. You may want to search through the history of languages_by_size in wikisource_family.

Change 269730 had a related patch set uploaded (by Xqt):
[bugfix] top['total'] item is basestring not unicode

https://gerrit.wikimedia.org/r/269730

Ah, I see the test was broken by your patch 9c4732f42f07, which you +2'd as a trivial change :P

The main thing is the test works, regardless of whether the result is incorrect :P
Anyway, the related patch has been waiting 5 weeks for review (as have some others).

As far as I remember, WikiStats has always sorted by "good" articles. You may want to search through the history of languages_by_size in wikisource_family.

Well, WikiStats currently lists French Wikisource as the highest in both number of pages and number of good pages:

http://s23.org/wikistats/wikisources_html.php (data last updated 2014-10-21) :/

        all     good
fr  1399468  1677708
en  1029165  1614470

However, Special:Statistics shows English Wikisource as higher in good pages:

        all     good
fr  2006427   252594
en  1903161   568107

Strangely different results. (scratching head)
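The disagreement comes down to which column is used as the sort key. A minimal sketch using the Special:Statistics figures quoted above; the `stats` dict and the `rank` helper are made up for illustration and are not part of pywikibot:

```python
# Figures copied from the Special:Statistics table quoted above.
stats = {
    'fr': {'all': 2006427, 'good': 252594},
    'en': {'all': 1903161, 'good': 568107},
}


def rank(stats, key):
    """Return language codes sorted by the given column, largest first."""
    return sorted(stats, key=lambda code: stats[code][key], reverse=True)


print(rank(stats, 'all'))   # ['fr', 'en']
print(rank(stats, 'good'))  # ['en', 'fr']
```

Sorting by 'good' puts 'en' first, which matches the `u'en' != u'fr'` assertion failure in the description.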

s23.org has been outdated for years, and wikistats 2.0 also uses "good" articles as its sorting order.[1]
The problem is that the statistics change over time, so we cannot assume the order always stays the same. Maybe we should remove the test that expects a specific language code in a specific position.

[1] https://wikistats.wmflabs.org//display.php?t=ws
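One way to keep a sort test without pinning a language code is to assert the ordering property itself. The sketch below is only an illustration: `languages_by_size` here is a hypothetical stand-in that takes a dict, not the real WikiStats method, and the sample table reuses the two Special:Statistics figures quoted earlier in this task.

```python
import unittest


def languages_by_size(table):
    """Hypothetical stand-in: codes ordered by 'good' count, descending."""
    return sorted(table, key=lambda code: int(table[code]['good']),
                  reverse=True)


class SortOrderTest(unittest.TestCase):

    """Check that results are sorted, not which language leads."""

    # Sample data only; a live test would fetch the current figures.
    table = {
        'fr': {'good': '252594'},
        'en': {'good': '568107'},
    }

    def test_sort(self):
        codes = languages_by_size(self.table)
        counts = [int(self.table[code]['good']) for code in codes]
        # Every code is returned exactly once.
        self.assertEqual(sorted(codes), sorted(self.table))
        # Counts never increase along the returned order.
        self.assertEqual(counts, sorted(counts, reverse=True))
```

Such a test keeps passing when two wikis swap places, and only fails if the sort itself is wrong.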

Got an additional error when testing with current release:

C:\pwb\GIT\core>pwb.py tests/wikistats_tests -v
tests: max_retries reduced from 25 to 1
WARNING: WikiStats: unicodecsv package required for using csv in Python 2; falling back to using the larger XML datasets.
test_csv (__main__.WikiStatsTestCase)
Test CSV. ... skipped 'unicodecsv not installed.'
test_sort (__main__.WikiStatsTestCase)
Test sorted results. ... FAIL
test_xml (__main__.WikiStatsTestCase)
Test XML. ... ok

======================================================================
FAIL: test_sort (__main__.WikiStatsTestCase)
Test sorted results.
----------------------------------------------------------------------
Traceback (most recent call last):
  File ".\tests\wikistats_tests.py", line 33, in test_sort
    self.assertIsInstance(top['total'], UnicodeType)
AssertionError: '38708706' is not an instance of <type 'unicode'>

----------------------------------------------------------------------
Ran 3 tests in 1.863s

The instance-of bug is quite strange.
I've added extra asserts (https://github.com/jayvdb/pywikibot-core/commit/03db2eedd2c60486ef066b94b24f314da1012123), and yet I can't reproduce it on Linux (local machine or Travis), and on Appveyor the first two builds (Python 2.6.6 and 2.7.2) both pass. Waiting for the other builds...

Change 275511 had a related patch set uploaded (by John Vandenberg):
Expand wikistats datatype tests

https://gerrit.wikimedia.org/r/275511

All Appveyor Windows environments pass these additional test_xml asserts, so I don't understand why it is failing for you :/

Maybe you have a broken version of unicodecsv?

C:\pwb\GIT\core>pwb.py tests/wikistats_tests -v
tests: max_retries reduced from 25 to 1
WARNING: WikiStats: unicodecsv package required for using csv in Python 2; falling back to using the larger XML datasets.
test_csv (__main__.WikiStatsTestCase)
Test CSV. ... skipped 'unicodecsv not installed.'
test_sort (__main__.WikiStatsTestCase)
Test sorted results. ... FAIL
test_xml (__main__.WikiStatsTestCase)
Test XML. ... FAIL

======================================================================
FAIL: test_sort (__main__.WikiStatsTestCase)
Test sorted results.
----------------------------------------------------------------------
Traceback (most recent call last):
  File ".\tests\wikistats_tests.py", line 34, in test_sort
    for key in top.keys()
AssertionError: False is not true

======================================================================
FAIL: test_xml (__main__.WikiStatsTestCase)
Test XML.
----------------------------------------------------------------------
Traceback (most recent call last):
  File ".\tests\wikistats_tests.py", line 68, in test_xml
    for key in data.keys()
AssertionError: False is not true

----------------------------------------------------------------------
Ran 3 tests in 2.010s

:(

I don't have unicodecsv installed, so the csv test is skipped.

Another installation with Python 2.7.10 fails too:

C:\pwb\core>pwb.py tests/wikistats_tests -v
tests: max_retries reduced from 25 to 1
WARNING: WikiStats: unicodecsv package required for using csv in Python 2; falling back to using the larger XML datasets.
WARNING: C:\pwb\core\pywikibot\version.py:100: DeprecationWarning: pywikibot.version.getversion_svn is deprecated; use getversion_svn_setuptools instead.
  (tag, rev, date, hsh) = vcs_func(_program_dir)

WARNING: C:\pwb\core\pywikibot\version.py:248: DeprecationWarning: pywikibot.version.svn_rev_info is deprecated; use getversion_svn_setuptools instead.
  tag, rev, date = svn_rev_info(_program_dir)

test_csv (__main__.WikiStatsTestCase)
Test CSV. ... skipped 'unicodecsv not installed.'
test_sort (__main__.WikiStatsTestCase)
Test sorted results. ... FAIL
test_xml (__main__.WikiStatsTestCase)
Test XML. ... ok

======================================================================
FAIL: test_sort (__main__.WikiStatsTestCase)
Test sorted results.
----------------------------------------------------------------------
Traceback (most recent call last):
  File ".\tests\wikistats_tests.py", line 33, in test_sort
    self.assertIsInstance(top['total'], UnicodeType)
AssertionError: '38708706' is not an instance of <type 'unicode'>

----------------------------------------------------------------------
Ran 3 tests in 2.094s

FAILED (failures=1, skipped=1)

C:\pwb\core>

I have not applied your additional test patch set here.

Thanks for testing this.
So the problem is definitely in the XML implementation, but that is a standard module.

Appveyor is testing using 2.6.6 (64bit only), 2.7.2 (64bit only) and 2.7.11 (32 and 64bit).

If your Python version is different, I can add your current Python version to Appveyor to see if that is the problem.

My first Python version is: '2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)]'
The second is: '2.7.10 (default, May 23 2015, 09:44:00) [MSC v.1500 64 bit (AMD64)]'

Additional information about xml.etree:

  # $Id: __init__.py 3375 2008-02-13 08:05:08Z fredrik $
  # elementtree package

I found the same error with ElementTree instead of cElementTree :(
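A possible explanation, offered as an assumption rather than something confirmed in this thread: on Python 2, ElementTree can hand back a plain byte `str` for ASCII-only text, which would fail an `assertIsInstance(..., UnicodeType)` check. The sketch below shows one way to coerce every field to text after parsing. It is not the actual change in 275780; `to_text`, `rows_as_text`, and the sample XML layout are all made up for illustration.

```python
import xml.etree.ElementTree as ET

text_type = type(u'')  # unicode on Python 2, str on Python 3


def to_text(value):
    """Return value as a text string, decoding byte strings as UTF-8."""
    if isinstance(value, bytes):  # bytes is str on Python 2
        return value.decode('utf-8')
    return value


def rows_as_text(xml_source):
    """Yield one dict per <row> element with text keys and values."""
    for row in ET.fromstring(xml_source).iter('row'):
        yield dict((to_text(field.tag), to_text(field.text or ''))
                   for field in row)


# Made-up sample in a made-up layout; not the real WikiStats XML schema.
SAMPLE = ('<table><row><prefix>xx</prefix>'
          '<total>38708706</total></row></table>')

rows = list(rows_as_text(SAMPLE))
print(all(isinstance(value, text_type)
          for row in rows for value in row.values()))
```

Coercing at the point where the data is read keeps the tests strict while making the result independent of which parser or Python build produced it.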

This comment was removed by Xqt.

Change 275511 merged by jenkins-bot:
Expand wikistats datatype tests

https://gerrit.wikimedia.org/r/275511

Change 275780 had a related patch set uploaded (by Xqt):
[bugfix] force wikistats fields to unicode when using xml format

https://gerrit.wikimedia.org/r/275780

Change 269730 merged by jenkins-bot:
[bugfix] Additional wikistats tests

https://gerrit.wikimedia.org/r/269730

jayvdb renamed this task from "WikiStats sorting algorithm causing unexpected largest wikisource" to "English Wikisource has more good pages than French Wikisource, breaking the WikiStats tests for the largest wikisource". Mar 8 2016, 7:34 PM
jayvdb closed this task as Resolved.
jayvdb assigned this task to Xqt.