
test_search_where is failing on Travis
Closed, ResolvedPublic

Description

FAIL: test_search_where (tests.site_tests.SearchTestCase)
Test the site.search() method with 'where' parameter.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/wikimedia/pywikibot-core/tests/site_tests.py", line 1405, in test_search_where
    list(self.site.search('wiki', total=10, where='text')))
AssertionError: Lists differ: [Page[82 chars](美利坚合众国宪法), Page(美国2013年[137 chars]釋)] != [Page[82 chars](美国2013年国情咨文), Page(美国宪法[138 chars]釋)]
First differing element 3:
[[zh:美利坚合众国宪法]]
[[zh:美国2013年国情咨文]]
  [Page(台灣關係法),
   Page(小氣財神),
   Page(明實錄/世宗肅皇帝實錄),
-  Page(美利坚合众国宪法),
   Page(美国2013年国情咨文),
   Page(美国宪法修正案),
   Page(美国权利法案),
   Page(美國1823年國情咨文),
+  Page(美國1845年國情咨文),
   Page(解放奴隶公告),
   Page(隸釋)]

See: https://travis-ci.org/wikimedia/pywikibot-core/jobs/177955827#L4709-L4731

Event Timeline

The default for the where parameter is text, so these two generators should return the same objects (and in fact they do — in any case, both return 57 elements on zh.wikisource). The question right now is whether MediaWiki returns them in a deterministic order.
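The order question matters because a direct list comparison is order-sensitive. A minimal sketch of an order-insensitive check (plain strings standing in for pywikibot Page objects; the real test would compare pages from site.search() with and without where='text') could look like this:

```python
# Sketch: order-insensitive comparison of two search result streams.
# Plain strings stand in for pywikibot Page objects here.
def same_results(gen_a, gen_b):
    """Return True when both generators yield the same items, ignoring order."""
    return sorted(gen_a) == sorted(gen_b)

default_order = iter(['台灣關係法', '小氣財神', '美国宪法修正案'])
text_order = iter(['美国宪法修正案', '台灣關係法', '小氣財神'])
print(same_results(default_order, text_order))  # prints True
```

Sorting materializes both generators, which is exactly why the 10000-result upper limit becomes a problem on large wikis: both streams must be exhausted before comparing.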

Change 327654 had a related patch set uploaded (by Magul):
Search all results and sort them

https://gerrit.wikimedia.org/r/327654

I don't know how many Wikimedia sites currently hit the 10000 search limit, but "wiki" is a fairly common word among wikis and I wouldn't be surprised if there are dozens of such sites. The current patch will probably feel slow and will be skipped on a number of sites. To avoid these consequences, I thought maybe we could use a less common phrase instead of "wiki". Would something like "aaaaa" work?

I've tested 707 sites (all production sites: commons, wikidata, wikipedia, wikivoyage, wiktionary, wikibooks, wikinews, wikiquote, wikisource, wikiversity). The whole test takes around 25 minutes.

I'm pasting here the 10 sites that take longest (these 10 include all 7 sites that have 10000 or more occurrences of "wiki"); the rest is available here: P4646

site               timedelta        counter
wikipedia:eu       0:00:10.526823   6890
wikivoyage:de      0:00:12.350152   6147
wikipedia:en       0:00:12.710430   10000
wikipedia:it       0:00:12.784469   7594
wikipedia:de       0:00:13.513148   10000
wikipedia:ru       0:00:14.093758   10000
wikidata:wikidata  0:00:14.928915   10000
wikipedia:pt       0:00:16.206093   10000
wikipedia:hu       0:00:21.264109   10000
wikipedia:ml       0:00:24.367443   10000

If you want to rerun the test, here's a script to do so: P4645 (besides pywikibot and the stdlib it uses tqdm, a progress-bar package; you can simply remove it from the script or install it from pip).
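The per-site timing behind the table above can be sketched roughly like this (run_search is a hypothetical stand-in for the actual pywikibot search call in P4645, and the tqdm progress bar is omitted):

```python
import datetime

def time_search(sites, run_search):
    """Time run_search(site) for each site; return (site, timedelta, count) rows."""
    rows = []
    for site in sites:
        start = datetime.datetime.now()
        count = sum(1 for _ in run_search(site))  # exhaust the result generator
        delta = datetime.datetime.now() - start
        rows.append((site, delta, count))
    # Slowest sites first, matching the table above.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

With pywikibot, run_search would be something like lambda site: site.search('wiki'); here it is kept abstract so the harness itself stays testable offline.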

Assuming that we do in fact exclude sites where searching for "wiki" hits 10000 occurrences, we can estimate that a typical run of this test will take no more than 25 seconds.

It's arguable whether that is long, especially in the perspective of the whole test suite we have (and I really doubt there's anybody who runs all the tests locally before pushing a change to Gerrit).

What do you think about that, @Dalba? Do you want to change the searched word to something else? What word would be universal enough to be useful for testing on all sites?

OK, I could not come up with a "universally useful" search term. Personally, I don't like waiting 20 seconds for such a test. I don't think it's essential enough to keep at this cost, especially since it is being skipped on some of the most important wikis (enwiki, dewiki and Wikidata). But as you said, this is debatable.

Although I'm not very happy about it, I'm convinced by the information above that the situation is not that bad, and it's OK by me if anyone wants to merge the change already.

P.S:
There is another possible solution if anyone wants to look into it further: the API provides a [[ https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bsearch | srqiprofile ]] parameter which can be set to empty, and hopefully that can give us deterministic results without needing to fetch the whole 10000 results.
But I have not tested it; maybe it does not work as I think it will, and we don't have that parameter implemented in our search method (yet).
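As an untested illustration only: since the search method did not expose that parameter yet, one could build the raw API request oneself. The sketch below just assembles the query parameters for the MediaWiki search module, with srqiprofile set to empty per the idea above; whether that actually yields deterministic ordering remains unverified.

```python
def build_search_params(term, profile=''):
    """Assemble MediaWiki API parameters for list=search with an explicit qiprofile."""
    return {
        'action': 'query',
        'list': 'search',
        'srsearch': term,
        'srwhat': 'text',
        'srqiprofile': profile,  # empty profile, hoped to give stable ranking
        'format': 'json',
    }

params = build_search_params('wiki')
```

The resulting dict could then be sent to any wiki's api.php with an ordinary HTTP GET to test the hypothesis without touching pywikibot's search method.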


Looking at this, it still seems as though two subsequent searches should return deterministic results (since they are ranked by the same profile), assuming the profile doesn't take previous searches into account. I did notice, though, that zh.wiki uses a different profile from en.wiki, so possibly that is the reason somehow.


Good point.

I don't know whether @Magul has tested the patch or not, but it seems to me that it does not resolve the issue:

(Testing on fawiki)

================================== FAILURES ===================================
____________________ SearchTestCase.test_search_where_text ____________________

self = <tests.site_tests.SearchTestCase testMethod=test_search_where_text>

    def test_search_where_text(self):
        """
            Test the site.search() method with 'where' parameter set to text.
    
            Upper limit of result returned from search is 10000 so if we want to
            have deterministic result we have to test search only on sites with
            lower number of results.
            """
        if len(list(self.site.search('wiki'))) >= 10000:
            self.skipTest('Search result is bigger then 10000')
        self.assertEqual(sorted(self.site.search('wiki')),
>                        sorted(self.site.search('wiki', where='text')))
E       AssertionError: Lists differ: [Page[996 chars]('آرامگاه خیام'), Page('آراگون'), Page('آرتاس [114102 chars]م)')] != [Page[996 chars]('آراگون'), Page('آرتاس منتیل'), Page('آرتاشس [114137 chars]م)')]
E       
E       First differing element 50:
E       Page('آرامگاه خیام')
E       Page('آراگون')
E       
E       Diff is 129706 characters long. Set self.maxDiff to None to see it.

site_tests.py:1423: AssertionError

(Testing on zhwiki)

================================== FAILURES ===================================
____________________ SearchTestCase.test_search_where_text ____________________

self = <tests.site_tests.SearchTestCase testMethod=test_search_where_text>

    def test_search_where_text(self):
        """
            Test the site.search() method with 'where' parameter set to text.
    
            Upper limit of result returned from search is 10000 so if we want to
            have deterministic result we have to test search only on sites with
            lower number of results.
            """
        if len(list(self.site.search('wiki'))) >= 10000:
            self.skipTest('Search result is bigger then 10000')
        self.assertEqual(sorted(self.site.search('wiki')),
>                        sorted(self.site.search('wiki', where='text')))
E       AssertionError: Lists differ: [Page[2032 chars]ge('Clang'), Page('Climateprediction.net'), Pa[52781 chars]问答')] != [Page[2032 chars]ge('Chromium'), Page('Clang'), Page('Climatepr[52763 chars]问答')]
E       
E       First differing element 108:
E       Page('Clang')
E       Page('Chromium')
E       
E       Diff is 65027 characters long. Set self.maxDiff to None to see it.

(second run on fawiki)

================================== FAILURES ===================================
____________________ SearchTestCase.test_search_where_text ____________________

self = <tests.site_tests.SearchTestCase testMethod=test_search_where_text>

    def test_search_where_text(self):
        """
            Test the site.search() method with 'where' parameter set to text.
    
            Upper limit of result returned from search is 10000 so if we want to
            have deterministic result we have to test search only on sites with
            lower number of results.
            """
        if len(list(self.site.search('wiki'))) >= 10000:
            self.skipTest('Search result is bigger then 10000')
        self.assertEqual(sorted(self.site.search('wiki')),
>                        sorted(self.site.search('wiki', where='text')))
E       AssertionError: Lists differ: [Page[628 chars]خرین دلدار او'), Page('آخرین دلدار او'), Page([114519 chars]م)')] != [Page[628 chars]خرین تعظیم'), Page('آخرین هواشکن'), Page('آخون[114425 chars]م)')]
E       
E       First differing element 34:
E       Page('آخرین دلدار او')
E       Page('آخرین تعظیم')
E       
E       Diff is 129465 characters long. Set self.maxDiff to None to see it.

site_tests.py:1423: AssertionError

(second run on zhwiki)

================================== FAILURES ===================================
____________________ SearchTestCase.test_search_where_text ____________________

self = <tests.site_tests.SearchTestCase testMethod=test_search_where_text>

    def test_search_where_text(self):
        """
            Test the site.search() method with 'where' parameter set to text.
    
            Upper limit of result returned from search is 10000 so if we want to
            have deterministic result we have to test search only on sites with
            lower number of results.
            """
        if len(list(self.site.search('wiki'))) >= 10000:
            self.skipTest('Search result is bigger then 10000')
        self.assertEqual(sorted(self.site.search('wiki')),
>                        sorted(self.site.search('wiki', where='text')))
E       AssertionError: Lists differ: [Page[2032 chars]ge('Clang'), Page('Climateprediction.net'), Pa[52779 chars]问答')] != [Page[2032 chars]ge('Chromium'), Page('Clang'), Page('Climatepr[52781 chars]问答')]
E       
E       First differing element 108:
E       Page('Clang')
E       Page('Chromium')
E       
E       Diff is 65090 characters long. Set self.maxDiff to None to see it.

I could not get the test to pass with the patch, but without it the test used to pass occasionally. Increasing the total limit seems to have increased the failure probability.

Shit, I hadn't tested it, just assumed that it would work. I will look further into it. :/

Change 330677 had a related patch set uploaded (by Dalba):
site_tests.py: Remove test_search_where_text and test_search_where_nearmatch

https://gerrit.wikimedia.org/r/330677

Change 330677 merged by jenkins-bot:
site_tests.py: Remove test_search_where_text and test_search_where_nearmatch

https://gerrit.wikimedia.org/r/330677

@Dalba found the reason for our problems and removed the two tests. I'm closing this as resolved.

Change 327654 abandoned by Magul:
Search all results and sort them

Reason:
test was removed

https://gerrit.wikimedia.org/r/327654