
test_detect_site is failing on Travis
Closed, Resolved · Public

Description

____________________ SiteDetectionTestCase.test_detect_site ____________________
self = <tests.site_detect_tests.SiteDetectionTestCase testMethod=test_detect_site>
    def test_detect_site(self):
        """Test detection of MediaWiki sites."""
        self.assertSite('http://www.hrwiki.org/index.php/$1')  # v 1.15
        self.assertSite('http://www.proofwiki.org/wiki/$1')
        self.assertSite(
            'http://www.ck-wissen.de/ckwiki/index.php?title=$1')
        self.assertSite('http://en.citizendium.org/wiki/$1')
        self.assertSite(
            'http://www.lojban.org/tiki/tiki-index.php?page=$1')
        self.assertSite('http://www.wikichristian.org/index.php?title=$1')
        self.assertSite('https://en.wikifur.com/wiki/$1')
        self.assertSite('http://bluwiki.com/go/$1')
        self.assertSite('http://kino.skripov.com/index.php/$1')
>       self.assertAllPass()
tests/site_detect_tests.py:183: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/site_detect_tests.py:115: in assertAllPass
    self.assertEqual(set(self.passes), set(self.all) - set(self.skips))
E   AssertionError: Items in the second set but not the first:
E   'http://www.wikichristian.org/index.php?title=$1'

See: https://travis-ci.org/wikimedia/pywikibot-core/jobs/177955824#L6321-L6345

Event Timeline

From Travis logs:

tests/site_detect_tests.py::SiteDetectionTestCase::test_detect_site WARNING: Http response status 500
failure http://www.wikichristian.org/index.php?title=$1 on <type 'exceptions.RuntimeError'>: Unsupported url: http://www.wikichristian.org/wiki/en/index.php?title=Main_Page

It actually is responding with 500 from time to time.

wikichristian500.png (300×1 px, 53 KB)

wikichristian500content.png (348×684 px, 37 KB)

I've developed a simple script that requests wikichristian and builds simple statistics on how often it fails.

Source of the script:

from datetime import datetime
from datetime import timedelta

from pywikibot.site_detect import MWSite
from requests.exceptions import ReadTimeout

attempt_counter = 0
fail_counter = 0
max_successive_fail_counter = 0
max_outage = timedelta()
curr_successive_fail_counter = 0
curr_outage = timedelta()

global_start = datetime.now()
outage_start = None

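# Hammer the site in an endless loop; stop with Ctrl+C to print the statistics.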
while True:
    try:
        try:
            attempt_counter += 1
            _ = MWSite('http://www.wikichristian.org/index.php?title=$1')
        except (RuntimeError, ReadTimeout):
            if curr_successive_fail_counter == 0:
                outage_start = datetime.now()

            fail_counter += 1
            curr_successive_fail_counter += 1
        else:
            if curr_successive_fail_counter > 0:
                curr_outage = datetime.now() - outage_start
            max_outage = max(max_outage, curr_outage)

            max_successive_fail_counter = max(max_successive_fail_counter,
                                              curr_successive_fail_counter)
            curr_successive_fail_counter = 0
    except KeyboardInterrupt:
        break

global_stop = datetime.now()

print('''
Start testing at: {}
End testing at: {}
Testing time: {}
Attempt count: {}
Failure count: {}

Longest outage: {}
Maximum successive fails: {}
'''.format(global_start,
           global_stop,
           global_stop - global_start,
           attempt_counter,
           fail_counter,
           max_outage,
           max_successive_fail_counter))

I'm starting to run it now. We will see what we can do here.

OK, after a few hours of running it I've got:

Start testing at: 2016-12-10 10:32:19.391190
End testing at: 2016-12-10 16:26:55.108310
Testing time: 5:54:35.717120
Attempt count: 10857
Failure count: 2439

Longest outage: 0:04:11.880775
Maximum successive fails: 8

My current plan to mitigate this failure is to split the tests (so that every site is checked in a separate test) and retry each test 10 times before raising a failure.

Instead of retrying, which lengthens the test duration, I think it would be OK if we just skipped the test. We already do this for 503 errors. I'll upload a patch.
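
Roughly, the idea is something like this (just a sketch; detect_or_skip is a hypothetical helper name, not the code in the actual patch):

from pywikibot.exceptions import ServerError
from pywikibot.site_detect import MWSite


def detect_or_skip(self, url):
    """Return the detected MWSite, skipping the current test on a server error."""
    try:
        return MWSite(url)
    except ServerError as e:
        # A 500/503 answer is a problem on the remote site, not in pywikibot,
        # so mark the test as skipped instead of failed.
        self.skipTest('Server error while detecting {}: {}'.format(url, e))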

Change 326313 had a related patch set uploaded (by Dalba):
site_detect.py: Raise ServerError on 500 response

https://gerrit.wikimedia.org/r/326313

I was thinking about such a solution and figured out that I want to go with a bolder move here. I'm in the middle of rewriting the whole test module.

I will make separate tests for every site and create a decorator that wraps a test in a for loop (in such a manner that we will be able to retry the test a given number of times).
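
A minimal sketch of such a decorator (the name retry and the attempt count are assumptions, not the exact code from the patch):

from functools import wraps


def retry(attempts=10):
    """Retry a flaky test method up to `attempts` times before letting it fail."""
    def decorator(test_method):
        @wraps(test_method)
        def wrapper(self, *args, **kwargs):
            for attempt in range(attempts):
                try:
                    return test_method(self, *args, **kwargs)
                except AssertionError:
                    # Out of retries: let the failure propagate to the test runner.
                    if attempt == attempts - 1:
                        raise
        return wrapper
    return decorator

Each per-site test then becomes a one-liner decorated with @retry(), e.g. a test that just calls self.assertSite('http://www.hrwiki.org/index.php/$1').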

I also removed the whole unnecessary logic with set operations - we don't need that, as we just want to test whether a site is detected correctly, and that is perfectly served by a one-site, one-unittest approach.

In the end I also removed the whole test_IWM thing, because that wasn't a test: there were no assertions at all, it just iterated over some set of sites, populated a few sets (all, pass, error, fail etc.) and at the end printed some simple statistics. It was not able to fail, so there is no reason to run it on every change in the CI environment.

BTW I'm merging your change, as raising this error here is the right direction for PWB, but I believe it falsifies the test here (because this page exists, and what does it even mean that we skip a test on ServerError?)

Change 326313 merged by jenkins-bot:
site_detect.py: Raise ServerError on 500 response

https://gerrit.wikimedia.org/r/326313

Change 326358 had a related patch set uploaded (by Magul):
Fix www.wikichristian.org detection test

https://gerrit.wikimedia.org/r/326358

@Dalba I have pushed a patch and would be grateful for a review.

The patch could be seen as a little bit controversial, as it contains a rewrite of almost the whole site_detect_tests.py, but I believe there were some design flaws in the previous tests, so I simplified them. In my version it simply detects sites (or fails while doing so) and does it in a dead simple way (without these self.all, self.passes, self.skips, self.failures and self.errors).

It looks like http://www.lojban.org/tiki/tiki-index.php?page=$1 is also causing a problem. I will check how many retries we should provide here.

OK, I've got results from my script, but I think that there is something wrong here:

Start testing at: 2016-12-13 01:17:18.864291
End testing at: 2016-12-13 09:27:16.940101
Testing time: 8:09:58.075810
Attempt count: 35293
Failure count: 334

Longest outage: 0:00:36.897220
Maximum successive fails: 165

It looks like it requires further investigation.

OK, I added some more statistics about the failing test on www.lojban.org:

from datetime import datetime
from datetime import timedelta

from pywikibot.site_detect import MWSite
from pywikibot.exceptions import ServerError
from requests.exceptions import ConnectionError, ReadTimeout

attempt_counter = 0
fail_counter = 0
max_successive_fail_counter = 0
max_outage = timedelta()
curr_successive_fail_counter = 0
curr_outage = timedelta()

global_start = datetime.now()
outage_start = None

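# Histogram: length of a failure streak -> how many times a streak of that length occurred.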
fails_statistics = {}

while True:
    try:
        try:
            attempt_counter += 1
            _ = MWSite('http://www.lojban.org/tiki/tiki-index.php?page=$1')
        except (AttributeError, ConnectionError, ReadTimeout, RuntimeError, ServerError):
            if curr_successive_fail_counter == 0:
                outage_start = datetime.now()

            fail_counter += 1
            curr_successive_fail_counter += 1
        else:
            if curr_successive_fail_counter > 0:
                curr_outage = datetime.now() - outage_start
            max_outage = max(max_outage, curr_outage)
            fails_statistics[curr_successive_fail_counter] = fails_statistics.get(
                curr_successive_fail_counter, 0) + 1

            max_successive_fail_counter = max(max_successive_fail_counter,
                                              curr_successive_fail_counter)
            curr_successive_fail_counter = 0
    except KeyboardInterrupt:
        break

global_stop = datetime.now()

print('''
Start testing at: {}
End testing at: {}
Testing time: {}
Attempt count: {}
Failure count: {}

Longest outage: {}
Maximum successive fails: {}

Fails statistics: {}
'''.format(global_start,
           global_stop,
           global_stop - global_start,
           attempt_counter,
           fail_counter,
           max_outage,
           max_successive_fail_counter,
           fails_statistics))

and after nearly 5 hours of running it I got:

Start testing at: 2016-12-13 20:21:57.746518
End testing at: 2016-12-14 01:07:24.801149
Testing time: 4:45:27.054631
Attempt count: 18649
Failure count: 134

Longest outage: 0:00:19.348121
Maximum successive fails: 68

Fails statistics: {0: 18506, 1: 6, 60: 1, 68: 1}

Given those results, I'm thinking about rerunning the test up to 100 times. Usually it will take only one run, but it turns out that 60-70 successive failures is not so uncommon.

Is there a particular reason/need to have lojban.org in the tests? If there isn't then simply removing it could make sense.

Also since I'm on mobile and can't leave this content in gerrit:

  • You have referenced the wrong parameter in the docstring for the decorator.
  • Many of the new docstrings don't include epydoc fields for their parameters.

Other than that and the 100 retries for lojban.org the patch looks good and feels like a clear improvement over the old structure.

Is there a particular reason/need to have lojban.org in the tests? If there isn't then simply removing it could make sense.

No, there is no more sophisticated reason than that it was already there. I'm removing this test.

Also since I'm on mobile and can't leave this content in gerrit:

  • You have referenced the wrong parameter in the docstring for the decorator.

Fixed in the new patch. I'm also adding an epydoc description for it.

  • Many of the new docstrings don't include epydoc fields for their parameters.

I feel that there is not much value in adding parameter information here. I would like to add it to assertSite and assertNoSite, but all tests accept self and return None or raise AssertionError (or at least they should). Do you think that we should copy it over to all tests' docstrings?

Other than that and the 100 retries for lojban.org the patch looks good and feels like a clear improvement over the old structure.

A new patch addressing your comments has been uploaded.

  • Many of the new docstrings don't include epydoc fields for their parameters.

I feel that there is not much value in adding parameter information here. I would like to add it to assertSite and assertNoSite, but all tests accept self and return None or raise AssertionError (or at least they should). Do you think that we should copy it over to all tests' docstrings?

It was only the new asserts I was thinking about, sorry for being vague before.

I should be able to do a proper review later today or tomorrow.

Change 326358 merged by jenkins-bot:
Fix www.wikichristian.org and remove www.lojban.org detection tests

https://gerrit.wikimedia.org/r/326358

OK, just after the merge and the failing builds on Travis I figured out some shortcomings of the new approach: we assert that the site is an instance of MWSite, but we don't catch all the exceptions that can happen during that assertion, so the newly created decorator doesn't get to swallow AssertionError 10 times - it just fails on any of AttributeError, ConnectionError, RuntimeError, ServerError or Timeout at the first try.
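
A rough sketch of that fix (the actual change linked below may differ in details):

from pywikibot.exceptions import ServerError
from pywikibot.site_detect import MWSite
from requests.exceptions import ConnectionError, Timeout


def assertSite(self, url):
    """Assert that the given URL points to a MediaWiki site."""
    try:
        self.assertIsInstance(MWSite(url), MWSite)
    except (AttributeError, ConnectionError, RuntimeError,
            ServerError, Timeout) as e:
        # Convert any expected network/detection error into an AssertionError
        # so that the retry decorator can catch it and rerun the test.
        raise AssertionError('Failed to detect {}: {!r}'.format(url, e))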

I will publish a patch this evening.

Change 327514 had a related patch set uploaded (by Magul):
Raise AssertionError on any failure in assertSite

https://gerrit.wikimedia.org/r/327514

Change 327514 merged by jenkins-bot:
Raise AssertionError on any failure in assertSite

https://gerrit.wikimedia.org/r/327514