In preparation for T140377, we should come up with a suite of test cases for InternetArchiveBot. This would basically be a long list of WikiText strings that include various citation templates, raw external links, and various well-formed and malformed versions thereof.
Description
Status | Assigned | Task
---|---|---
Resolved | Cyberpower678 | T120433 Migrate dead external links to archives
Resolved | Cyberpower678 | T140377 Build robust testing framework for InternetArchiveBot (epic)
Resolved | MusikAnimal | T140386 Create acceptance tests for IABot
Event Timeline
Added some very basic acceptance tests here:
https://github.com/wikimedia/Cyberbot_II/commit/d756c30466948fbf3ffe9f33a073de0628a97adf
We need to write a script to run these tests.
Pull request at https://github.com/cyberpower678/Cyberbot_II/pull/32/files
This is a work in progress in the sense that we need better test cases. I wasn't able to get the bot to run without {{dead link}} templates, so there's that. We should also go through past bug reports and try to add test cases for those.
In order for the acceptance tests to work, you also need to have your deadlink.config.local.inc.php set up properly: set $debug to true, set $debugPage to the appropriate page (see the comment above testPages()), and set $debugStyle to 'test'. Despite its name, $testMode should be false. I will document this.
With the most recent version of IABot, it's (properly) updating the date parameter of the {{wayback}} template, which comes from the Wayback API. This means we need some fairly complex regex to ensure these dates are not compared in the acceptance tests. We should triage its importance, or perhaps omit testing of wayback templates.
@MusikAnimal: Don't the citation templates have a similar issue with the archivedate parameter?
I think the archivedate is the date when the archive was made, so if we are given an accessdate (when the link was originally accessed), the archive date should be consistent. Even when no accessdate is given, I'm seeing the same archive date being chosen. For the "statistics.sk" examples it's going with an archive from 2007. Perhaps that was the most recent archive available; I'm not sure what the logic is. But the good news is that we just want to test the parser, so if we know certain URLs always have the same archive date, we can use those and fiddle with the syntax to cover the scenarios we want.
I ended up stripping out dates from the source and result, which effectively gets around the issue of them being different for each run. It's true that some dates should be retained (accessdate, for instance) and we are unable to test that, but overall we still have pretty good coverage. I've gone through all the bug reports on @Cyberpower678's talk page from July onward and have made test cases for them.
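For anyone following along, here is a rough sketch of the date-stripping idea described above. It is written in Python purely for illustration and is not the code in the PR (the bot itself is PHP). The parameter list `DATE_PARAMS` and the helper names `strip_dates` and `same_ignoring_dates` are my own assumptions, not anything taken from the repo.

```python
import re

# Assumed list of template parameters whose values change between runs.
# This list is hypothetical; adjust it to whatever the bot actually writes.
DATE_PARAMS = ("date", "archivedate", "archive-date", "accessdate", "access-date")

# Matches "|<param>=" followed by the value, up to the next "|" or "}".
_DATE_PARAM_RE = re.compile(
    r"(\|\s*(?:%s)\s*=)[^|}]*" % "|".join(re.escape(p) for p in DATE_PARAMS),
    re.IGNORECASE,
)


def strip_dates(wikitext: str) -> str:
    """Blank the value of every date-like template parameter."""
    return _DATE_PARAM_RE.sub(r"\1", wikitext)


def same_ignoring_dates(expected: str, actual: str) -> bool:
    """Compare two wikitext strings after removing volatile date values."""
    return strip_dates(expected) == strip_dates(actual)
```

With this approach, two results that differ only in, say, archivedate compare as equal, while a difference in any other parameter still fails the comparison. The trade-off is the one already mentioned: a date the bot should have left alone (accessdate, for instance) is blanked too, so a regression there would go unnoticed.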
PR ready at https://github.com/cyberpower678/Cyberbot_II/pull/32