
[5hrs] Add corpus of test pages
Closed, Resolved · Public

Description

Even though the Reading Web team is working on covering the extension with unit tests – and with acceptance tests where it makes sense – it's still valuable to have a set of common and edge cases to exercise the codebase with, and, if possible, to reproduce bugs with.

Here are the known features and the corresponding articles on Wikipedia:

Feature | Page (link to a revision) | Notes
Redirect page | Obama | A page that redirects to another
Hat note before lead section | Barack Obama | See top of the article
Lead section | Barack Obama | Lead section
Infobox | Barack Obama | Infobox in lead section, image inside infobox
Navbox | Barack Obama | Navbox in lead section
Images | Barack Obama | Throughout the article
Coordinates before lead section | San Francisco | See the top of the article
Coordinates inside infobox | San Francisco | See the infobox in the lead section
Table | San Francisco | Table of data
References | San Francisco | See section "References" and the linked references in the body of the article
Notes | San Francisco | See section "Notes"
Sister project links | San Francisco | See section "External links"
Geographic location | San Francisco | See section "External links"
Hat note | San Francisco | See section "History"
Lists in 1st paragraph | Planet | See the first paragraph
Math formula | Vector space |
Map | Kochi_Metro | See the Planning section and click on the "map" link under the image on the right
Interactive graph | Any page from https://en.wikipedia.org/wiki/Category:Pages_with_graphs | Since graphs are not widely deployed, we can come back to it
Embedded audio | Announcer | See towards the bottom
Embedded video | 1984_State_of_the_Union_Address |
Hieroglyphs | Egyptian_hieroglyphs |
Article linked to Wikidata | Barack_Obama | Wikidata entry: https://www.wikidata.org/wiki/Q76
Article in an RTL language | San Francisco in Arabic |
Page that belongs to some categories | Barack Obama | See towards the bottom of the article
? | |

AC

  • Test pages should be useful to test various Reading Web maintained extensions.
  • Each page should be an importable XML file.
  • Besides a variety of articles we need to create pages that are featured, protected, etc. (TODO: what else).

Related: T62116

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · View Herald Transcript · Jun 10 2016, 8:41 AM
• jhobs triaged this task as Normal priority. Jun 15 2016, 3:24 PM
• jhobs moved this task from Incoming to Triaged but Future on the Readers-Web-Backlog board.
• jhobs added a project: Technical-Debt.
MaxSem moved this task from Unsorted to Test coverage on the Technical-Debt board. Jul 17 2016, 7:05 PM
Jdlrobson updated the task description. Jul 20 2016, 8:53 PM

@phuedx do you mind if we repurpose this task to include other extensions we maintain? The corpora would be useful to test not only HoverCards, but also MobileFrontend with the Math extension enabled, for example.

Change 315061 had a related patch set uploaded (by Phuedx):
Add corpus of pages for manual testing

https://gerrit.wikimedia.org/r/315061

bmansurov updated the task description. Oct 10 2016, 2:37 PM
bmansurov updated the task description. Oct 17 2016, 4:12 PM
phuedx updated the task description. Oct 17 2016, 4:14 PM
MBinder_WMF added a subscriber: MBinder_WMF.

@phuedx Please email this to the list for analysis, per the meeting.

Questions from the Story Prioritization meeting:

Some things I'd like to resolve before we work on this:

  • I'm still not 100% clear on what should happen to T104561 - do we close it, repurpose it, or sail it out into the Phabricator ocean?
  • Putting the XML inside the Hovercards extension seems wrong to me - I'd prefer it in a separate repo or gist. What are the advantages of bundling it inside the extension rather than fetching it via curl?

Questions from the Story Prioritization meeting:

I think this task is about identifying the articles and is a subset of the other task. T104561 is a longer-term goal, and this task is specific and achievable faster.

  • @jhobs: Should this live in a separate extension or be hosted somewhere other than the codebase?

I think it should definitely be outside any existing repository because as I see it, the list of articles will be useful for testing multiple extensions.

bmansurov updated the task description. Nov 4 2016, 9:19 PM
phuedx added a comment. Nov 7 2016, 9:39 AM

OTOH having everything you need to test code in the same repository as that code also makes sense.

How about we make an npm module which packages up all these wiki pages and then inserts them into the wiki by invoking the relevant commands? This way it's reusable in other parts of our repos while still being part of the code repo.

With regards to T104561, given the high value of having such a thing, I feel like the additional work is useful. IMO a third party sysadmin should not have to worry about developer artifacts in their install.

With regards to T104561, given the high value of having such a thing, I feel like the additional work is useful. IMO a third party sysadmin should not have to worry about developer artifacts in their install.

What I should've said in T137527#2775622 was that I don't see why these test pages should be treated differently from unit, functional, integration, and browser tests.

OTOH having everything you need to test code in the same repository as that code also makes sense.

I agree, maybe a git submodule that we can reuse in all of our repositories.
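
For what it's worth, a minimal sketch of how the submodule route could look, assuming a shared repository of exported XML pages (the repository URL and directory below are hypothetical placeholders) and MediaWiki's standard importDump.php:

# Hypothetical shared repo of exported test pages; URL and path are placeholders.
git submodule add https://gerrit.wikimedia.org/r/some/test-pages-repo tests/pages
git submodule update --init

# Import every XML dump it contains with the standard maintenance script.
for dump in tests/pages/*.xml; do
  php maintenance/importDump.php "$dump"
done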

To reiterate from the Sprint Prioritization meeting: I'm in favor of creating an npm module and including it as a dev dependency of any extension we want to use the same test pages for.

Decided in Prioritization that this task needs sharper scoping before being estimated

ovasileva renamed this task from Add corpus of test pages to [5hrs] Add corpus of test pages. Nov 30 2016, 5:38 PM
ovasileva updated the task description.
pmiazga updated the task description. Nov 30 2016, 5:40 PM
bmansurov removed bmansurov as the assignee of this task.
bmansurov moved this task from To Do to Doing on the Reading-Web-Sprint-87-♨️😭 board.
bmansurov updated the task description. Dec 5 2016, 10:14 PM
bmansurov updated the task description.
bmansurov updated the task description. Dec 6 2016, 11:15 PM
bmansurov updated the task description.

I'll use this JSON file in my script (that I haven't written yet):

(The data is taken from the table in the description, with slight changes to avoid getting different versions of the same article).
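
The attached file isn't reproduced here, but purely as an illustration, one hypothetical shape for it would be a list of entries mapping each feature from the table to a page title and a pinned revision ID. The file name, field names, and revision IDs below are placeholders, not the real attachment:

# Hypothetical structure only; not the actual attachment.
cat > pages.json <<'EOF'
[
  { "feature": "Redirect page", "page": "Obama", "revid": 123456789 },
  { "feature": "Infobox", "page": "Barack Obama", "revid": 123456790 }
]
EOF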

Progress so far: I've successfully patched and compiled MWDumper to work with the latest MediaWiki. It's fast, but the imported page has malformed HTML. Here is a screenshot:

Since the above program only imports the page content and doesn't update the links tables, I had to run the rebuildall.php maintenance script. It took too long, so I stopped it. I haven't looked into importing links table dumps yet.

Jdlrobson added a subscriber: Tgr. Dec 7 2016, 11:10 PM

You'll need to parse templates (rather than import them) to avoid malformed pages (remember that some users don't have Scribunto installed, or exactly the same set of extensions).

I think @Tgr suggested this on one of the older vagrant specific tasks.

Tgr added a comment. Dec 8 2016, 12:03 AM

That would be T118549.

Update:

  1. I've tried exporting the templates of a page first, i.e. https://en.wikipedia.org/w/api.php?action=query&revids=747296630&generator=templates&export&exportnowrap&gtllimit=500, and importing it using importDump.php. This was fast, a couple of minutes. I then imported the page itself, i.e. https://en.wikipedia.org/w/api.php?action=query&revids=747296630&export&exportnowrap, which took about 10 minutes. (A condensed sketch of these two steps is below.)
  2. I've also tried importing a preparsed article (https://en.wikipedia.org/wiki/Barack_Obama?action=raw&templates=expand), but when it's saved as a page, I still get the same malformed HTML as in T137527#2855808.
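
For anyone repeating step 1, here's a condensed sketch of that export-then-import flow, assuming it's run from the root of a MediaWiki checkout (the revision ID is the one used in the URLs above):

# Export the templates used by the revision, then the revision itself.
curl -L -o templates.xml "https://en.wikipedia.org/w/api.php?action=query&revids=747296630&generator=templates&export&exportnowrap&gtllimit=500"
curl -L -o article.xml "https://en.wikipedia.org/w/api.php?action=query&revids=747296630&export&exportnowrap"

# Import the templates first so the article can find them, then the article.
php maintenance/importDump.php templates.xml
php maintenance/importDump.php article.xml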

Also, simple edits are taking too long on my machine (even after increasing the Lua and shell memory limits).

@Tgr, how can I find out which extensions are needed for properly displaying a page? Say this page: https://en.wikipedia.org/wiki/Barack_Obama.

Tgr added a comment (edited). Dec 8 2016, 3:25 AM

@Tgr, how can I find out which extensions are needed for properly displaying a page? Say this page: https://en.wikipedia.org/wiki/Barack_Obama.

I don't know if you can. You can use the parse tree to see which XML-style parser tags are used on the page:

$ echo 'cat //ext/name/text()' | xmllint --shell <(curl -s 'https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Barack+Obama&prop=parsetree&formatversion=2' | jq --raw-output .parse.parsetree) | grep -v '^[ /]'  | sort | uniq
pre
ref

but that won't help you with parser extensions that are processed before wikitext parsing (such as Scribunto).
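
Not something @Tgr mentioned, but as a rough complement to the above: the same API can list a page's transclusions restricted to the Module namespace (828 on Wikimedia wikis), which at least shows whether Scribunto/Lua modules are in play, though not which extensions render them. A sketch:

# List templates transcluded from the Module namespace (Scribunto modules).
curl -s 'https://en.wikipedia.org/w/api.php?action=query&prop=templates&titles=Barack+Obama&tlnamespace=828&tllimit=500&format=json&formatversion=2' \
  | jq -r '.query.pages[0].templates[].title'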

Thanks, @Tgr.

So, to wrap up what I've done, here is some tangible stuff that others can use.

XML dumps

Dump of templates (252 in total) that are used in the articles we've identified (see description):


Command to get the dump:

curl -L -o templates.xml "https://en.wikipedia.org/w/api.php?action=query&revids=608505859|747296630|747851639|749521518|751950456|752422162|731883794|733582458|748489039|21621292&generator=templates&export&exportnowrap&gtllimit=500"

Dump of articles without templates (10 in total):

curl -L -o articles.xml "https://en.wikipedia.org/w/api.php?action=query&revids=608505859|747296630|747851639|749521518|751950456|752422162|731883794|733582458|748489039|21621292&export&exportnowrap"

SQL

If you want to convert the above XML files into SQL, you can use the following compiled version of mwdumper:

. The file has been tested to work with JRE 1.8.0_111 and JRE 1.7.0_79. I had to remove this line, as the page_counter field has been removed in MW 1.25.

You can use the above jar file like so:

java -jar mwdumper-1.25.jar --format=sql:1.5 templates.xml > templates.sql
java -jar mwdumper-1.25.jar --format=sql:1.5 articles.xml > articles.sql

Here are the converted files:

and .

In order to import the above generated SQL, you'll need to empty some tables:

DELETE FROM page; DELETE FROM text; DELETE FROM revision;
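
As a precaution of my own (not part of the original steps), you could back those three tables up first so the deletes are reversible:

# Back up the tables before emptying them; restore later with:
#   mysql -u username -p database_name < backup.sql
mysqldump -u username -p database_name page text revision > backup.sql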

Now import:

mysql -u username -p database_name < templates.sql
mysql -u username -p database_name < articles.sql

You also need to update the links tables (see mwdumper documentation), but I haven't gotten that far.
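
For reference, the links tables can be rebuilt with MediaWiki's own maintenance scripts (this is the slow step mentioned earlier); a sketch, run from the MediaWiki root:

# Rebuilds links, text index, and recent changes; thorough but slow.
php maintenance/rebuildall.php
# Narrower alternative that only refreshes the links tables.
php maintenance/refreshLinks.php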

Even after importing the above articles, you'll likely have some problems with malformed HTML. That's mostly happening because of the infobox templates.

Fin.

If you want to convert the above XML files into SQL, you can use the following compiled version of mwdumper:

Any reason we'd want to do all this? Why not use php maintenance/importDump.php for example?

In order to import the above generated SQL, you'll need to empty some tables:

DELETE FROM page; DELETE FROM text; DELETE FROM revision;

This seems scary... could you explain why that's necessary? Is there a safer less destructive way to do this?


Tgr added a comment (edited). Dec 12 2016, 10:10 PM

I have just implemented something similar as a puppet role in https://gerrit.wikimedia.org/r/#/c/326394 (but linked it to the wrong task). Need help from someone familiar with rspec (and ideally puppetlabs_spec_helper) to finish it up.

It exports pages, not dumps, but it would be fairly easy to write an identical one for dumps. I still think that's the wrong way to go since you cannot do template expansion on the server side and will have to go down into the subtemplate/parserfunction/Scribunto/whatever else rabbit hole.

See the doc of that role about malformed HTML.

Any reason we'd want to do all this? Why not use php maintenance/importDump.php for example?

importDump.php is slow.

This seems scary... could you explain why that's necessary? Is there a safer less destructive way to do this?

I didn't investigate other options as the task was timeboxed. Basically, without clearing your tables there will be ID conflicts and this seemed an acceptable solution for a development wiki.

@Tgr, thanks for working on this.

@phuedx, as the writer of the task, any questions?

I think that helping @Tgr get rMWVA17a4e396e098: Add mediawiki::import::url and rMWVA58eebe6fbafe: [WIP] Add tests for custom pupper functions in I0cf57632f merged would be A Good Thing™.

However, I'm aware that not all Reading Web engineers – nor WMF engineers and volunteers – use MediaWiki Vagrant. @Tgr's Puppet role doesn't use anything that isn't available in a standard MediaWiki installation, which should make it very easy to write an equivalent script for folk who can't or won't use MWV.

Questions:

  • Is duplicating a little of @Tgr's effort a reasonable cost for a tool that's useful to everyone?
    • For example:
cd /path/to/mediawiki/extensions/MobileFrontend
./scripts/import_test_pages
  • If so, then by how many hours should we increase the length of this spike?

For me, Vagrant support should be the first goal. It's what we use on staging and it's a reason for people to use Vagrant (I may even start using it again if that's an option!)
I think this spike is nebulous enough that we could call the work so far done in terms of this sprint and revisit in a grooming session.

bmansurov removed bmansurov as the assignee of this task. Dec 15 2016, 5:26 PM
bmansurov added subscribers: pmiazga, Jhernandez.

@jhobs, @Jhernandez, or @pmiazga would one of you please sign off on the task?

IMO it's signed off. We shouldn't resolve this as there's more to do here.

• jhobs closed this task as Resolved. Dec 15 2016, 6:28 PM
• jhobs claimed this task.

In order to properly track our sprint work and because this task was timeboxed, let's go ahead and resolve this.

IMO it's signed off. We shouldn't resolve this as there's more to do here.

@Jdlrobson, can you make a new task specifically targeting what more you think there is to do here (Vagrant support, I'm guessing)?

I didn't write the task, so I defer to @phuedx, but I'm very confused about what has been done here and what has been resolved. The description seems very vague, so it's not clear what the exact expected outcome was and whether it has been met. I'm also not sure how it maps to Popups, and my life is no easier when it comes to importing stock articles. The A/C also doesn't seem to be met.

Just looking at the acceptance criteria:

Test pages should be useful to test various Reading Web maintained extensions.

Was the goal to identify test pages or to provide easy access to them or both?

Each page should be an importable XML file.

Where are the XML files? I see some zips hidden inside a comment, but how do I find them when I need them and the inevitable happens - I lose this Phabricator task URL? Are they linked from https://mediawiki.org/Reading ?

Besides a variety of articles we need to create pages that are featured, protected, etc. (TODO: what else).

Is the TODO resolved? Have we done this?

Other question:

phuedx added a comment (edited). Jan 3 2017, 12:21 PM

@Jdlrobson: I'm not sure those AC are actually relevant 😕

Thanks @phuedx that seems more actionable.

Change 315061 abandoned by Phuedx:
Add corpus of pages for manual testing

Reason:
See the discussion in T137527.

https://gerrit.wikimedia.org/r/315061