
Gather a set of sample articles to load on test/staging instances.
Closed, Declined (Public)

Description

We need a big sample of articles that is available on every test/staging instance and developer machine so that we can test the same cases and share the same test data.

Such test data should live in a single place so that we can refer to it as our canonical Web-Team-Backlog test data repository.
Suggested place: https://www.mediawiki.org/wiki/Reading/QA/Sample_articles


AC

  • We have documented where this lives and how to import it.
  • Such sample data is available in staging/test environments.
  • The test articles from T137527 are added to mw:Reading/QA/Sample_articles.
  • The mobilefrontend MediaWiki-Vagrant role imports a couple of these test pages.
  • Developers have been notified about such sample data.

Known Unknowns

  • Which pages should be imported by the role?


Event Timeline


Sampling is unnecessarily complex and you'll end up with something non-representative. Just use an entire smallish project in a language you understand, for instance simple.wiki (150 MB) or en.wikiquote (100 MB).

Sampling is unnecessarily complex and you'll end up with something non-representative.

I understand the request as having a sample of enwiki content to test, potentially hand-curated (i.e., making sure the Barack Obama article is in there) based on multiple factors like article popularity, article layout complexity, article length, etc. I don't think a full simple.wiki or enwikiquote would be as useful for those corner cases (though they might have their own interesting corner cases that are worth including as well!).

When people have to set up a wiki by cherry-picking content from another wiki (hence templates, modules, etc.), it usually drives them crazy. If you have dozens of person-hours to invest in this, or only care about a couple of pages, it may work.

@greg exactly. The existence of templates makes frontend dev unnecessarily complicated, and to test bugs it's useful to have the common templates available.

Examples of articles that I have historically had to import manually and that would be useful to bundle:

  • Article with more than one language
  • Article with infobox
  • Article with timeline tag
  • Article with clean up template
  • Article with navbox

We don't actually need an entire dump of a project... for one, it would run extremely slowly on a local vagrant instance. Essentially we need copies of the more commonly used templates/extension tags.

Change 226230 had a related patch set uploaded (by Robmoen):
WIP: Add testarticles role.

https://gerrit.wikimedia.org/r/226230

Note that pages imported via mediawiki::import_dump cannot be updated. mediawiki::import_text was created to work around this (and to have meaningful diffs). It also handles errors (such as two roles trying to create the same page) more nicely. For test cases and demo pages it is, IMO, the better tool.

For importing selected articles from enwiki, instead of checking in huge XML dumps, I would rather see MW-Vagrant become able to import wiki articles on the fly. Something like vagrant import-articles articlelist.txt would be much more flexible than every group working with vagrant creating their own testarticle role.
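
A rough shell sketch of what such an on-the-fly importer could do under the hood. The import-articles command itself does not exist; articlelist.txt (one title per line, underscores instead of spaces) is hypothetical; and whether MediaWiki-Vagrant's existing import-dump command accepts a local XML file like this is an assumption worth checking:

  # Hypothetical: for each listed title, fetch a current-revision XML export from
  # enwiki and hand it to MWV's import-dump command.
  while IFS= read -r title; do
    file="${title//\//_}.xml"   # keep subpage slashes out of the file name
    curl -s "https://en.wikipedia.org/wiki/Special:Export/${title}" -o "$file"
    vagrant import-dump "$file"
  done < articlelist.txt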

T48869 is more or less the same as this.

@Tgr Thanks for the knowledge drop. I agree this would be a better solution than having large dumps in the repo, and it would limit the number of roles. I'm guessing a hit to the API would be best to get the unparsed content? e.g. api.php?action=parse&page=PageName&prop=wikitext. Then import_text as a string?

Importing the same page again would then update it. To get correct diffs, it would need to import each revision from the local one up to the current one, though maybe that could be implemented later on. I'm keen to start working on this, so any additional guidance is appreciated.

I'm guessing a hit to the API would be best to get the unparsed content? e.g. api.php?action=parse&page=PageName&prop=wikitext.

I would guess the revisions API is slightly faster than the parse API (but might be wrong about that). If you are importing an article, you probably want all the dependent templates though, and neither can do that. You would need to get an XML dump via Special:Export (I don't think we have an API for that).
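
For concreteness, hedged examples of the three fetch routes mentioned above; the URLs and parameters are the standard MediaWiki web APIs, and Barack_Obama is just a placeholder title:

  # parse API: returns the page's wikitext
  curl -s 'https://en.wikipedia.org/w/api.php?action=parse&page=Barack_Obama&prop=wikitext&format=json'

  # revisions API: returns the content of the latest revision
  curl -s 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&titles=Barack_Obama&format=json'

  # Special:Export with the templates option: an XML dump that also pulls in
  # every template the page depends on
  curl -s -d 'pages=Barack_Obama&templates=1&curonly=1&action=submit' \
    'https://en.wikipedia.org/w/index.php?title=Special:Export' -o Barack_Obama.xml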

Worth keeping in mind that Vagrant struggles with complicated templates which build on lots of sub-templates and Lua modules. We had to create our own version of Commons' {{Information}} template for the CommonsMetadata role because MediaWiki just timed out trying to parse the template chains.

Then import_text as a string?

Puppet has a declarative syntax so you can't directly return values from commands (resources in the puppet lingo) in roles, but you could save the wikitext/XML dump to a file and then invoke import_text/import_dump on that file.
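
As a sketch of the "save the wikitext to a file" step (jq and the formatversion=2 API output are assumed; the target path under a role's files/ directory is made up):

  curl -s 'https://en.wikipedia.org/w/api.php?action=parse&page=Barack_Obama&prop=wikitext&format=json&formatversion=2' \
    | jq -r '.parse.wikitext' > puppet/modules/role/files/barack_obama.wikitext

A puppet resource could then point import_text at that file (or import_dump at a saved XML export).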

If you want to do this as a vagrant command instead of a puppet role, you would probably have an easier time conceptually but would have to reimplement the import logic in Ruby. I'm not really familiar with that; @dduvall might be a good person to ask.

Importing the same page again would then update it.

This would work with import_text but not import_dump (if you want to do it as a role - again, probably not problematic as a vagrant command). Vagrant applies roles every once in a while, and has to decide which operations to redo, so they all have bailouts to save time. import_dump just checks if the page already exists.

IMO if you import articles from a live wiki, updating is not a concern. The results are not going to be reproducible anyway, because the page context depends on the exact time of setting up your vagrant box; and boxes are not meant to last for long.

It seems roles are the way to go. Since labs-vagrant doesn't contain the mediawiki-vagrant plugin, commands like import-dump (and the potential for new commands) are not available there.
Going with a role, there is still the issue of how best to fetch the dump and import it. It seems a lot of effort has already gone into this without reaching an acceptable solution.

Jdlrobson added a subscriber: rmoen.

@rmoen said he's back to square one. @dduvall, would you or anyone you can recommend be free to work together and get something up and running here? I think it would benefit a lot of teams.

I suppose we could do something like this (for each page; below only Obama's page is given):

curl -d "pages=Barack_Obama&dir=desc&offset=2015-01-01T00:00:00Z&limit=5&templates&action=submit" "https://en.wikipedia.org/w/index.php?title=Special:Export" -o download.xml

This downloads the history of "Barack Obama" starting from January 1, 2015 and saves the next 5 revisions to download.xml (this makes sure that everyone has the same version rather than whatever the most up-to-date version of the page happens to be). We can then use mediawiki::import_dump to import the file.
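
For trying the import out by hand before puppetizing it, two hedged options; both assume a MediaWiki-Vagrant box, and the wiki name and the /vagrant/ path are the MWV defaults as far as I know:

  # From the host, using the MWV plugin command mentioned earlier:
  vagrant import-dump download.xml

  # Or inside the VM, using the core maintenance script:
  mwscript importDump.php --wiki=wiki /vagrant/download.xml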

Note that pages imported via mediawiki::import_dump cannot be updated.

I'm not sure why that matters in our case? We want to create a fixed set of articles so that anyone on the team can reproduce bugs on the same page. Come to think of it, it's preferable that these imported pages are not editable.

I think the next steps are to identify the pages we want to import and to write a puppet manifest (which I don't know how to do yet).

The following are the pages @Jdlrobson wanted in T62116#995367:

  • A page with a complete infobox, e.g. a clone of the Barack Obama page (including all templates)
  • A page for working with Wikidata, e.g. a clone of the Albert Einstein Wikidata item (installed in the wikidata instance) and the linked Wikipedia entry
  • A page with clean-up templates
  • A page with alternative languages, e.g. San Francisco
  • All existing wikidata.org property definitions

Feel free to add your suggested pages here.

Does the above sound reasonable?

I'm not sure why that matters in our case? We want to create a fixed set of articles so that anyone on the team can reproduce bugs on the same page. Come to think of it, it's preferable that these imported pages are not editable.

They are editable but cannot be updated by updating the dump file. That might or might not be problematic depending on how you plan to use them (e.g. if you plan to add new pages for regression tests then it's a bad thing; if you just want some random article so everyone uses the same blob of HTML for performance tests then there is no reason to update).

Personally, I would still recommend avoiding the XML dump and instead getting the article text via action=raw&templates=expand. MediaWiki templates can be pretty hard on Vagrant (plus you will need to install Lua and parserfunctions and who knows what else). Having the servers do the hard work is a lot more convenient (unless you want to test editing, in which case you will need those anyway).
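
A minimal example of that fetch (placeholder title; the output is the article's wikitext with all templates already expanded server-side):

  curl -s 'https://en.wikipedia.org/w/index.php?title=Barack_Obama&action=raw&templates=expand' \
    -o Barack_Obama.expanded.wikitext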

Jdlrobson edited a custom field.

Somewhat related: T26774 (expandtemplates does not work in <ref>)

@Jdlrobson any way of importing the article images too? Or is that impossible with the current proposed approach?

Change 226230 abandoned by Robmoen:
WIP: Add testarticles role.

https://gerrit.wikimedia.org/r/226230

Yup, if you have wgInstantCommons on (which I think is the default), images will just work.

There is no easy way to import images in the strict sense, short of using the API to get a list of them and then uploading them one by one. That could be scripted and puppetized, but it would be a lot of work, and it would be fragile because your vagrant box would then have to deal with thumbnailing huge images, converting various image formats, and so on. If you are not specifically testing media handling capabilities, you are probably not interested in that. Instead, the source wiki can just be defined as a remote file repository, so images would still show up but would be thumbnailed remotely.
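
For reference, a minimal way to switch that on in MediaWiki-Vagrant: the settings.d/ drop-in directory is the MWV convention for local config, and the underlying MediaWiki setting is $wgUseInstantCommons:

  # Serve images from Commons as a remote file repository instead of importing them.
  printf '<?php\n$wgUseInstantCommons = true;\n' > settings.d/10-InstantCommons.php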

@Jdlrobson @rmoen @bmansurov I'd like to move this so that we can work on it in the future, but I'm not sure there's a clear path to achieving the acceptance criteria I outlined.

Can you have a look and discuss whether we're ready to implement this, or whether it needs more discussion? What can and can't be accomplished?

There is a clear path forward => T107770 :)

Jhernandez lowered the priority of this task from High to Medium. Aug 14 2015, 9:18 AM
Jhernandez updated the task description. (Show Details)

Split the vagrant-specific part of this into T118549. I intend to work on it if no one else picks it up, but not anytime soon.

Jhernandez lowered the priority of this task from Medium to Low. Jul 27 2016, 4:14 PM
Jhernandez moved this task from Incoming to Triaged but Future on the Web-Team-Backlog board.

I relaxed the constraint that these pages also be available on BC.

The plan ("implementation notes") was superseded by mediawiki::import::url (see T118549: Import live wiki pages into MediaWiki-Vagrant ).

It seems T156408 might make this proposal redundant as we'd have all production articles for testing purposes...

It seems T156408 might make this proposal redundant as we'd have all production articles for testing purposes...

For MobileFrontend, yes… but not for PP and RelatedArticles.

It's my understanding we can already tweak PP to use a RESTBase URL other than local (which is how Baha tested the RESTBase integration)
We would have to do the same for RelatedArticles (getting that on RESTBase would be a good idea..)

(and the ContentProvider can be easily tweaked to run on the desktop skin using onOutputPageBeforeHTML...)

It's my understanding we can already tweak PP to use a RESTBase URL other than local (which is how Baha tested the RESTBase integration)
We would have to do the same for RelatedArticles (getting that on RESTBase would be a good idea..)

(and the ContentProvider can be easily tweaked to run on the desktop skin using onOutputPageBeforeHTML...)

There's definitely the facility to make our codebases load content from a production server but should we? It's easy/familiar (we write code all the time!) but it's inherently awkward, right?

mediawiki::import::url makes adding articles to the environment easy and repeatable for developers who use MWV, and doesn't require any additions to any codebase. Now, I understand that not everyone uses MWV for local development, but our staging server and other temporary servers hosted on Wikimedia Labs do.

I'd argue importing content from a production wiki is even more awkward... Most of the team is not using Vagrant at this point so the original idea of using a vagrant role doesn't seem useful. Meanwhile @pmiazga and @bmansurov have been using proxies to test RESTbase API queries locally for Page previews. I think there is more merit in this approach personally. Our sample articles should be the entire content of Wikipedia, not just a subset.

Most of the team is not using Vagrant at this point so the original idea of using a vagrant role doesn't seem useful.

… to us. You are right though.

Meanwhile @pmiazga and @bmansurov have been using proxies to test RESTbase API queries locally for Page previews. I think there is more merit in this approach personally.

I think we're both guilty of arguing a position with little concrete evidence. I've found this proxy-based approach is very helpful for testing edge cases that we find in the wild. I've also found creating specific pages to reproduce bugs very helpful. It's always going to be about finding the right balance.

Edit: I think we (Reading Web) don't have the bandwidth to do that right now.