
Gather a set of sample articles to load on test/staging instances.
Closed, Declined (Public)

Description

We need a big sample of articles that is available on every test/staging instance and developer machine so that we can test the same cases and share the same test data.

Such test data should live in a single place so that we can refer to it as our canonical Web-Team-Backlog test data repository.
Suggested place: https://www.mediawiki.org/wiki/Reading/QA/Sample_articles


AC

  • We have documented where this lives and how to import it.
  • Such sample data is available in staging/test environments.
  • The test articles from T137527 are added to mw:Reading/QA/Sample_articles.
  • The mobilefrontend MediaWiki-Vagrant role imports a couple of these test pages.
  • Developers have been notified about such sample data.

Known Unknowns

  • Which pages should be imported by the role?


Event Timeline


Sampling is unnecessarily complex and you'll end up with something non-representative. Just use an entire smallish project in a language you understand, for instance simple.wiki (150 MB) or en.wikiquote (100 MB).

Sampling is unnecessarily complex and you'll end up with something non-representative.

I understand the request as having a sample of enwiki content to test, potentially hand-curated (i.e., making sure the Barack Obama article is in there) based on multiple factors like article popularity, article layout complexity, article length, etc. I don't think a full simple.wiki or enwikiquote would be as useful for those corner cases (though they might have their own interesting corner cases that are worth including as well!).

When people have to set up a wiki by cherry-picking content from another wiki (hence templates, modules, etc.), it usually drives them crazy. If you have dozens of person-hours to invest in this, or only care about a couple of pages, it may work.

@greg exactly. The existence of templates makes frontend dev unnecessarily complicated, and to test bugs it's useful to have the common templates available.

Examples of articles that I have historically had to import manually and that would be useful to bundle:

  • Article with more than one language
  • Article with infobox
  • Article with timeline tag
  • Article with clean up template
  • Article with navbox

We don't actually need an entire dump of a project... for one, it would run extremely slowly on a local vagrant instance. Essentially we need copies of the more commonly used templates/extension tags.

Change 226230 had a related patch set uploaded (by Robmoen):
WIP: Add testarticles role.

https://gerrit.wikimedia.org/r/226230

Note that pages imported via mediawiki::import_dump cannot be updated. mediawiki::import_text was created to work around this (and to have meaningful diffs). It also handles errors (such as two roles trying to create the same page) more nicely. For test cases and demo pages it is, IMO, the better tool.

For importing selected articles from enwiki, instead of checking in huge XML dumps, I would rather see MW-Vagrant become able to import wiki articles on the fly. Something like vagrant import-articles articlelist.txt would be much more flexible than every group working with vagrant creating their own testarticle role.
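
A rough shell sketch of what such an on-the-fly importer could do under the hood. The import-articles command itself does not exist; articlelist.txt (one title per line, underscores instead of spaces) is hypothetical; and whether MediaWiki-Vagrant's existing import-dump command accepts a local XML file like this is an assumption worth checking:

  # Hypothetical: for each listed title, fetch a current-revision XML export from
  # enwiki and hand it to MWV's import-dump command.
  while IFS= read -r title; do
    file="${title//\//_}.xml"   # keep subpage slashes out of the file name
    curl -s "https://en.wikipedia.org/wiki/Special:Export/${title}" -o "$file"
    vagrant import-dump "$file"
  done < articlelist.txt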

T48869 is more or less the same as this.

@Tgr Thanks for the knowledge drop. I agree this would be a better solution than having large dumps in the repo, and it would limit the number of roles. I'm guessing a hit to the API would be best to get the unparsed content? e.g. api.php?action=parse&page=PageName&prop=wikitext. Then import_text as a string?

Importing the same page again would then update it. To get correct diffs, it would need to import each revision from the local one up to the current one, though maybe that could be implemented later on. I'm keen to start working on this, so any additional guidance is appreciated.

I'm guessing a hit to the API would be best to get the unparsed content? e.g. api.php?action=parse&page=PageName&prop=wikitext.

I would guess the revisions API is slightly faster than the parse API (but might be wrong about that). If you are importing an article, you probably want all the dependent templates though, and neither can do that. You would need to get an XML dump via Special:Export (I don't think we have an API for that).
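
For concreteness, hedged examples of the three fetch routes mentioned above; the URLs and parameters are the standard MediaWiki web APIs, and Barack_Obama is just a placeholder title:

  # parse API: returns the page's wikitext
  curl -s 'https://en.wikipedia.org/w/api.php?action=parse&page=Barack_Obama&prop=wikitext&format=json'

  # revisions API: returns the content of the latest revision
  curl -s 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&titles=Barack_Obama&format=json'

  # Special:Export with the templates option: an XML dump that also pulls in
  # every template the page depends on
  curl -s -d 'pages=Barack_Obama&templates=1&curonly=1&action=submit' \
    'https://en.wikipedia.org/w/index.php?title=Special:Export' -o Barack_Obama.xml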

Worth keeping in mind that Vagrant struggles with complicated templates which build on lots of sub-templates and Lua modules. We had to create our own version of Commons' {{Information}} template for the CommonsMetadata role because MediaWiki just timed out trying to parse the template chains.

Then import_text as a string?

Puppet has a declarative syntax so you can't directly return values from commands (resources in the puppet lingo) in roles, but you could save the wikitext/XML dump to a file and then invoke import_text/import_dump on that file.
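
As a sketch of the "save the wikitext to a file" step (jq and the formatversion=2 API output are assumed; the target path under a role's files/ directory is made up):

  curl -s 'https://en.wikipedia.org/w/api.php?action=parse&page=Barack_Obama&prop=wikitext&format=json&formatversion=2' \
    | jq -r '.parse.wikitext' > puppet/modules/role/files/barack_obama.wikitext

A puppet resource could then point import_text at that file (or import_dump at a saved XML export).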

If you want to do this as a vagrant command instead of a puppet role, you would probably have an easier time conceptually but would have to reimplement the import logic in Ruby. I'm not really familiar with that; @dduvall might be a good person to ask.

Importing the same page again would then update it.

This would work with import_text but not import_dump (if you want to do it as a role - again, probably not problematic as a vagrant command). Vagrant applies roles every once in a while, and has to decide which operations to redo, so they all have bailouts to save time. import_dump just checks if the page already exists.

IMO if you import articles from a live wiki, updating is not a concern. The results are not going to be reproducible anyway, because the page context depends on the exact time of setting up your vagrant box; and boxes are not meant to last for long.

It seems roles are the way to go. Since labs-vagrant doesn't contain the mediawiki-vagrant plugin, commands like import-dump (and the potential for new commands) are not available there.
Going with a role, there is still the issue of how best to fetch the dump and import it. It seems a lot of effort has already gone into this without reaching an acceptable solution.

Jdlrobson added a subscriber: rmoen.

@rmoen said he's back to square one. @dduvall, would you or anyone you can recommend be free to work together and get something up and running here? I think it would benefit a lot of teams.

I suppose we could do something like this (for each page; below only Obama's page is given):

curl -d "pages=Barack_Obama&dir=desc&offset=2015-01-01T00:00:00Z&limit=5&templates&action=submit" "https://en.wikipedia.org/w/index.php?title=Special:Export" -o download.xml

This downloads the history of "Barack Obama" starting from January 1, 2015 and saves the next 5 revisions to download.xml (this makes sure that everyone has the same version rather than whatever the most up-to-date version of the page happens to be). We can then use mediawiki::import_dump to import the file.
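
For trying the import out by hand before puppetizing it, two hedged options; both assume a MediaWiki-Vagrant box, and the wiki name and the /vagrant/ path are the MWV defaults as far as I know:

  # From the host, using the MWV plugin command mentioned earlier:
  vagrant import-dump download.xml

  # Or inside the VM, using the core maintenance script:
  mwscript importDump.php --wiki=wiki /vagrant/download.xml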

Note that pages imported via mediawiki::import_dump cannot be updated.

I'm not sure why that matters in our case? We want to create a fixed set of articles so that anyone on the team can reproduce bugs on the same page. Come to think of it, it's preferable that these imported pages are not editable.

I think the next steps are to identify the pages we want to import and to write a puppet manifest (which I don't know how to do yet).

The following are the pages @Jdlrobson wanted in T62116#995367:

  • A page with a complete infobox, e.g. a clone of the Barack Obama page (including all templates)
  • A page for working with Wikidata, e.g. a clone of the Albert Einstein Wikidata item (installed in the wikidata instance) and the linked Wikipedia entry
  • A page with clean-up templates
  • A page with alternative languages, e.g. San Francisco
  • All existing wikidata.org property definitions

Feel free to add your suggested pages here.

Does the above sound reasonable?

I'm not sure why that matters in our case? We want to create a fixed set of articles so that anyone on the team can reproduce bugs on the same page. Come to think of it, it's preferable that these imported pages are not editable.

They are editable but cannot be updated by updating the dump file. That might or might not be problematic depending on how you plan to use them (e.g. if you plan to add new pages for regression tests then it's a bad thing; if you just want some random article so everyone uses the same blob of HTML for performance tests then there is no reason to update).

Personally, I would still recommend avoiding the XML dump and instead getting the article text via action=raw&templates=expand. MediaWiki templates can be pretty hard on Vagrant (plus you will need to install Lua and parserfunctions and who knows what else). Having the servers do the hard work is a lot more convenient (unless you want to test editing, in which case you will need those anyway).
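
A minimal example of that fetch (placeholder title; the output is the article's wikitext with all templates already expanded server-side):

  curl -s 'https://en.wikipedia.org/w/index.php?title=Barack_Obama&action=raw&templates=expand' \
    -o Barack_Obama.expanded.wikitext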

Jdlrobson edited a custom field.

Somewhat related: T26774 (expandtemplates does not work in <ref>)

@Jdlrobson any way of importing the article images too? Or is that impossible with the current proposed approach?

Change 226230 abandoned by Robmoen:
WIP: Add testarticles role.

https://gerrit.wikimedia.org/r/226230

Yup, if you have wgInstantCommons on (which I think is the default), images will just work.

There is no easy way to import images in the strict sense, short of using the API to get a list of them and then uploading them one by one. That could be scripted and puppetized, but it would be a lot of work, and it would be fragile because your vagrant box would then have to deal with thumbnailing huge images, converting various image formats, and so on. If you are not specifically testing media handling capabilities, you are probably not interested in that. Instead, the source wiki can just be defined as a remote file repository, so images would still show up but would be thumbnailed remotely.
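
For reference, a minimal way to switch that on in MediaWiki-Vagrant: the settings.d/ drop-in directory is the MWV convention for local config, and the underlying MediaWiki setting is $wgUseInstantCommons:

  # Serve images from Commons as a remote file repository instead of importing them.
  printf '<?php\n$wgUseInstantCommons = true;\n' > settings.d/10-InstantCommons.php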

@Jdlrobson @rmoen @bmansurov I'd like to move this so that we can work on it in the future, but I'm not sure there's a clear path to achieving the acceptance criteria I outlined.

Can you have a look and discuss whether we're ready to implement this, or whether it needs more discussion? What can and can't be accomplished?

There is a clear path forward => T107770 :)

Jhernandez lowered the priority of this task from High to Medium. Aug 14 2015, 9:18 AM
Jhernandez updated the task description. (Show Details)

Split the vagrant-specific part of this into T118549. I intend to work on it if no one else picks it up, but not anytime soon.

Jhernandez lowered the priority of this task from Medium to Low. Jul 27 2016, 4:14 PM
Jhernandez moved this task from Incoming to Triaged but Future on the Web-Team-Backlog board.

I relaxed the constraint that these pages also be available on BC.

The plan ("implementation notes") was superseded by mediawiki::import::url (see T118549: Import live wiki pages into MediaWiki-Vagrant ).

It seems T156408 might make this proposal redundant as we'd have all production articles for testing purposes...

It seems T156408 might make this proposal redundant as we'd have all production articles for testing purposes...

For MobileFrontend, yes… but not for PP and RelatedArticles.

It's my understanding we can already tweak PP to use a RESTBase URL other than local (which is how Baha tested the RESTBase integration)
We would have to do the same for RelatedArticles (getting that on RESTBase would be a good idea..)

(and the ContentProvider can be easily tweaked to run on the desktop skin using onOutputPageBeforeHTML...)

It's my understanding we can already tweak PP to use a RESTBase URL other than local (which is how Baha tested the RESTBase integration)
We would have to do the same for RelatedArticles (getting that on RESTBase would be a good idea..)

(and the ContentProvider can be easily tweaked to run on the desktop skin using onOutputPageBeforeHTML...)

There's definitely the facility to make our codebases load content from a production server but should we? It's easy/familiar (we write code all the time!) but it's inherently awkward, right?

mediawiki::import::url makes adding articles to the environment easy and repeatable for developers who use MWV, and doesn't require any additions to any codebase. Now, I understand that not everyone uses MWV for local development, but our staging server and other temporary servers hosted on Wikimedia Labs do.

I'd argue importing content from a production wiki is even more awkward... Most of the team is not using Vagrant at this point so the original idea of using a vagrant role doesn't seem useful. Meanwhile @pmiazga and @bmansurov have been using proxies to test RESTbase API queries locally for Page previews. I think there is more merit in this approach personally. Our sample articles should be the entire content of Wikipedia, not just a subset.

Most of the team is not using Vagrant at this point so the original idea of using a vagrant role doesn't seem useful.

… to us. You are right though.

Meanwhile @pmiazga and @bmansurov have been using proxies to test RESTbase API queries locally for Page previews. I think there is more merit in this approach personally.

I think we're both guilty of arguing a position with little concrete evidence. I've found this proxy-based approach is very helpful for testing edge cases that we find in the wild. I've also found creating specific pages to reproduce bugs very helpful. It's always going to be about finding the right balance.

Edit: I think we (Reading Web) don't have the bandwidth to do that right now.