
Create a Generic List-building tool that can meet and exceed the applications of Pagepile
Open, Needs Triage, Public

Description

Pagepile is a really powerful way of exchanging data in the Wikimedia ecosystem -- it allows sets of content pages to be passed meaningfully between tools like Petscan, Tabernacle, the Programs and Events dashboard, Citation Hunt, AC/DC and others. Wikimedia content is constantly changing, so a query or list generated at one point in time can very likely return different results the next day using the exact same strategy. Having a stable list of content lets end users do more consistent batch processing, and tracking and support can be done against that same stable list.

However, because Pagepile is a Magnus tool and is kind of obscure for folks who don't use batch processing tools, creating a stable content list is often not very "easy" or user friendly: folks output data in unstable or poorly integrated formats like Wikidata queries, on-wiki text lists, and CSV files, which then have to be reconverted (typically through something like Petscan) to be used elsewhere in the Wikimedia system. Moreover, because it is a volunteer-supported service, we can't, for instance, rely on such a list in core platform features (i.e. if we build the neighbourhoods described in the Audiences strategy papers, if we support great content campaigns for projects on Commons through upload campaigns, etc.).

Moreover, pagepiles are static -- they are output once, and the only way to update one is to "regenerate" the list as a new output, even if it's just an extension of the previous list (there is no "list history" so to speak -- each pile is a separate, independent object). This means that "list owners" can't lean on these lists to keep multiple other locations up to date; instead you have to recreate the list and manually update its uses elsewhere. This kind of application is partially solved by tools like Listeria, but that is entirely dependent on the Wikidata Query Service, which only covers a subset of the inputs you might use to form a list (alongside tools like PattyPan, Quarry or manual work).

WMF has already begun integrating a content list tool for the ContentTranslation campaigns (see T96147), but generalizing such a tool beyond the ContentTranslation environment is not on a higher-level roadmap. List building is a core behavior of almost every type of knowledge creation, maintenance and reporting in the movement -- and without a core-supported, portable format for exchanging those lists across the projects, a content list from a translation campaign is not very useful for reporting/metrics, or, for example, for generating a list for illustrating those same articles across multiple languages, or for monitoring that same set of content for recent changes.

Is anyone thinking about these opportunities or issues?

P.S. We built a tool that demonstrates the kinds of applications that a more capable exchange format might encourage (i.e. a prototype of a generic worklist separate from Translation, built for a GSoC project with @Surlycyborg and @Meghasharma213: https://phabricator.wikimedia.org/T190555 and https://tools.wmflabs.org/worklist-tool/).

Event Timeline

Astinson renamed this task from "[EPIC] Create a Generic List-building tool that can meet and exceed the applications of Pagepile" to "Create a Generic List-building tool that can meet and exceed the applications of Pagepile". Sep 3 2019, 2:48 PM
Astinson updated the task description.

Moreover, pagepiles are static -- they are output once, and the only way to update one is to "regenerate" the list as a new output, even if it's just an extension of the previous list (there is no "list history" so to speak -- each pile is a separate, independent object). This means that "list owners" can't lean on these lists to keep multiple other locations up to date; instead you have to recreate the list and manually update its uses elsewhere. This kind of application is partially solved by tools like Listeria, but that is entirely dependent on the Wikidata Query Service, which only covers a subset of the criteria you might use to form a list (alongside tools like PattyPan, Quarry or manual work).

Technically, anyone with Toolforge shell access can update a PagePile, I believe – they’re world-writable SQLite files. In the PHP API, this seems to be somewhat intended – you first create a fresh pile and then fill it with data, and nothing stops you from filling it with more data later. That doesn’t seem to be exposed through the web API, though.


I wonder if we need/want external storage for this at all. Could we simply store a list of pages on some wiki, either in Wikitext or as JSON? This gives us a history, a talk page, rate limiting, patrolling, authenticated editing (OAuth), and probably more, almost for free.
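
For illustration, a minimal sketch of what reading such an on-wiki JSON list could look like -- the Action API query itself is standard MediaWiki, but the page title, the fetch_list helper and the {"wiki", "pages"} shape are placeholders I'm assuming here:

import json
import requests

API = "https://meta.wikimedia.org/w/api.php"

def fetch_list(page_title):
    # Fetch the latest revision of a hypothetical JSON list page and parse it.
    resp = requests.get(API, params={
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": page_title,
        "format": "json",
        "formatversion": "2",
    })
    resp.raise_for_status()
    page = resp.json()["query"]["pages"][0]
    content = page["revisions"][0]["slots"]["main"]["content"]
    return json.loads(content)  # e.g. {"wiki": "enwiki", "pages": [...]}

# Hypothetical list page; history, talk page and OAuth editing come from MediaWiki itself.
# pages = fetch_list("User:Example/Lists/WLM-2019.json")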

The biggest argument against this that I see right now is size limitations. On all Wikimedia wikis (I think), pages are limited to 2 MiB; meanwhile, on the current PagePile tool, there are some much larger piles:

lucaswerkmeister@tools-sgebastion-07:~$ find /data/project/shared/pagepile/ -type f -printf '%s\t%p\n' | sort -rn | head -25
33028096        /data/project/shared/pagepile/dir_c0/dir_a8/pagepile5c38eb5a66d6c5.50908789.sqlite
22216704        /data/project/shared/pagepile/dir_37/dir_67/pagepile5a9b224d7a4740.69285768.sqlite
20655104        /data/project/shared/pagepile/dir_b3/dir_dd/pagepile5be259e29e8c28.29418242.sqlite
20558848        /data/project/shared/pagepile/dir_d0/dir_34/pagepile5ca6131e0b0075.33259398.sqlite
20063232        /data/project/shared/pagepile/dir_4c/dir_f0/pagepile5b847a8d892466.94451744.sqlite
19695616        /data/project/shared/pagepile/dir_c9/dir_7b/pagepile5b017e602f6df2.06750058.sqlite
19695616        /data/project/shared/pagepile/dir_04/dir_9f/pagepile5b01839e973d27.13932156.sqlite
19685376        /data/project/shared/pagepile/dir_c7/dir_6a/pagepile598354940431d6.91688443.sqlite
19684352        /data/project/shared/pagepile/dir_16/dir_2e/pagepile5c1ab220017dc4.74301631.sqlite
19593216        /data/project/shared/pagepile/dir_68/dir_7d/pagepile5980821be3cb20.22656761.sqlite
18970624        /data/project/shared/pagepile/dir_4a/dir_23/pagepile58a681e49a7a60.06025608.sqlite
18910208        /data/project/shared/pagepile/dir_6f/dir_b0/pagepile59e9cf70bfe3c0.49748290.sqlite
18707456        /data/project/shared/pagepile/dir_f8/dir_ba/pagepile5b96c993849095.44039590.sqlite
18037760        /data/project/shared/pagepile/dir_3e/dir_af/pagepile597ef6f6411488.80133901.sqlite
18034688        /data/project/shared/pagepile/dir_c3/dir_b4/pagepile5b62bc1505e162.04906998.sqlite
18003968        /data/project/shared/pagepile/dir_a2/dir_b9/pagepile5b62bbe86aef23.72187936.sqlite
17940480        /data/project/shared/pagepile/dir_7c/dir_b0/pagepile5c4c48314f5d82.75477511.sqlite
17757184        /data/project/shared/pagepile/dir_d3/dir_5c/pagepile58adbd4d12bed5.57153231.sqlite
17449984        /data/project/shared/pagepile/dir_f7/dir_0f/pagepile5b62b906c65fa1.54130092.sqlite
17449984        /data/project/shared/pagepile/dir_22/dir_01/pagepile5b62b8f3d32118.29369288.sqlite
17449984        /data/project/shared/pagepile/dir_0c/dir_f7/pagepile5b62b64c0d0c43.19247313.sqlite
17241088        /data/project/shared/pagepile/dir_b2/dir_3d/pagepile5aed0ff4727008.08930187.sqlite
17234944        /data/project/shared/pagepile/dir_75/dir_08/pagepile5bf4f0974884c1.62251549.sqlite
16999424        /data/project/shared/pagepile/dir_23/dir_71/pagepile590e08cd07ba20.65330288.sqlite
16956416        /data/project/shared/pagepile/dir_22/dir_4f/pagepile58fcee8478e590.86060623.sqlite

SQLite might introduce some overhead, but I don’t think it’s so inefficient that a 33 MB database would somehow fit into a 2 MiB wiki page.
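
As a rough back-of-the-envelope check (the ~60 bytes per entry is just an assumed average for a title, a namespace and some JSON punctuation), a 2 MiB page would top out at a few tens of thousands of entries:

PAGE_LIMIT = 2 * 1024 * 1024   # 2 MiB wiki page size limit
AVG_ENTRY_BYTES = 60           # assumed average bytes per list entry

print(PAGE_LIMIT // AVG_ENTRY_BYTES)   # roughly 35,000 entries per page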

That said, only ~3½% of piles are that large:

lucaswerkmeister@tools-sgebastion-07:~$ find /data/project/shared/pagepile/ -type f -printf '%s\n' | awk '{ if ($1 >= (2048*1024)) big++; else small++; } END { print small, big; }'
18783 668

(Though of course the SQLite file size is only an approximation of the corresponding wiki page size, especially as we haven’t decided on any particular format yet.)

Moreover, pagepiles are static -- they are output once, and the only way to update one is to "regenerate" the list as a new output, even if it's just an extension of the previous list (there is no "list history" so to speak -- each pile is a separate, independent object). This means that "list owners" can't lean on these lists to keep multiple other locations up to date; instead you have to recreate the list and manually update its uses elsewhere. This kind of application is partially solved by tools like Listeria, but that is entirely dependent on the Wikidata Query Service, which only covers a subset of the criteria you might use to form a list (alongside tools like PattyPan, Quarry or manual work).

Technically, anyone with Toolforge shell access can update a PagePile, I believe – they’re world-writable SQLite files. In the PHP API, this seems to be somewhat intended – you first create a fresh pile and then fill it with data, and nothing stops you from filling it with more data later. That doesn’t seem to be exposed through the web API, though.


I wonder if we need/want external storage for this at all. Could we simply store a list of pages on some wiki, either in Wikitext or as JSON? This gives us a history, a talk page, rate limiting, patrolling, authenticated editing (OAuth), and probably more, almost for free.

For sure, I think .tab files on Commons could probably be a good fix for something like this, but we would also need a standardized data format, and an API/framework for interacting with the portable list. It's not clear to me that this needs to be a "new" outside-the-platform tool -- it's actually almost preferable for it to live within the platforms (for all the reasons you highlight). That is in part why I wanted to point to T96147 -- it's an in-platform solution that could probably be scaled in a reasonable way.
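
To make that concrete, here's a sketch of what a simple page list could look like as Commons tabular data (a Data:*.tab page). The top-level license/description/schema/data shape follows the tabular data format as I understand it; the page name, columns and rows are just made-up examples:

import json

page_list = {
    "license": "CC0-1.0",   # Data: pages require CC0
    "description": {"en": "Example page list exported from PetScan"},
    "schema": {
        "fields": [
            {"name": "title", "type": "string", "title": {"en": "Page title"}},
            {"name": "namespace", "type": "number", "title": {"en": "Namespace"}},
            {"name": "wikidata", "type": "string", "title": {"en": "Wikidata item"}},
        ]
    },
    "data": [
        ["Douglas Adams", 0, "Q42"],
        ["Ada Lovelace", 0, "Q7259"],
    ],
}

# This JSON would be the content of e.g. Data:Lists/Example.tab on Commons.
print(json.dumps(page_list, indent=2))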

Also realizing that I never pinged @Magnus in the first note :P

That said, only ~3½% of piles are that large:

lucaswerkmeister@tools-sgebastion-07:~$ find /data/project/shared/pagepile/ -type f -printf '%s\n' | awk '{ if ($1 >= (2048*1024)) big++; else small++; } END { print small, big; }'
18783 668

(Though of course the SQLite file size is only an approximation of the corresponding wiki page size, especially as we haven’t decided on any particular format yet.)

Yeah, and, in my opinion, the goal wouldn't be to erase Pagepile or other big-set tools -- it would just be to create something that is a bit more organic and integrated into the platforms, so that the "average Wikimedian" can work with such lists on the platform. Big sets are probably going to continue to be best handled by something like PagePile, and lists that don't need to be reused or updated would probably work well with PagePile too.

OK, some initial thoughts and remarks on this:

  • I have actually rewritten Listeria in Rust, to use the Commons Data: namespace (aka .tab files) to store the lists, and use Lua to display them.
  • I think the Commons Data: namespace would technically work for a generalized "list storage", though it seems to be a bit of abandonware (will this feature be long-term supported by the WMF?)
  • Commons Data: namespace, if supported, would also have the proper scaling, caching etc. that PagePile is lacking
  • It should, in principle, be possible to change PagePile to write new piles to the Commons Data: namespace, and return queries from there. That would give the new list storage a running start. We can replace PagePile later.
  • Drawbacks of Commons Data: namespace are (a) cell size limit (400 characters, so should work for simple page lists), and (b) total page size (thus limiting the max list length)
  • If Labs were to offer a scalable, backed-up object store for tools, that might be better suited for general list management
  • Much of the "average Wikimedian" integration will have to come from (user-supplied) JavaScript, such as "snapshot this category tree" or something. I doubt waiting for WMF would be a timely solution.
  • Short term, we (I?) could write a slim web API on Labs that abstracts the implementation away, offering a to-be-discussed set of functions (create/amend/remove list etc). Initially, this could run on PagePile in the background, or Commons Data: namespace, or even both (large lists go to pagepile, short ones into a MySQL database or Commons Data: namespace, etc.)
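
To sketch what that slim web API might look like (the route names, payload shapes and the in-memory store below are just illustrative stand-ins, with no error handling; the real backend could be PagePile, the Commons Data: namespace, MySQL, or a mix):

from uuid import uuid4
from flask import Flask, jsonify, request

app = Flask(__name__)
LISTS = {}  # stand-in for the real backend (PagePile / Commons Data: / MySQL)

@app.post("/lists")
def create_list():
    # Create a new list from a JSON body like {"wiki": "enwiki", "pages": [...]}
    body = request.get_json(force=True)
    list_id = str(uuid4())
    LISTS[list_id] = {"wiki": body.get("wiki"), "pages": list(body.get("pages", []))}
    return jsonify({"id": list_id}), 201

@app.post("/lists/<list_id>/pages")
def amend_list(list_id):
    # Append pages to an existing list (the "amend" operation)
    LISTS[list_id]["pages"].extend(request.get_json(force=True)["pages"])
    return jsonify({"size": len(LISTS[list_id]["pages"])})

@app.get("/lists/<list_id>")
def get_list(list_id):
    return jsonify(LISTS[list_id])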

Started some design notes of such a product: https://meta.wikimedia.org/wiki/Gulp

Ooooh, thank you Magnus, that's a really great first pass at thinking about that. For reuse by something like Listeria or Tabernacle, would it make sense to store the associated Wikidata item (or also the Commons MediaInfo ID?) with the page, so that you wouldn't have to query those pages in order to do things like add properties? One thing that kind of "fails" for me in the user experience of the current Petscan->Pagepile->Tabernacle workflow (and I am thinking this might be true in other workflows as well) is that the end tool expects _only_ Wikidata items, so if I don't generate a Wikidata list first, the tool either needs to have code to retrieve that or you have to generate a new list. If there was a second column with the optional Wikidata ID, it would probably make lists made with one wiki in mind more portable.

Started some design notes of such a product: https://meta.wikimedia.org/wiki/Gulp

For the List data structure, in addition to (or as part of) the Description: would it make sense to require a field for the "Source" of the data (i.e. a Petscan ID, a short URL for a query, etc.), so that anyone "seeing" the pile could go to it, recreate the query/input, and modify it? (kind of like how folks use the Listeriabot lists).

Minimum viable product

  • Import from various sources
    • All sources offered in PagePile
  • Export to various places
    • All consumers offered in PagePile

How is this supposed to work? As far as I can tell, these imports and exports would have to go through the PagePile tool in some form, so to me these read like requirements that can only be fulfilled by one person: the PagePile maintainer.

Minimum viable product

  • Import from various sources
    • All sources offered in PagePile
  • Export to various places
    • All consumers offered in PagePile

How is this supposed to work? As far as I can tell, these imports and exports would have to go through the PagePile tool in some form, so to me these read like requirements that can only be fulfilled by one person: the PagePile maintainer.

I'll let him know ;-)

I was more thinking of New Tool X long-term replacing PagePile, so the "target" tools will have to support X. We can probably pull as much as PagePile can from available sources.

Started some design notes of such a product: https://meta.wikimedia.org/wiki/Gulp

For the List data structure, in addition to (or as part of) the Description: would it make sense to require a field for the "Source" of the data (i.e. a Petscan ID, a short URL for a query, etc.), so that anyone "seeing" the pile could go to it, recreate the query/input, and modify it? (kind of like how folks use the Listeriabot lists).

Already penciled in under "Datasources" as "Optional:Source"

Started some design notes of such a product: https://meta.wikimedia.org/wiki/Gulp

Ooooh, thank you Magnus, that's a really great first pass at thinking about that. For reuse by something like Listeria or Tabernacle, would it make sense to store the associated Wikidata item (or also the Commons MediaInfo ID?) with the page, so that you wouldn't have to query those pages in order to do things like add properties? One thing that kind of "fails" for me in the user experience of the current Petscan->Pagepile->Tabernacle workflow (and I am thinking this might be true in other workflows as well) is that the end tool expects _only_ Wikidata items, so if I don't generate a Wikidata list first, the tool either needs to have code to retrieve that or you have to generate a new list. If there was a second column with the optional Wikidata ID, it would probably make lists made with one wiki in mind more portable.

There are some possibilities:

  • "fixed" second column
  • part of metadata
  • on-the-fly conversion when requesting data ("convert to wikidata/frwiki" etc)
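
For the on-the-fly option, resolving titles to items can lean on the standard pageprops query; a quick sketch (the source wiki and example titles are just placeholders, and batching/continuation is left out):

import requests

def titles_to_qids(titles, api="https://en.wikipedia.org/w/api.php"):
    resp = requests.get(api, params={
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": "|".join(titles),   # up to 50 titles per request
        "format": "json",
        "formatversion": "2",
    })
    resp.raise_for_status()
    return {
        page["title"]: page.get("pageprops", {}).get("wikibase_item")
        for page in resp.json()["query"]["pages"]
    }

# titles_to_qids(["Douglas Adams", "Ada Lovelace"])
# -> {"Douglas Adams": "Q42", "Ada Lovelace": "Q7259"}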

This is great, thank you for jumping on this @Magnus !

  • Much of the "average Wikimedian" integration will have to come from (user-supplied) JavaScript, such as "snapshot this category tree" or something. I doubt waiting for WMF would be a timely solution.
  • Short term, we (I?) could write a slim web API on Labs that abstracts the implementation away, offering a to-be-discussed set of functions (create/amend/remove list etc). Initially, this could run on PagePile in the background, or Commons Data: namespace, or even both (large lists go to pagepile, short ones into a MySQL database or Commons Data: namespace, etc.)

A concrete example of "average Wikimedian" integration is the GSoC project @Astinson mentioned above, so that's a use case we can take into account when designing the MVP. I could even see the worklist tool being *the* Gulp frontend, since it should provide a UI for creating and viewing lists anyway, and only needs to store some additional entry metadata (who's working on this entry, how much progress has been made...) and allow that metadata to be edited by multiple people.

That is to say, if we want to converge that way, I'd argue some API should be part of the Gulp MVP to support that, and I (or maybe @Meghasharma213) can chime in on the requirements for that, and also help build it.
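
To make that concrete, the worklist tool's side of it might look roughly like this (the base URL and routes are hypothetical, mirroring the API sketch above; the per-entry assignee/status metadata is the extra state the worklist tool would keep itself):

import requests

GULP_API = "https://gulp.example.toolforge.org"   # hypothetical endpoint

def start_worklist(wiki, pages):
    # Create a shared list in Gulp, then track per-entry progress locally
    resp = requests.post(f"{GULP_API}/lists", json={"wiki": wiki, "pages": pages})
    resp.raise_for_status()
    list_id = resp.json()["id"]
    progress = {title: {"assignee": None, "status": "open"} for title in pages}
    return list_id, progress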

Another idea came to me:
What if it's not just "page lists", but any (generic, or one of a set of pre-defined types of) tables?
One table type would be "page title/page namespace", giving us the above lists.
Others could be, say, Mix'n'match catalogs ("external ID/url/name/description/instance of").
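
Roughly, each pre-defined type would just fix the expected columns on top of a generic table -- a sketch (the representation here, a name plus column list, is just for illustration):

from dataclasses import dataclass

@dataclass(frozen=True)
class TableType:
    name: str
    columns: tuple  # (column_name, column_type) pairs

PAGE_LIST = TableType("page_list", (
    ("page_title", "string"),
    ("page_namespace", "number"),
))

MIXNMATCH_CATALOG = TableType("mixnmatch_catalog", (
    ("external_id", "string"),
    ("url", "string"),
    ("name", "string"),
    ("description", "string"),
    ("instance_of", "string"),
))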

Another idea came to me:
What if it's not just "page lists", but any (generic, or one of a set of pre-defined types of) tables?
One table type would be "page title/page namespace", giving us the above lists.
Others could be, say, Mix'n'match catalogs ("external ID/url/name/description/instance of").

This may have a lot of interesting applications. I mean, at the very least you have two well-defined ones there, and if the tool is designed for that flexibility from the beginning, I could imagine a number of other applications (i.e. doing something with geo-shapes, annotations, etc.). As you think about this, though, I would make sure that we aren't losing the core application of pushing this data into on-wiki use cases.

Adding a few more people who might have ideas on applications or needs: @Fuzheado, @Multichill @SandraF_WMF @Lokal_Profil @Yarl

Adding to the WMSE-Tools-for-Partnerships-2019-Blueprinting tracking board, because we are looking at whether something like this would be in scope for that work.

FWIW I started developing a new tool called GULP, now in early development.