
Assist with maintaining aggregate values in numerical tables
Open, Needs Triage, Public

Description

From: https://www.facebook.com/groups/wikipediaweekly/permalink/2735507239830423/

@Doc_James wrote:
"Keeping our COVID19 numbers uptodate is a pain. We need a "SUM" tool for tables within Wikimedia. https://en.wikipedia.org/wiki/Template:2019%E2%80%9320_coronavirus_outbreak_data"

Event Timeline

With T247877 closed, you can now see the sum and average when selecting multiple numeric cells in VE.

This would be a relatively straightforward task for a Lua script if the data were in some kind of machine-readable format. Filed T250065: Store tabular data in a format that's machine-readable and can be shared between wikis about that.

This depends on the data storage issue @eprodromou and @Tgr are working on.

Abit added a subscriber: kaldari.

Or something could be done for a few specific tables. @kaldari is looking into it (no longer than a day) and will report back.

if the data were in some kind of machine-readable format

Yes, the hardest part of this is formatting and parsing numbers based on local formats, e.g. 1,000.00 vs 1.000,00

Yes, the hardest part of this is formatting and parsing numbers based on local formats, e.g. 1,000.00 vs 1.000,00

Formatting seems like a solved problem (formatnum etc.). Parsing is hard in general; if the assumption is that data is stored in English Wikipedia and reused by other wikis, so only the enwiki format needs to be understood, then it's manageable, I think.

Also Language::parseFormattedNumber could be exposed to Lua or whatever does the parsing. That should work assuming the numbers themselves are formatted with formatnum and not in some unreliable manual way.
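To make that concrete: with the enwiki-only assumption, the parsing side is a few lines of Lua. The sketch below is hypothetical (parseEnwikiNumber is a made-up name) and assumes comma group separators, a period decimal point, and the inline refs and comments these tables tend to carry; the commented-out line shows the shape a language-aware binding like the one proposed above could take.

```lua
-- A minimal sketch, assuming enwiki number formatting (comma as group
-- separator, period as decimal point). parseEnwikiNumber is a made-up name.
local function parseEnwikiNumber( s )
	-- Strip <ref>...</ref> tags and HTML comments that often sit next to figures.
	s = s:gsub( '<ref[^>]*>.-</ref>', '' ):gsub( '<!%-%-.-%-%->', '' )
	-- Drop the group separators and convert; extra parens discard gsub's count.
	return tonumber( (s:gsub( ',', '' )) )
end

-- With a language-aware binding along the lines proposed above, this would
-- collapse to something like:
--   local n = lang:parseFormattedNumber( s )
```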

@Doc_James - If the assumption is that the data is in Wikitext (and not JSON or something else), it seems like the best solution to this problem would be to generate the entire table from a single template, and have a Lua module calculate the totals based on the parameters passed to the template for each country. The big downside to this solution is that editors would no longer be able to use the VisualEditor table editor to edit the country data. And like Ed and Gergo mention above, dealing with number formatting is going to be a problem for any potential solution.
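A rough sketch of what that single-template approach could look like, with made-up module and parameter names (Module:CovidTable, countryN/casesN); the total falls out for free because the module sees every figure:

```lua
-- Module:CovidTable (hypothetical name): builds the whole table from the
-- parameters of one wrapper template and appends a computed total row.
local p = {}
local lang = mw.language.getContentLanguage()

function p.render( frame )
	local args = frame:getParent().args
	local rows, total = {}, 0
	local i = 1
	-- Assume numbered parameter pairs: country1/cases1, country2/cases2, ...
	while args[ 'country' .. i ] do
		-- Strip group separators before summing; extra parens discard gsub's count.
		local n = tonumber( ((args[ 'cases' .. i ] or ''):gsub( ',', '' )) ) or 0
		total = total + n
		rows[ #rows + 1 ] = '|-\n| ' .. args[ 'country' .. i ] .. ' || ' .. lang:formatNum( n )
		i = i + 1
	end
	rows[ #rows + 1 ] = '|-\n! Total !! ' .. lang:formatNum( total )
	return '{| class="wikitable"\n! Country !! Cases\n'
		.. table.concat( rows, '\n' ) .. '\n|}'
end

return p
```

The wrapper template would be little more than {{#invoke:CovidTable|render}}, so editors still edit plain template parameters, just not through VE's table editor.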

A cleaner long-term solution would be to keep all the data in JSON, but that would require tackling T248897 and also building some kind of transcludable table output in either the JsonConfig or Graph extension.
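For what it's worth, the consuming side of the JSON approach is already small. This sketch assumes a hypothetical Data:Covid cases.tab page on Commons, read through JsonConfig's Lua interface; the missing piece the paragraph above points at is the transcludable table output, not the arithmetic.

```lua
-- Sketch: sum one column of a JsonConfig tabular page. The page name and
-- the column index are assumptions.
local tab = mw.ext.data.get( 'Covid cases.tab' )

local total = 0
for _, row in ipairs( tab.data ) do
	-- tab.data is an array of rows; assume column 2 holds the case counts.
	total = total + ( tonumber( row[ 2 ] ) or 0 )
end
```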

Two other options:

  • Write a wikitable parser in Lua. (This sounds horrible but is actually only mildly horrible. Find the start and end of the table body (these don't change, so they could be marked with a comment), tokenize for |, ||, |-, <ref>, </ref>, <!--, -->, and process the stream with a simple state machine (in the table / in a comment / in a reference). There are a number of other states in wikitext, like being inside a template or another extension tag, but it's reasonable to assume those won't ever be used in these tables.) For every wikitable, write a module which is just an invocation of the parser on that table. Use mw.loadData to load that module (that ensures parsing is only done once per request). Provide something along the lines of {{#invoke:tablefunctions|sum|title=Template:SomeDataTable|column=5}} that calls the appropriate module, gets the parsed table, and sums up the data. (That can be safely used inside the table - it doesn't parse the table, so there's no recursion.) A condensed sketch follows this list.
  • Use a machine-readable format which has sane diffs (JSON, CSV, Lua table). Accept it won't be editable via VisualEditor; instead, create some one-off editing interface on Toolforge with OAuth-based edits. There are probably free JavaScript libraries that provide spreadsheet-like behavior. References complicate things but can maybe be hacked in somehow.
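A condensed sketch of the first option, under its stated assumptions (no templates or nested tables in the body). The module name and marker comments are made up, and reading the wikitext directly via getContent stands in for the per-table mw.loadData modules described above:

```lua
-- Module:Tablefunctions (hypothetical name): a deliberately naive wikitable
-- parser and column summer along the lines of the first option above.
local p = {}

local function parseTable( wikitext )
	-- The marker comments are assumed to be placed by hand around the body.
	local body = wikitext:match( '<!%-%- data%-start %-%->(.-)<!%-%- data%-end %-%->' ) or ''
	-- Drop comments and references, then split rows and cells. A fuller
	-- version would run the token-by-token state machine described above.
	body = body:gsub( '<!%-%-.-%-%->', '' ):gsub( '<ref[^>]*>.-</ref>', '' )
	local rows = {}
	for line in body:gmatch( '[^\n]+' ) do
		if line:match( '^|%-' ) then
			rows[ #rows + 1 ] = {}  -- "|-" starts a new row
		elseif line:match( '^|' ) and #rows > 0 then
			for cell in line:sub( 2 ):gmatch( '[^|]+' ) do  -- crude: treats | and || alike
				local row = rows[ #rows ]
				row[ #row + 1 ] = mw.text.trim( cell )
			end
		end
	end
	return rows
end

function p.sum( frame )
	local wikitext = mw.title.new( frame.args.title ):getContent()
	local column = tonumber( frame.args.column )
	local total = 0
	for _, row in ipairs( parseTable( wikitext ) ) do
		-- Extra parens discard gsub's second return value.
		total = total + ( tonumber( ((row[ column ] or ''):gsub( ',', '' )) ) or 0 )
	end
	return mw.language.getContentLanguage():formatNum( total )
end

return p
```

Invoked as {{#invoke:tablefunctions|sum|title=Template:SomeDataTable|column=5}}, matching the shape sketched above.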

The big downside to this solution is that editors would no longer be able to use the VisualEditor table editor to edit the country data.

Use a machine-readable format which has sane diffs (JSON, CSV, Lua table). Accept it won't be editable via VisualEditor

This will push more content to become inaccessible to new contributors using VE (about 40% of editors' first 100 edits are made with VE), which I don't think is a good thing.

Write a wikitable parser in Lua.

Instead of re-implementing one, maybe this could be exposed to Lua by the parser?

This will push more content to become inaccessible to new contributors using VE (about 40% of editors' first 100 edits are made with VE), which I don't think is a good thing.

...unless T248897 creates an equally user-friendly table data editor that VE can redirect users to.

Use a machine-readable format which has sane diffs (JSON, CSV, Lua table). Accept it won't be editable via VisualEditor

This will push more content to become inaccessible to new contributors using VE (about 40% of editors' first 100 edits are made with VE), which I don't think is a good thing.

If there's a dedicated editing tool, it wouldn't really matter.

...unless T248897 creates an equally user-friendly table data editor that VE can redirect users to.

That's about editing JSON pages on Commons, though. Currently the data is in wikitables in templates on enwiki. At a minimum, enwiki would have to enable tabular data pages, because enwiki editors are unlikely to buy into maintaining that data on a foreign wiki. And then we'd need a way to convert those JSON tables back into wikitables, and deal with granular source notations (cf T250919: Add row/cell annotations to tabular data). A better JSON editor would be cool, but I doubt it would help with short-term covid-19 data issues.
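For scale, the "convert those JSON tables back into wikitables" step is the easy part; a bare-bones renderer (again assuming a .tab page read via JsonConfig's mw.ext.data.get) could look like the sketch below, and it's exactly the row/cell annotations of T250919 that have no slot in it.

```lua
-- Sketch: render a JsonConfig tabular page as a plain wikitable.
-- Assumes no null cells; per-cell source refs (cf. T250919) have nowhere to go.
local function renderTab( pageName )
	local tab = mw.ext.data.get( pageName )
	local out = { '{| class="wikitable"' }
	local header = {}
	for _, field in ipairs( tab.schema.fields ) do
		header[ #header + 1 ] = field.name  -- header row from the schema
	end
	out[ #out + 1 ] = '! ' .. table.concat( header, ' !! ' )
	for _, row in ipairs( tab.data ) do
		out[ #out + 1 ] = '|-'
		out[ #out + 1 ] = '| ' .. table.concat( row, ' || ' )
	end
	out[ #out + 1 ] = '|}'
	return table.concat( out, '\n' )
end
```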

Write a wikitable parser in Lua.

Instead of re-implementing one, maybe this could be exposed to Lua by the parser?

Fully parsing the page seems like a lot of overhead and could result in recursion, and parser code is not easy to interact with. I have considered exposing tables during normal parsing as some kind of metadata (extension data in the parser cache, for example), but tables can contain complex wikitext (including other tables), so the parser probably couldn't do anything better than treating the contents as pure wikitext or pure HTML (and even that might not be easy, given the parser's reliance on regular expressions). So maybe that would be doable for numbers, but references probably wouldn't survive the process.

(This sounds horrible but is actually only mildly horrible. Find the start and end of the table body (these don't change, so they could be marked with a comment), tokenize for |, ||, |-, <ref>, </ref>, <!--, -->, and process the stream with a simple state machine (in the table / in a comment / in a reference). There are a number of other states in wikitext, like being inside a template or another extension tag, but it's reasonable to assume those won't ever be used in these tables.)

I'm concerned that there are enough edge cases in common table usage that it'd get difficult for us. Most difficult for this: tables can have cells with weird colspans/rowspans, or can be assembled from templates that provide entire rows/cells.

From a sum perspective, there are also lots of tables where the values aren't actually in a convenient format for us to sum up. E.g. in the linked coronavirus table, the numbers are actually stored as {{formatnum:{{Sum|126787<!-- US overall aggregate -->|0<!-- Puerto Rico -->|-128<!-- Guam -->|-12<!-- Northern Mariana Islands -->|-51<!-- U.S. Virgin Islands -->}}}}. Again, the Lua parser would have to get pretty smart about templates in order to do something useful here.

Obviously, we can say "this tool works on very tightly defined tables, where you can't use lots of common features, including {{formatnum}}"... but if we're restricting it that much then there's not really much of an advantage to sticking with wikitext rather than just going to JSON and having a structured editor there.

I'm concerned that there are enough edge cases in common table usage that it'd get difficult for us. Most difficult for this: tables can have cells with weird colspans/rowspans, or can be assembled from templates that provide entire rows/cells.

Colspan/rowspan is unusual in data tables. You are right about templates (some other tables I have checked had a relatively straightforward data format, though). Preprocessing the wikitext is a way to deal with that, but it makes parsing much more expensive and also opens up the possibility for recursion.
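The preprocessing step itself is one Scribunto call; both the cost and the recursion risk mentioned above live in that call, since the expanded body may contain the very #invoke doing the summing. A sketch, reusing the hypothetical parseTable from earlier:

```lua
-- Sketch: expand templates such as {{formatnum:...}} or {{Sum|...}} in the
-- table body before parsing. frame:preprocess runs a full preprocessor pass
-- on every call, and recurses if the body contains the summing #invoke itself.
local expanded = frame:preprocess( body )
local rows = parseTable( expanded )
```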

Obviously, we can say "this tool works on very tightly defined tables, where you can't use lots of common features, including {{formatnum}}"... but if we're restricting it that much then there's not really much of an advantage to sticking with wikitext rather than just going to JSON and having a structured editor there.

The advantage is that you don't need to convince hundreds of editors to adopt a new workflow involving a tool that has nowhere near feature parity. JSON is probably the sane long-term solution; I'm just skeptical about it being feasible in the short term.

Tgr removed Tgr as the assignee of this task. Aug 23 2022, 6:01 AM