Page MenuHomePhabricator

Get code review for PageAssessments extension done
Closed, ResolvedPublic

Description

Pre-requisite for deploying to production

Event Timeline

Niharika created this task.Dec 3 2015, 4:32 PM
Niharika updated the task description. (Show Details)
Niharika raised the priority of this task from to Normal.
Niharika claimed this task.
Niharika added a project: Community-Tech.
Niharika renamed this task from Get code review done for PageAssessments extension done to Get code review for PageAssessments extension done.
Niharika set Security to None.
Niharika added subscribers: DannyH, kaldari, Niharika, Aklapper.
kaldari closed this task as Resolved.Jan 19 2016, 4:14 PM
DannyH moved this task from Bug backlog to Archive on the Community-Tech board.Jan 19 2016, 9:04 PM

This task is a bit difficult to follow. Why is a separate database table being used? Was the core page_props database table insufficient?

MZMcBride reopened this task as Open.Jan 25 2016, 6:29 PM

I'm re-opening this task for now. I'd like to know who did the code review here and what suggestions were made.

Briefly looking at https://github.com/wikimedia/mediawiki-extensions-PageAssessments/blob/master/db/addReviewsTable.sql, the database schema seems a bit weird. For example, we typically use page_title, not page_name.

I don't know why we'd need to store pa_page_title and pa_page_namespace if we're already storing pa_page_id. Duplicating the data seems like an odd decision.

pa_class and pa_importance seem very specific to a particular use-case. Do we really want this extension to be so specific?

If the idea is to have per-revision tagging instead of per-page tagging, we have a core change_tag database table already. Can that be used?

I'd like to see a more thorough architecture review of this extension/idea before it gets deployed.

@MZMcBride: Both page_props and change_tag essentially store 1-dimensional information, although you can overload them by storing JSON objects in the blobs. Overloading them is a bad idea, however, as you can't easily search or index arbitrary blobs. (This is why Wikidata only has the simplest possible API capabilities; all the data is stored in arbitrary blobs rather than dedicated relational tables.) Since we have the luxury of knowing what kind of data we want to store (WikiProject assessment data), we can build a dedicated well-indexed table for it. The other disadvantage of using page_props is that the page_props table is extremely bloated on English Wikipedia. Adding a few million more rows to it wouldn't help things.

Regarding the schema, I would love to have more input from @Springle or @aaron. We asked them to take a look at it a while back, but most of the discussion was about how to handle the large number of initial database writes. I don't have a strong opinion on the denormalization of the page data, but you're probably right that we don't need it.

More discussion can be found at T117142.

kaldari added a comment.EditedJan 25 2016, 9:11 PM

pa_class and pa_importance seem very specific to a particular use-case. Do we really want this extension to be so specific?

We've only been asked to support a single use-case, but if you know of other use-cases, we would definitely be interested in supporting them if possible. Basically this extension could be used for associating any 2-dimensional data with articles. (We already have categories and page properties for associating 1-dimensional data.) Keep in mind that we don't want to abstract it too much, however, as then we are basically creating Wikidata-light. We want it to be as simple and performant as possible for a specific set of use-cases.

DannyH added a comment.Feb 2 2016, 7:57 PM

@kaldari Should this ticket be marked as closed again?

kaldari closed this task as Resolved.Feb 2 2016, 10:31 PM

Yes, all the code has been reviewed. Also, the schema has been updated at https://gerrit.wikimedia.org/r/#/c/267638/. Any further issues should be opened as separate tickets.

@MZMcBride: Both page_props and change_tag essentially store 1-dimensional information, although you can overload them by storing JSON objects in the blobs. Overloading them is a bad idea, however, as you can't easily search or index arbitrary blobs. (This is why Wikidata only has the simplest possible API capabilities; all the data is stored in arbitrary blobs rather than dedicated relational tables.)

For what it's worth, this isn't true at all. Many database engines can now index JSON blobs. Some newer database engines basically do this exclusively.

The other disadvantage of using page_props is that the page_props table is extremely bloated on English Wikipedia. Adding a few million more rows to it wouldn't help things.

What does bloated mean? On the English Wikipedia, pagelinks and revision both have hundreds of millions of rows. Who cares?

pa_class and pa_importance seem very specific to a particular use-case. Do we really want this extension to be so specific?

We've only been asked to support a single use-case, but if you know of other use-cases, we would definitely be interested in supporting them if possible. Basically this extension could be used for associating any 2-dimensional data with articles. (We already have categories and page properties for associating 1-dimensional data.) Keep in mind that we don't want to abstract it too much, however, as then we are basically creating Wikidata-light. We want it to be as simple and performant as possible for a specific set of use-cases.

How would you add additional properties in the future? You'd keep adding database columns? If you want to talk about inefficiency, requiring ALTERs to a table (times 800 for Wikimedia wikis) every time you want to expand the table to meet additional requirements is actually inefficient.

kaldari added a comment.EditedFeb 3 2016, 4:50 PM

For what it's worth, this isn't true at all.

Sorry, but it is true that it isn't easy to search or index arbitrary blobs. If you're just going to accuse me of lying (or else having no idea what I'm talking about), I don't think it's worth my time to respond to your comments. First of all Wikipedia doesn't run the latest and greatest database software. We're currently running MariaDB 10.0 in production. Our software is also expected to run, in theory, on MySQL 5.0.2 and Postgres 9.0. MariaDB supports indexing of JSON through the use of virtual columns, but there are several problems with this. The implementations are incompatible in some cases between MariaDB and MySQL. Plus many popular SQL tools such as mysqldump, phpMyAdmin, SQLyog, etc. don't know how to handle these yet. Whatever database features we use have to be supported by our entire operations and development infrastructure and should be easy for 3rd parties to support as well. Furthermore, MediaWiki code almost never interacts with the database directly. We use an abstraction layer built into the core MediaWiki software which doesn't yet have support for such features (as far as I can tell).

What does bloated mean? On the English Wikipedia, pagelinks and revision both have hundreds of millions of rows. Who cares?

Table bloat has real effects on Wikipedia development and performance, especially for APIs. Even for deferred queries it causes problems. For example, Special:GadgetUsage on English Wikipedia doesn't show stats for active users (like it does on other wikis) because the query takes over 24 hours to run due to the bloating of the en.wiki user_properties table. Having bloated tables is especially bad for APIs since these are supposed to return results in real-time. The primary use of the PageAssessments extension is providing API access to the assessment data. Thus making that performant is a high priority.

How would you add additional properties in the future? You'd keep adding database columns?

Yes. This wouldn't happen very often, so the inefficiency is pretty much negligible.

@MZMcBride: These discussions are getting hard to keep track of. Please direct additional comments about the architecture, implementation, and/or need for this extension to T120219.