
DBA review of Linter extension
Closed, Resolved · Public

Description

Hi DBAs, I would appreciate a review of the Linter extension before it is deployed. It creates a database table; I've copied the definition below (it also lives in the linter.sql file in mediawiki/extensions/Linter):

CREATE TABLE /*_*/linter (
	-- primary key
	linter_id int UNSIGNED AUTO_INCREMENT PRIMARY KEY not null,
	-- page id
	linter_page int UNSIGNED not null,
	-- error category
	linter_cat VARCHAR(30) not null,
	-- extra parameters about the error, JSON encoded
	linter_params blob NOT NULL
) /*$wgDBTableOptions*/;

-- Query by page
CREATE INDEX /*i*/linter_page ON /*_*/linter (linter_page);

Each row represents one "lint error"; there can be multiple lint errors per page (even within the same category). The unique part is technically the 'location' field in the JSON-encoded params, plus linter_page and linter_cat, but because the params are a blob, I just added an auto-increment primary key. Uniqueness is enforced in the code.

I used a JSON blob to store the data because it can contain arbitrary things coming out of the wikitext (position of text, template name, etc.), so it wouldn't have fit in a VARCHAR.

If there's any more information that would be useful in reviewing, please let me know. T148583 has details on how we plan to use sampling to avoid flooding the databases when doing the initial population.

Event Timeline

@Legoktm, here are a few personal questions/comments, not a DBA review 😉

  1. Will all the queries against this table be performed by linter_page?
  2. When an article is edited, will all the linter rows for that article be deleted and replaced with the new linter errors, if any? If not, how does it work?
  3. Based on the answer to the above question, and given that there can be multiple linter rows per article, int might be too small as a primary key in the long term.
  4. What happens if a widely used template is modified, adding/fixing many linter errors? Could this cause a huge number of rows to be written/deleted in cascade?
  5. How is the code enforcing the uniqueness with concurrent edits from different hosts? If the only missing parameter for uniqueness is the location, have you considered extracting it from the JSON and saving it as a column so that the uniqueness could be enforced at the DB level?
  • error category

linter_cat VARCHAR(30) not null,

This seems like a clear candidate to normalize into a separate table with a key, doesn't it?

linter_params seems ok, but please make sure you are not going to just store '{line: 33}'

This seems like a good candidate to deploy initially to x1, and only later, when things are stable, move it back to the s* shards?

That will also avoid any labs issue. I assume all data here is 100% public, and could be published with no problem (no user information on the blob, etc.)? Please confirm otherwise. This is the most critical issue.

On the previous comments by Volans: unsigned int is the standard size for our PKs, and we will probably keep it that way for now; we would increase all of them at the same time when necessary. He brings up a good question, though: should the foreign key here be a page or a revision, and how are writes going to be throttled to avoid 1000 new rows for each parse action?

@Legoktm, here are a few personal questions/comments, not a DBA review 😉

  1. Will all the queries against this table be performed by linter_page?

Kind of. Some of the queries do a join on page and will query by page_namespace, for example "get all lint issues in the article namespace" (that specific query is currently only exposed via the API, but I plan to add it to the UI as well).

We also do a query by only linter_cat (get all lint issues of this type), so we do need an index on linter_cat too.
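To make that concrete, here is a rough sketch of the shape of those queries (illustrative only, not the exact SQL the extension generates; 'obsolete-tag' is just an example category value, and page_id/page_namespace/page_title come from the core page table):

-- "Get all lint issues in the article namespace" (join on page):
SELECT linter_id, linter_cat, linter_params, page_title
FROM linter
JOIN page ON page_id = linter_page
WHERE page_namespace = 0;

-- "Get all lint issues of this type" (this is why we also need an index on linter_cat):
SELECT linter_id, linter_page, linter_params
FROM linter
WHERE linter_cat = 'obsolete-tag';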

  2. When an article is edited, will all the linter rows for that article be deleted and replaced with the new linter errors, if any? If not, how does it work?

Basically, yes. However, we do a diff in the code and only delete those that are not present in the new set of errors, and insert those that weren't already present in the database. The code is at https://github.com/wikimedia/mediawiki-extensions-Linter/blob/master/includes/Database.php#L134 if that's easier to follow.
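As a rough sketch of the effective writes produced by that diff (illustrative only; the linked PHP code is authoritative, and the ids and values here are made up):

-- Remove errors that are no longer present after the re-parse:
DELETE FROM linter
WHERE linter_page = 12345
  AND linter_id IN (101, 102);

-- Add errors that weren't already in the table:
INSERT INTO linter (linter_page, linter_cat, linter_params)
VALUES (12345, 'missing-end-tag', '{"location":[40,52]}');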

  3. Based on the answer to the above question, and given that there can be multiple linter rows per article, int might be too small as a primary key in the long term.

Ack.

  4. What happens if a widely used template is modified, adding/fixing many linter errors? Could this cause a huge number of rows to be written/deleted in cascade?

Yes. I'll address flood prevention more generally below.

  5. How is the code enforcing the uniqueness with concurrent edits from different hosts? If the only missing parameter for uniqueness is the location, have you considered extracting it from the JSON and saving it as a column so that the uniqueness could be enforced at the DB level?

Since you asked, I've been thinking about it more, and I think splitting it into two columns (location start, end) is a better and safer solution. I'll work on that.

  • error category

linter_cat VARCHAR(30) not null,

This seems like a clear candidate to normalize into a separate table with a key, doesn't it?

Good point. I'll implement that.
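For illustration, a hypothetical sketch of that normalization (table and column names here are made up; the revised schema further down this task ended up mapping category strings to ints in PHP instead of adding a lookup table):

CREATE TABLE /*_*/linter_category (
	lc_id int UNSIGNED AUTO_INCREMENT PRIMARY KEY not null,
	lc_name VARCHAR(30) not null
) /*$wgDBTableOptions*/;

-- linter_cat would then become an int referencing linter_category.lc_id.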

linter_params seems ok, but please make sure you are not going to just store '{line: 33}'

Hmm. If we remove the location from this field (as discussed above), then we end up with something like:

{"name":"big","templateInfo":{"name":"very big"}}

Now, in some (maybe many? I'm not sure how many) cases there will be no templateInfo field, so it's just {"name":"big"}, which is basically what you wanted to avoid.

The JSON blob is more flexible in case we want to add more data in the future, but right now I don't have any concrete plans, just some vague ideas.

This seems like a good candidate to deploy initially to x1, and only later, when things are stable, move it back to the s* shards?

Please correct me if I'm wrong, but if this is on x1, that means we can't join with the page table, right? Right now the API and special page queries are designed to do one batch query of page info and lint error info. We could probably split it into two queries in order to go onto x1.
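For illustration, one way the single join could be split for an x1 deployment (a sketch only; the category and page id values are made up):

-- 1) On x1: fetch a batch of lint errors and collect the page ids.
SELECT linter_id, linter_page, linter_cat, linter_params
FROM linter
WHERE linter_cat = 'obsolete-tag'
LIMIT 50;

-- 2) On the wiki's core shard: fetch page info for those page ids.
SELECT page_id, page_namespace, page_title
FROM page
WHERE page_id IN (10, 42, 97);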

That will also avoid any labs issue. I assume all data here is 100% public, and could be published with no problem (no user information on the blob, etc.)? Please confirm otherwise. This is the most critical issue.

Yes, all public data :)

On the previous comments by Volans: unsigned int is the standard size for our PKs, and we will probably keep it that way for now; we would increase all of them at the same time when necessary. He brings up a good question, though: should the foreign key here be a page or a revision, and how are writes going to be throttled to avoid 1000 new rows for each parse action?

I think the best way to do this will be to use the job queue. Instead of doing the writes when Parsoid tells MediaWiki there are new lint errors, we defer them to the job queue, which has a shared pipeline across requests, allows better throttling, and is already tested against massive insertions upon template updates.

should the foreign key here be a page or a revision

Missed this part. It should be a page, because the errors aren't tied to a revision. More precisely, the lint errors are based on the expanded page content, which includes templates and so on and isn't captured by the revision ID. The downside is that after an edit it may take a few minutes to clear the now-fixed lint errors, but I think that is acceptable.

I just added an auto-increment primary key

If you are going to write and delete multiple rows every time things are parsed, an auto-increment will likely overflow, and I'm not sure how useful it will be. Maybe page id + error number as a PK, which may also provide ordering?

This leads to an important warning: there have been, and still are, issues with operations that replace the same rows every time there is an edit (such as refreshLinks jobs). Make sure you sync with Performance for some tips on how to avoid issues: very small transactions (even one transaction per row deleted and per row added), locking and grouping of refreshes for those rows (only update for the latest edit), performing a mutex on the operation, not deleting existing rows just to replace them with the same info, etc.
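To illustrate the "very small transactions" suggestion, a minimal sketch (made-up ids and values; one statement per transaction instead of one big delete-and-reinsert):

BEGIN;
DELETE FROM linter WHERE linter_id = 101;
COMMIT;

BEGIN;
INSERT INTO linter (linter_page, linter_cat, linter_params)
VALUES (12345, 'missing-end-tag', '{}');
COMMIT;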

I would also recommend starting with a very small wiki and growing very slowly; there are always deployment issues for one reason or another, so make sure you go at the right pace.

I just added an auto-increment primary key

If you are going to write and delete multiple rows every time things are parsed, an auto-increment will likely overflow, and I'm not sure how useful it will be. Maybe page id + error number as a PK, which may also provide ordering?

No; DB writes only happen if there is a change, such as an error being fixed or a new one being added.

Ok, I think I've addressed all of the feedback, and would appreciate a re-review:

CREATE TABLE /*_*/linter (
	-- primary key
	linter_id int UNSIGNED PRIMARY KEY not null AUTO_INCREMENT,
	-- page id
	linter_page int UNSIGNED not null,
	-- error category (see CategoryManager::$categoryIds)
	linter_cat int UNSIGNED not null,
	-- start and end positions of where the error is located
	linter_start int UNSIGNED not null,
	linter_end int UNSIGNED not null,
	-- extra parameters about the error, JSON encoded
	linter_params blob NOT NULL
) /*$wgDBTableOptions*/;

-- Query by page
CREATE INDEX /*i*/linter_page ON /*_*/linter (linter_page);
-- Query by category
CREATE INDEX /*i*/linter_cat ON /*_*/linter (linter_cat);
CREATE UNIQUE INDEX /*i*/linter_cat_page_position ON /*_*/linter (linter_cat, linter_page, linter_start, linter_end);

Changes:

  • linter_cat is now an int, and the mapping from string <==> int is just done in PHP since it's pretty trivial. There's additionally an index on linter_cat so we can query by category.
  • linter_start and linter_end are no longer embedded in linter_params. There is a unique index to enforce the cat/page_id/start/end constraint (see the sketch after this list). Note that we don't ever do queries based on start/end; it's just for uniqueness.
  • Note that linter_id is still just an int. Should we change it to a bigint now on the assumption it might not be big enough? Or leave it as is?
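For illustration, one way the unique index backs the uniqueness guarantee (a sketch only; whether the code uses INSERT IGNORE or checks in PHP first is an implementation detail, and the values are made up):

-- With linter_cat_page_position in place, a second identical insert is
-- silently skipped instead of creating a duplicate row:
INSERT IGNORE INTO linter (linter_cat, linter_page, linter_start, linter_end, linter_params)
VALUES (3, 12345, 40, 52, '{"name":"big"}');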

Insertions now happen via the job queue, so we have a better ability to throttle and prevent floods/DoS.

Next steps I'd like to take if this is okay from your perspective: deploy to the beta cluster before the end of the year. I doubt there's enough data there for it to cause any issues, but I'll observe and tweak as necessary. Starting in January I'd like to start rolling it out (slowly!) to small production wikis, increasing/tweaking/etc. as necessary, working with DBAs/RelEng/etc. to make it successful.

CREATE INDEX /*i*/linter_cat ON /*_*/linter (linter_cat);
CREATE UNIQUE INDEX /*i*/linter_cat_page_position ON /*_*/linter (linter_cat, linter_page, linter_start, linter_end);

These two indexes are redundant; keep only the second one (the left-most prefix will give you the first for free). One automatic tool you can run that will catch these kinds of mistakes is pt-duplicate-key-checker (not on production, of course).

I will do a full review a bit later.

These two indexes are redundant; keep only the second one (the left-most prefix will give you the first for free). One automatic tool you can run that will catch these kinds of mistakes is pt-duplicate-key-checker (not on production, of course).

Thanks, I didn't know MySQL did that, and thanks for the tip about the tool. Fixed in https://gerrit.wikimedia.org/r/#/c/325508/.
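Presumably the fix amounts to dropping the single-column index and keeping only the compound one, whose left-most prefix (linter_cat) covers the query-by-category case (the linked Gerrit change is authoritative):

-- Query by category and enforce uniqueness with a single index:
CREATE UNIQUE INDEX /*i*/linter_cat_page_position ON /*_*/linter (linter_cat, linter_page, linter_start, linter_end);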

Assuming linter_params won't go crazy (it is really extra info of a reasonable size, and not hidden structured data), I do not see any reason to block this deployment to production. You may want a legal/security OK if you want to make this table available on labs.

This is resolved as far as I'm concerned.

I confirm for the Security team that it's fine to replicate this table to labs.

Legoktm claimed this task.

Thanks all!