⚓ T202596 Write our anticipated "phase two" schemas and submit for review

Subject	Repo	Branch	Lines +/-
Maintenance scripts for judgment indexes	mediawiki/extensions/JADE	master	+442 -0
Secondary schema for JADE indexes	mediawiki/extensions/JADE	master	+63 -0
Hooks to maintain judgment link tables	mediawiki/extensions/JADE	master	+425 -3
Link table model	mediawiki/extensions/JADE	master	+451 -0
Secondary schema for JADE indexes	mediawiki/extensions/JADE	master	+63 -0
Drop page judgments for this release	mediawiki/extensions/JADE	master	+0 -63

Status	Assigned	Task
Declined	None	T212435 Review real-world query plans and performance for Jade
Declined	None	T238877 Write Huggle labels to Jade
Declined	calbon	T183381 Deploy pilot of Jade to a small set of wikis.
Resolved	awight	T196547 [Epic] Extension:JADE scalability concerns
Resolved	awight	T202596 Write our anticipated "phase two" schemas and submit for review

In T202596#4600586, @jcrespo wrote:
That looks like really bad performance. Not only does it scan the revision table from top to bottom (>200GB of data), making it slow, it is also nondeterministically slow: it will be faster or slower depending on the parameters and existing data.

You can emulate it by running (don't run it, it doesn't finish on production and you may not be able to kill it):
root@db1089[enwiki]> select revision.rev_id, page.page_title from revision left join page     on page.page_title = concat('Diff/', revision.rev_id) where    page.page_namespace = 4 order by revision.rev_id desc limit 100;
^CCtrl-C -- query killed. Continuing normally.
ERROR 1317 (70100): Query execution was interrupted
Does this seem like a good alternative to maintaining a link table between rev_id and judgment_page.page_id?

Please create a suitable schema with simple queries. Queries on the revision table that do anything more complex than point selects using primary keys will just not work on production, with very close to 1 billion rows. Please test your queries on the wikireplicas to check they are suitable for production.

In the future, please CC @Marostegui and @mark.

Presumably, @awight meant to put the judgment_page.page_namespace = 810 condition in the ON clause, not the WHERE clause (otherwise, I assume he would have used an inner join, or a more obvious IS NOT NULL in the WHERE). Using @jcrespo's example, that would be roughly like:

select revision.rev_id, page.page_title
from revision
left join page on page.page_title = concat('Diff/', revision.rev_id) AND page.page_namespace = 4
order by revision.rev_id desc
limit 100;

which does not do a full table scan.
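To make the WHERE-versus-ON difference concrete, here is a minimal sketch using Python's built-in sqlite3 module on toy data. The three-row tables are invented for illustration, and SQLite's `||` stands in for MySQL's `concat()`; this only shows which rows come back and says nothing about production query plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY);
    CREATE TABLE page (
        page_id INTEGER PRIMARY KEY,
        page_namespace INTEGER NOT NULL,
        page_title TEXT NOT NULL
    );
    INSERT INTO revision (rev_id) VALUES (1), (2), (3);
    -- Toy data: only revision 2 has a matching 'Diff/<rev_id>' page.
    INSERT INTO page (page_id, page_namespace, page_title)
        VALUES (10, 4, 'Diff/2');
""")

# Filter in WHERE: unmatched rows have page_namespace = NULL, fail the
# filter, and are dropped, so the LEFT JOIN acts like an INNER JOIN.
where_rows = conn.execute("""
    SELECT revision.rev_id, page.page_title
    FROM revision
    LEFT JOIN page ON page.page_title = 'Diff/' || revision.rev_id
    WHERE page.page_namespace = 4
    ORDER BY revision.rev_id DESC
""").fetchall()

# Filter in ON: every revision row is kept; unmatched ones get NULL.
on_rows = conn.execute("""
    SELECT revision.rev_id, page.page_title
    FROM revision
    LEFT JOIN page ON page.page_title = 'Diff/' || revision.rev_id
                  AND page.page_namespace = 4
    ORDER BY revision.rev_id DESC
""").fetchall()

print(where_rows)  # [(2, 'Diff/2')]
print(on_rows)     # [(3, None), (2, 'Diff/2'), (1, None)]
```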

@Bawolff That new query you propose makes no sense to me: it just selects the first 100 revisions every single time (e.g. revision id 1 to 100, if they all existed). But sure, if they want to select that (ids and a bunch of NULLs), I don't see any problem with doing that one (although I would batch by rev id without using a limit for more deterministic speed). I am just not sure that is what they *really* want.
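One possible reading of "batch by rev id without using a limit" is to walk fixed-width rev_id ranges, so that each query touches a bounded slice of the primary key. The sketch below assumes that reading; the helper name `iter_revision_batches`, the batch size, and the toy data are hypothetical, and it runs on SQLite rather than MySQL.

```python
import sqlite3
from typing import Iterator, List, Optional, Tuple

Row = Tuple[int, Optional[str]]


def iter_revision_batches(
    conn: sqlite3.Connection, max_rev_id: int, batch_size: int
) -> Iterator[List[Row]]:
    """Yield rows for descending, fixed-width rev_id ranges (hypothetical helper)."""
    upper = max_rev_id
    while upper > 0:
        lower = max(upper - batch_size + 1, 1)
        # Bounded primary-key range instead of ORDER BY ... LIMIT.
        yield conn.execute(
            """
            SELECT revision.rev_id, page.page_title
            FROM revision
            LEFT JOIN page
              ON page.page_namespace = 4
             AND page.page_title = 'Diff/' || revision.rev_id
            WHERE revision.rev_id BETWEEN ? AND ?
            ORDER BY revision.rev_id DESC
            """,
            (lower, upper),
        ).fetchall()
        upper = lower - 1


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY);
    CREATE TABLE page (
        page_id INTEGER PRIMARY KEY,
        page_namespace INTEGER NOT NULL,
        page_title TEXT NOT NULL
    );
    -- Toy data: revision ids 1..10 with id 4 missing (e.g. deleted).
    INSERT INTO revision (rev_id)
        VALUES (1), (2), (3), (5), (6), (7), (8), (9), (10);
    INSERT INTO page (page_id, page_namespace, page_title)
        VALUES (20, 4, 'Diff/7');
""")

batches = list(iter_revision_batches(conn, max_rev_id=10, batch_size=4))
print([len(batch) for batch in batches])  # [4, 3, 2]
```

A batch can return fewer rows than its width when ids are missing, but the amount of primary key range examined per query stays fixed.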

Change 461825 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] Drop page judgments for this release

https://gerrit.wikimedia.org/r/461825

gerritbot added a project: Patch-For-Review.Sep 20 2018, 11:39 PM

awight mentioned this in rEJADbcce59437a7a: Drop page judgments for this release.Sep 20 2018, 11:40 PM

Thanks for all the attention given to this, and apologies for thinking that the namespace condition would behave the same in the WHERE as in the ON. The heart of what I want to ask is about this condition, though:

page.page_title = concat('Diff/', revision.rev_id)

My instinct is to just use a simple, indexed join table and join directly on keys, but I've heard rumors (elsewhere) that the join on a calculated value is a reasonable approach. Does this sound right?

It's a little hard to understand the query in P7570 (for example, do you mean page_title.judgment_page instead of judgment_page.page_title?), but I can suggest selecting from the jade tables first (as they are smaller) and then joining them with the revision table, especially since we would be using the PK index. It's worth noting that, from my basic knowledge and according to my bible, it should not matter whether you join jade with revision or the other way around: the optimizer should understand and change the order of the join. In reality things might be different, though, and it's better not to risk it.
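The "start from the jade table, then join revision on its primary key" suggestion could look roughly like the sketch below. The table and column names jade_diff_judgment, jaded_revision, and jaded_judgment are borrowed from the query pasted later in this task; the column types, the toy rows, and the use of SQLite are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revision (
        rev_id INTEGER PRIMARY KEY,
        rev_timestamp TEXT NOT NULL
    );
    -- Assumed shape: one row per judged revision, pointing at the
    -- page_id of its judgment page.
    CREATE TABLE jade_diff_judgment (
        jaded_revision INTEGER PRIMARY KEY,
        jaded_judgment INTEGER NOT NULL
    );
    INSERT INTO revision (rev_id, rev_timestamp) VALUES
        (1, '20180924000001'),
        (2, '20180924000002'),
        (3, '20180924000003'),
        (4, '20180924000004'),
        (5, '20180924000005');
    INSERT INTO jade_diff_judgment (jaded_revision, jaded_judgment)
        VALUES (2, 100), (5, 101);
""")

# The small jade table is listed first; each of its rows is then
# resolved with a point lookup on revision's primary key.
judged = conn.execute("""
    SELECT j.jaded_revision, j.jaded_judgment, r.rev_timestamp
    FROM jade_diff_judgment AS j
    JOIN revision AS r ON r.rev_id = j.jaded_revision
    ORDER BY j.jaded_revision
""").fetchall()

print(judged)
```

Listing the smaller table first documents the intent, but it does not by itself force the join order in either SQLite or MySQL, so the real plan would still need checking with EXPLAIN on the wikireplicas.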

Change 461825 merged by jenkins-bot:
[mediawiki/extensions/JADE@master] Drop page judgments for this release

https://gerrit.wikimedia.org/r/461825

ReleaseTaggerBot added a project: MW-1.32-notes (WMF-deploy-2018-09-25 (1.32.0-wmf.23)).Sep 24 2018, 5:00 PM

Change 456078 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] Secondary indexes for JADE pages

https://gerrit.wikimedia.org/r/456078

awight mentioned this in rEJAD5aa947e5b76e: Secondary indexes for JADE pages.Sep 27 2018, 8:01 AM

awight mentioned this in rEJAD4b6984bfdd93: Secondary indexes for JADE pages.Sep 27 2018, 8:21 AM

In T202596#4604545, @awight wrote:
Thanks for all the attention given to this, and apologies for thinking that the namespace condition would behave the same in the WHERE as in the ON. The heart of what I want to ask is about this condition, though:
page.page_title = concat('Diff/', revision.rev_id)
My instinct is to just use a simple, indexed join table and join directly on keys, but I've heard rumors (elsewhere) that the join on a calculated value is a reasonable approach. Does this sound right?

@awight would that still be part of the query suggested at T202596#4600642?
Like Jaime, I am also confused by that query: it doesn't do a full scan anymore, but it will always give the same results. Could you clarify that?

As pointed out earlier, the wikireplicas are a good place to play around with possible queries and see how fast or slow they run, as they have pretty much all the data we have in production (some fields are redacted for privacy, of course). That could speed up the process of "query reviewing", since you can try different approaches without waiting for the DBAs to check in production. You can play around with different JOINs or WHERE clauses and see the effect on the query time.

Thanks!

awight mentioned this in rEJAD5c8aaafe99c3: Secondary indexes for JADE pages.Sep 27 2018, 2:46 PM

@Marostegui I'm not sure if this helps, but I'll try to better illustrate my question using a real-world example. Here's the main query from Special:RecentChanges, with a JADE join added in patch format.

P7609 (An Untitled Masterwork)

SELECT /* SpecialRecentChanges::doMainQuery */
rc_id, rc_timestamp, rc_namespace, rc_title, rc_minor, rc_bot, rc_new, rc_cur_id,
rc_this_oldid, rc_last_oldid, rc_type, rc_source, rc_patrolled, rc_ip, rc_old_len,
rc_new_len, rc_deleted, rc_logid, rc_log_type, rc_log_action, rc_params,
rc_comment AS `rc_comment_text`, NULL AS `rc_comment_data`, NULL AS `rc_comment_cid`,
rc_user, rc_user_text, NULL AS `rc_actor`,
wl_user, wl_notificationtimestamp, page_latest,
(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_rc_id=rc_id ) AS `ts_tags`,
+ jaded_judgment
FROM `recentchanges`
LEFT JOIN `watchlist` ON (wl_user = '1' AND (wl_title=rc_title) AND (wl_namespace=rc_namespace))
LEFT JOIN `page` ON ((rc_cur_id=page_id))
+LEFT JOIN `jade_diff_judgment`
+ ON rc_this_oldid = jaded_revision
WHERE rc_bot = '0' AND (rc_timestamp >= '20180924183425') AND rc_new IN ('0','1')
ORDER BY rc_timestamp DESC
LIMIT 50

At the beginning of the discussion here, I was thinking that we might be able to do the join using a calculated field, but I no longer feel like that's an alternative we should consider. It's already clear that it would require an additional join on the page table, which is expensive. Feel free to cast some light!
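For comparison, here is a toy sketch of the two options side by side: the link-table join from the patch above, and the calculated-title join that has to go through the page table. It assumes jaded_judgment holds the page_id of the judgment page and that judgment pages live in namespace 810 with titles like 'Diff/<rev_id>'; the column types and rows are invented, and SQLite's `||` replaces MySQL's `concat()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE recentchanges (
        rc_id INTEGER PRIMARY KEY,
        rc_timestamp TEXT NOT NULL,
        rc_this_oldid INTEGER NOT NULL
    );
    CREATE TABLE page (
        page_id INTEGER PRIMARY KEY,
        page_namespace INTEGER NOT NULL,
        page_title TEXT NOT NULL,
        UNIQUE (page_namespace, page_title)
    );
    -- Assumed link table: revision id -> judgment page id.
    CREATE TABLE jade_diff_judgment (
        jaded_revision INTEGER PRIMARY KEY,
        jaded_judgment INTEGER NOT NULL
    );
    INSERT INTO recentchanges (rc_id, rc_timestamp, rc_this_oldid) VALUES
        (1, '20180924183500', 501),
        (2, '20180924183600', 502),
        (3, '20180924183700', 503);
    INSERT INTO page (page_id, page_namespace, page_title)
        VALUES (900, 810, 'Diff/502');
    INSERT INTO jade_diff_judgment (jaded_revision, jaded_judgment)
        VALUES (502, 900);
""")

# Option 1: join directly on integer keys via the link table.
via_link_table = conn.execute("""
    SELECT rc_id, jaded_judgment
    FROM recentchanges
    LEFT JOIN jade_diff_judgment ON rc_this_oldid = jaded_revision
    ORDER BY rc_timestamp DESC
    LIMIT 50
""").fetchall()

# Option 2: a second join on page, matching a title built per row.
via_computed_title = conn.execute("""
    SELECT rc_id, judgment_page.page_id
    FROM recentchanges
    LEFT JOIN page AS judgment_page
      ON judgment_page.page_namespace = 810
     AND judgment_page.page_title = 'Diff/' || rc_this_oldid
    ORDER BY rc_timestamp DESC
    LIMIT 50
""").fetchall()

print(via_link_table)      # [(3, None), (2, 900), (1, None)]
print(via_computed_title)  # [(3, None), (2, 900), (1, None)]
```

Both return the same rows on this toy data; the difference that matters for review is the extra join against the page table in the second form.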

awight mentioned this in rEJADdbeae999d2fb: Secondary indexes for JADE pages.Oct 2 2018, 9:07 PM

awight mentioned this in rEJADf7ca0b6bcb37: Secondary indexes for JADE pages.Oct 2 2018, 9:11 PM

awight mentioned this in rEJAD5563e1d8d373: Secondary indexes for JADE pages.Oct 3 2018, 7:51 PM

awight mentioned this in rEJAD89022b57b025: Secondary indexes for JADE pages.Oct 3 2018, 8:44 PM

awight mentioned this in rEJAD28ccc1e916a6: Secondary indexes for JADE pages.Oct 3 2018, 9:46 PM

awight mentioned this in rEJAD286481e486d2: Secondary indexes for JADE pages.Oct 3 2018, 10:39 PM

awight mentioned this in rEJAD511bd98de47c: Secondary indexes for JADE pages.Oct 3 2018, 10:54 PM

awight mentioned this in rEJADe2b7dcb1d8f7: Secondary indexes for JADE pages.Oct 3 2018, 11:35 PM

awight moved this task from In Progress to Review on the Jade board.Oct 4 2018, 7:42 PM

awight moved this task from Parked to Review on the Machine-Learning-Team (Active Tasks) board.

awight updated the task description.

Change 466804 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] Secondary schema for JADE indexes

https://gerrit.wikimedia.org/r/466804

Change 466806 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] Hooks to maintain judgment link tables

https://gerrit.wikimedia.org/r/466806

Change 466804 abandoned by Awight:
Secondary schema for JADE indexes

Reason:
redundant

https://gerrit.wikimedia.org/r/466804

Change 466808 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] Maintenance scripts for judgment indexes

https://gerrit.wikimedia.org/r/466808

awight mentioned this in rEJAD839fe7e5fb67: Secondary schema for JADE indexes.Oct 11 2018, 11:06 PM

awight mentioned this in rEJADb0e25e519dc8: Maintenance scripts for judgment indexes.

awight mentioned this in rEJAD95fe31d6536d: Secondary schema for JADE indexes.

awight mentioned this in rEJAD0246c5e554fb: Hooks to maintain judgment link tables.

awight mentioned this in rEJADa3b5443e4e12: Maintenance scripts for judgment indexes.

awight mentioned this in rEJADc29e769d16b2: Hooks to maintain judgment link tables.

awight mentioned this in rEJADd7e2e743bf54: Hooks to maintain judgment link tables.

awight mentioned this in rEJAD30067413418a: Maintenance scripts for judgment indexes.

@Marostegui These are the proposed indexes, if you want to discuss something concrete:
https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/JADE/+/456078/

One questionable decision worth highlighting is that I've split the link table into two tables for "Diff" and "Revision" entity types. This avoids polymorphism, in other words a column that always must be matched like jade_entity_type = 'Diff'. The different judgment types are queried in distinct use cases, so I don't see any benefit in having them stored in a single table for any "revision-like" judgment. Also note that we'll be supporting "Page" judgments in the near future, which target page_id and therefore would require a different table structure and indexes regardless. If this seems wrong, I'm happy to reconsider.
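A rough sketch of the trade-off, not the schema in the Gerrit change: the split design gives each entity type its own table, while the polymorphic alternative needs a type column matched in every query. Only jade_diff_judgment, jaded_revision, jaded_judgment, and the jade_entity_type = 'Diff' condition come from this thread; every other table and column name below is hypothetical, as are the types and rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Split design: one narrow table per entity type.
    CREATE TABLE jade_diff_judgment (
        jaded_revision INTEGER PRIMARY KEY,
        jaded_judgment INTEGER NOT NULL
    );
    CREATE TABLE jade_revision_judgment (   -- hypothetical name
        jader_revision INTEGER PRIMARY KEY, -- hypothetical column
        jader_judgment INTEGER NOT NULL     -- hypothetical column
    );

    -- Polymorphic alternative: a single table keyed by (type, id).
    CREATE TABLE jade_entity_judgment (     -- hypothetical name
        jade_entity_type TEXT NOT NULL,
        jade_entity_id INTEGER NOT NULL,    -- hypothetical column
        jade_judgment INTEGER NOT NULL,     -- hypothetical column
        PRIMARY KEY (jade_entity_type, jade_entity_id)
    );

    INSERT INTO jade_diff_judgment VALUES (502, 900);
    INSERT INTO jade_revision_judgment VALUES (502, 901);
    INSERT INTO jade_entity_judgment VALUES
        ('Diff', 502, 900),
        ('Revision', 502, 901);
""")

# Split design: a point select on the primary key, no type filter.
split_result = conn.execute(
    "SELECT jaded_judgment FROM jade_diff_judgment WHERE jaded_revision = ?",
    (502,),
).fetchall()

# Polymorphic design: the type must be matched on every lookup.
polymorphic_result = conn.execute(
    """
    SELECT jade_judgment
    FROM jade_entity_judgment
    WHERE jade_entity_type = 'Diff' AND jade_entity_id = ?
    """,
    (502,),
).fetchall()

print(split_result, polymorphic_result)  # [(900,)] [(900,)]
```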

awight mentioned this in rEJADba7ba2eef59b: Maintenance scripts for judgment indexes.Oct 11 2018, 11:36 PM

awight mentioned this in rEJAD967d787874a4: Hooks to maintain judgment link tables.

awight mentioned this in rEJAD849e35588d6a: Hooks to maintain judgment link tables.Oct 12 2018, 1:25 AM

awight mentioned this in rEJAD996133029148: Maintenance scripts for judgment indexes.

awight mentioned this in rEJAD42c4bf8f8ccb: Secondary schema for JADE indexes.Oct 16 2018, 3:45 PM

awight mentioned this in rEJADa30129d5f422: Hooks to maintain judgment link tables.

awight mentioned this in rEJAD9495645df468: Maintenance scripts for judgment indexes.

awight mentioned this in rEJADb26390283a24: Secondary schema for JADE indexes.Oct 16 2018, 6:47 PM

awight mentioned this in rEJAD87c04a1a399e: Hooks to maintain judgment link tables.

awight mentioned this in rEJAD70bc20b060ac: Maintenance scripts for judgment indexes.

awight mentioned this in rEJAD6366391bd1c8: Maintenance scripts for judgment indexes.Oct 16 2018, 6:50 PM

awight mentioned this in rEJAD54952ebc6fdc: Secondary schema for JADE indexes.Oct 16 2018, 7:35 PM