Page MenuHomePhabricator

Write our anticipated "phase two" schemas and submit for review
Closed, ResolvedPublic

Description

It was suggested in T200297 that Scoring Platform produce concrete schemas and example queries, to help move the review forward. That's a great idea!

Also provide an example implementation of how the secondary table update hooks might work, and bring up any concerns.

The proposed secondary schema is ready for review:
https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/JADE/+/456078/

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

That looks really bad performance. Not only that scans the revision table from top to bottom (>200GB of data) making it slow, it is nondeterministically slow- it will be faster or slower depending on the parameters and existing data.

You can emulate it by running (don't run it, it doesn't finish on production and you may not be able to kill it):

root@db1089[enwiki]> select revision.rev_id, page.page_title from revision left join page     on page.page_title = concat('Diff/', revision.rev_id) where    page.page_namespace = 4 order by revision.rev_id desc limit 100;
^CCtrl-C -- query killed. Continuing normally.
ERROR 1317 (70100): Query execution was interrupted

Does this seem like a good alternative to maintaining a link table between rev_id and judgment_page.page_id?

Please create a suitable schema with simple queries. Queries that do more complex stuff than point selects using primary keys regarding the revision table will just not work on production, with very close to 1 billion rows. Please test your queries on the wikirreplicas to check they are suitable for production.

Please CC in the future @Marostegui and @mark.

Presumably, @awight meant to put the judgment_page.page_namespace = 810 in the ON clause, not the where clause (Otherwise, I assume he would have used an inner join, or a more obvious IS NOT NULL in the where). Using @jcrespo 's example, that would be roughly like:

select revision.rev_id, page.page_title
from revision
left join page on page.page_title = concat('Diff/', revision.rev_id) AND page.page_namespace = 4
order by revision.rev_id desc
limit 100;

which does not do a full table scan.

@Bawolff That new query you propose makes no sense to me- it just selects the first 100 revisions every single time(e.g. revision id 1 to 100, if they all existed). But sure, if they want to select that (ids and a bunch of NULLs), I don't see any problem with doing that one (although I would batch by rev id without using a limit for more deterministic speed), I am just not sure that is what they *really* want.

Change 461825 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] Drop page judgments for this release

https://gerrit.wikimedia.org/r/461825

Thanks for all the attention given to this, and apologies for thinking that the namespace condition would behave the same in the WHERE as in the ON. The heart of what I want to ask is about this condition, though:

page.page_title = concat('Diff/', revision.rev_id)

My instinct is to just use a simple, indexed join table and join directly on keys, but I've heard rumors (elsewhere) that the join on a calculated value is a reasonable approach. Does this sound right?

It's a little bit hard to understand the query in P7570 (for example do you mean page_title.judgment_page instead of judgment_page.page_title?) but I can suggest trying to select from jade tables (as they are smaller) and then join them with revision table specially since we are using the PK index. It's worth noting that from my basic knowledge and according to my bible, It should not matter whether you join jade with revision or the way around and the optimizer should understand and changes order of the join but in reality things might be different and it's better not to risk it.

Change 461825 merged by jenkins-bot:
[mediawiki/extensions/JADE@master] Drop page judgments for this release

https://gerrit.wikimedia.org/r/461825

Change 456078 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] Secondary indexes for JADE pages

https://gerrit.wikimedia.org/r/456078

Thanks for all the attention given to this, and apologies for thinking that the namespace condition would behave the same in the WHERE as in the ON. The heart of what I want to ask is about this condition, though:

page.page_title = concat('Diff/', revision.rev_id)

My instinct is to just use a simple, indexed join table and join directly on keys, but I've heard rumors (elsewhere) that the join on a calculated value is a reasonable approach. Does this sound right?

@awight would that still part of the query suggested at T202596#4600642?
As Jaime is, I am also too confused with that query, it doesn't make a full scan anymore, but it will always give the same results, could you clarify that?

As pointed out earlier, in order to try to play around with possible queries and see how fast or slow they run, using the wikireplicas can be a good idea, as they have pretty much all the data we have in production (some fields are redacted for privacy of course), but it could be a good way for you to speed up the process of "query reviewing" as you can try different approaches without waiting for the DBAs to check in production. You can play around with different JOINs or WHERE clauses and see the effect on the query time.

Thanks!

@Marostegui I'm not sure if this helps, but I'll try to better illustrate my question using a real-world example. Here's the main query from Special:RecentChanges, with a JADE join added in patch format.

1 SELECT /* SpecialRecentChanges::doMainQuery */
2 rc_id,
3 rc_timestamp,
4 rc_namespace,
5 rc_title,
6 rc_minor,
7 rc_bot,
8 rc_new,
9 rc_cur_id,
10 rc_this_oldid,
11 rc_last_oldid,
12 rc_type,
13 rc_source,
14 rc_patrolled,
15 rc_ip,
16 rc_old_len,
17 rc_new_len,
18 rc_deleted,
19 rc_logid,
20 rc_log_type,
21 rc_log_action,
22 rc_params,
23 rc_comment AS `rc_comment_text`,
24 NULL AS `rc_comment_data`,
25 NULL AS `rc_comment_cid`,
26 rc_user,
27 rc_user_text,
28 NULL AS `rc_actor`,
29 wl_user,
30 wl_notificationtimestamp,
31 page_latest,
32 (SELECT
33 GROUP_CONCAT(ct_tag SEPARATOR ',')
34 FROM `change_tag`
35 WHERE ct_rc_id=rc_id
36 ) AS `ts_tags`,
37+ jaded_judgment
38 FROM `recentchanges`
39 LEFT JOIN `watchlist`
40 ON (wl_user = '1'
41 AND (wl_title=rc_title)
42 AND (wl_namespace=rc_namespace))
43 LEFT JOIN `page`
44 ON ((rc_cur_id=page_id))
45+LEFT JOIN `jade_diff_judgment`
46+ ON rc_this_oldid = jaded_revision
47 WHERE
48 rc_bot = '0'
49 AND (rc_timestamp >= '20180924183425')
50 AND rc_new IN ('0','1')
51 ORDER BY
52 rc_timestamp DESC
53 LIMIT 50

At the beginning of the discussion here, I was thinking that we might be able to do the join using a calculated field, but I no longer feel like that's an alternative we should consider. It's already clear that it would require an additional join on the page table, which is expensive. Feel free to cast some light!

awight moved this task from Parked to Review on the Machine-Learning-Team (Active Tasks) board.
awight updated the task description. (Show Details)

Change 466804 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] Secondary schema for JADE indexes

https://gerrit.wikimedia.org/r/466804

Change 466806 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] Hooks to maintain judgment link tables

https://gerrit.wikimedia.org/r/466806

Change 466804 abandoned by Awight:
Secondary schema for JADE indexes

Reason:
redundant

https://gerrit.wikimedia.org/r/466804

Change 466808 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] Maintenance scripts for judgment indexes

https://gerrit.wikimedia.org/r/466808

@Marostegui These are the proposed indexes, if you want to discuss something concrete:
https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/JADE/+/456078/

One questionable decision worth highlighting is that I've split the link table into two tables for "Diff" and "Revision" entity types. This avoids polymorphism, in other words a column that always must be matched like jade_entity_type = 'Diff'. The different judgment types are queried in distinct use cases, so I don't see any benefit in having them stored in a single table for any "revision-like" judgment. Also note that we'll be supporting "Page" judgments in the near future, which target page_id and therefore would require a different table structure and indexes regardless. If this seems wrong, I'm happy to reconsider.

Change 468078 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/JADE@master] Link table model

https://gerrit.wikimedia.org/r/468078

Change 456078 merged by jenkins-bot:
[mediawiki/extensions/JADE@master] Secondary schema for JADE indexes

https://gerrit.wikimedia.org/r/456078

Change 468078 merged by jenkins-bot:
[mediawiki/extensions/JADE@master] Link table model

https://gerrit.wikimedia.org/r/468078

@Marostegui We've merged the DDL to our repo in order to unblock development, so here are the new coordinates for review:

https://phabricator.wikimedia.org/diffusion/EJAD/browse/master/sql/

Change 466806 merged by jenkins-bot:
[mediawiki/extensions/JADE@master] Hooks to maintain judgment link tables

https://gerrit.wikimedia.org/r/466806

Change 466808 merged by jenkins-bot:
[mediawiki/extensions/JADE@master] Maintenance scripts for judgment indexes

https://gerrit.wikimedia.org/r/466808