Page MenuHomePhabricator

API: Allow filtering of collections
Closed, ResolvedPublic8 Story Points

Description

We need to be able to do an API request that omits collections that have < 2 collections. The API doesn't support this.
The output would be similar to: http://en.m.wikipedia.beta.wmflabs.org/wiki/Special:GatherLists

Deploying on Vagrant: run mwscript update.php --wiki=mobilewiki --quick (or just vagrant git-update if you prefer things that way) then mwscript extensions/Gather/maintenance/updateCounts.php --wiki=mobilewiki

Event Timeline

Jdlrobson assigned this task to Yurik.
Jdlrobson raised the priority of this task from to Needs Triage.
Jdlrobson updated the task description. (Show Details)
Jdlrobson added a project: Gather.
Jdlrobson added a subscriber: Jdlrobson.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 23 2015, 6:13 PM
Yurik added a subscriber: Yurik.Apr 23 2015, 7:03 PM

This might be problematic perf wise. We should store the number of items in
the list as a separate field in the database, and update it every time an
item is added or removed. This could create discrepancies. Also, watchlist
might need a new hook to handle whenever a page is watched or removed. Once
that is done, we can filter based on item count. Without this, we would
have to do a complex group by+having query, which is very expensive
Jdlrobson added a blocked task: T96222: Add a new public collection feed
view.

TASK DETAIL

https://phabricator.wikimedia.org/T97061

REPLY HANDLER ACTIONS

Reply to comment or attach files, or !close, !claim, !unsubscribe or

!assign <username>.

EMAIL PREFERENCES

https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Yurik, Jdlrobson
Cc: Aklapper, Jdlrobson

Jdlrobson moved this task from Needs triage to Must haves on the Gather board.Apr 27 2015, 4:30 PM
Tgr added a subscriber: Tgr.Apr 29 2015, 3:52 PM

A possibly more efficient implementation (but it only works for lists of 2 or more items, not N or more) is SELECT gli_id FROM gather_list_item gli WHERE EXISTS (SELECT 1 FROM gather_list_item gli2 WHERE gli.gli_gl_id = gli2.gli_gl_id AND NOT (gli.gli_namespace = gli2.gli_namespace AND gli.gl_title = gli2.gl_title). That can in theory be resolved with a sequential walk through a single (gli_gl_id, gli_namespace, gli_title) index; no idea how the MySQL optimizer treats it in practice though.

@aaron any ideas about this?

Yurik added a comment.Apr 29 2015, 5:10 PM

@Tgr, you mean

SELECT gli_gl_id FROM gather_list_item gli WHERE
EXISTS (SELECT 1 FROM gather_list_item gli2
   WHERE gli.gli_gl_id = gli2.gli_gl_id AND NOT (gli.gli_namespace = gli2.gli_namespace AND gli.gli_title = gli2.gli_title));

vagrant ssh + mysql is your friend :)

explain of that query returns:

+----+--------------------+-------+-------+---------------+--------------+---------+--------------------+------+--------------------------+
| id | select_type        | table | type  | possible_keys | key          | key_len | ref                | rows | Extra                    |
+----+--------------------+-------+-------+---------------+--------------+---------+--------------------+------+--------------------------+
|  1 | PRIMARY            | gli   | index | NULL          | gli_id_order | 8       | NULL               |    2 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | gli2  | ref   | gli_id_order  | gli_id_order | 4       | wiki.gli.gli_gl_id |    1 | Using where; Using index |
+----+--------------------+-------+-------+---------------+--------------+---------+--------------------+------+--------------------------+

Now, the big question is how you want to filter that list -- e.g. show it all for a given user:

SELECT * FROM gather_list gl WHERE gl.gl_user = 1 AND
EXISTS (SELECT 1 FROM gather_list_item gli WHERE
gli.gli_gl_id = gl.gl_id AND EXISTS (SELECT 1 FROM gather_list_item gli2
   WHERE gli.gli_gl_id = gli2.gli_gl_id AND NOT (gli.gli_namespace = gli2.gli_namespace AND gli.gli_title = gli2.gli_title)));

Which shows this execution info:

+----+--------------------+-------+------+---------------+---------------+---------+--------------------+------+--------------------------+
| id | select_type        | table | type | possible_keys | key           | key_len | ref                | rows | Extra                    |
+----+--------------------+-------+------+---------------+---------------+---------+--------------------+------+--------------------------+
|  1 | PRIMARY            | gl    | ref  | gl_user_label | gl_user_label | 4       | const              |    2 | Using where              |
|  2 | DEPENDENT SUBQUERY | gli   | ref  | gli_id_order  | gli_id_order  | 4       | wiki.gl.gl_id      |    1 | Using where; Using index |
|  3 | DEPENDENT SUBQUERY | gli2  | ref  | gli_id_order  | gli_id_order  | 4       | wiki.gli.gli_gl_id |    1 | Using where; Using index |
+----+--------------------+-------+------+---------------+---------------+---------+--------------------+------+--------------------------+

The info seems ok, but would be good if a DBA looks at it.

Tgr added a comment.Apr 30 2015, 3:44 PM

You can do with primary key lookups and a single dependent subquery for pretty much any filtering as far as I can see, e.g.

SELECT * FROM gather_list gl JOIN gather_list_item gli ON gli.gli_gl_id = gl.gl_id
WHERE gl.gl_user = 1 AND EXISTS (
  SELECT 1 FROM gather_list_item gli2 
  WHERE gli.gli_gl_id = gli2.gli_gl_id AND NOT (gli.gli_namespace = gli2.gli_namespace AND gli.gli_title = gli2.gli_title)
);

That's still not telling much about performance, though. Dependent subqueries with a bunch of a=b conditions are supposed to be efficient; the optimizer even rewrites other subqueries to EXISTS whenever possible. But I have no clue about the performance of EXISTS subqueries with negated equality conditions.

Could we run some performance/stress tests for comparison on some sample data?

Tgr claimed this task.Apr 30 2015, 4:56 PM
Tgr added a comment.Apr 30 2015, 5:19 PM

I will rather move on with Yuri's suggestion from T97061#1232118. The above query is too rigid - it might work when filtering for >=2 items but would be completely useless when filtering for >=4 or ordering by item count, while caching the counts in a dedicated DB field will work in each case.

Jdlrobson moved this task from Must haves to In sprint on the Gather board.Apr 30 2015, 5:33 PM

@Tgr fine with me if you think this is the way forward.
I think even if there are edge cases with watchlist this is not a problem (since watchlist's are private and won't show in view)

Jdlrobson edited a custom field.May 8 2015, 10:01 PM
Jdlrobson edited a custom field.May 11 2015, 6:06 PM
Jdlrobson moved this task from In Analysis to Ready for dev on the Gather Sprint Help! board.

Change 210884 had a related patch set uploaded (by Gergő Tisza):
[WIP] Track collection item counts

https://gerrit.wikimedia.org/r/210884

Change 210885 had a related patch set uploaded (by Gergő Tisza):
[WIP] Filter lists by item count

https://gerrit.wikimedia.org/r/210885

Tgr added a comment.May 14 2015, 11:49 AM

WIP version (no tests) up in case you want to build on it / to review the syntax. I went with minitems=2 instead of filter=something in the end because it is more generic, but that's easy to change if you prefer otherwise.

@Tgr says this is almost done par tests. He's going to finish them up today.

Tgr added a comment.May 21 2015, 5:12 PM

Sorry, I misunderstood you (connection was poor). This *is* done, apart from tests (which are blocked on T96904). I was talking about T97704.

@Tgr seems like you didn't misunderstand :) I mean't 'except tests'. Are you writing tests today?

Tgr added a comment.May 21 2015, 8:39 PM

Are you writing tests today?

No, I wanted to get a prototype-ish version of T97704 done first. I'll add tests next.

Change 210884 merged by jenkins-bot:
Track collection item counts

https://gerrit.wikimedia.org/r/210884

Change 210885 merged by jenkins-bot:
Filter lists by item count

https://gerrit.wikimedia.org/r/210885

Should have been tagged as a schema change, duh.

Tgr updated the task description. (Show Details)Jun 4 2015, 5:44 PM
Tgr added a comment.Jun 4 2015, 5:47 PM

Added deploy instructions.

If Gather gets used more widely, this will probably need an index on gl_item_count. As long as we only have a few thousand lists, it's no big deal though.

@Tgr who is responsible for making sure those instructions get followed during deploy?

Tgr added a comment.Jun 4 2015, 6:53 PM

@Tgr who is responsible for making sure those instructions get followed during deploy?

The person doing the deploy, I suppose? Although if that's unwanted, they could be done at any time before/after the deploy - the SQL patch could be run right now, the update could be run days after the deploy. (Of course the result of queries with the minitems parameter would be wrong until that time.)

At any rate we should either keep this task open or have a new one to track this - DB patches are not applied automatically on production, so if forget to run the SQL patch before this code goes live, there will be serious breakage.

Beta I will figure out tomorrow.

Tgr added a comment.Jun 4 2015, 6:56 PM
In T97061#1338432, @Tgr wrote:

Although if that's unwanted, they could be done at any time before/after the deploy - the SQL patch could be run right now

No reason not to do that, I guess; I'll run it.

Tgr updated the task description. (Show Details)Jun 4 2015, 11:04 PM