Page MenuHomePhabricator

New Replica returns different data than old Replica for query using `change_tag` table
Closed, ResolvedPublic

Description

I have some tests of Wiki Education Dashboard that run against this code on wikiedudashboard.toolforge.org: https://github.com/WikiEducationFoundation/WikiEduDashboardTools/blob/master/public_html/revisions.php#L23

Here's a test case...
New database: https://wikiedudashboard.toolforge.org/v8/revisions.php?db=enwiki&usernames[]=Petra+Sen&oauth_tags[]=OAuth+CID%3A+252&oauth_tags[]=OAuth+CID%3A+212&oauth_tags[]=OAuth+CID%3A+542&start=20160920003430&end=20160922003430
Old database: https://wikiedudashboard.toolforge.org/revisions.php?db=enwiki&usernames[]=Petra+Sen&oauth_tags[]=OAuth+CID%3A+252&oauth_tags[]=OAuth+CID%3A+212&oauth_tags[]=OAuth+CID%3A+542&start=20160920003430&end=20160922003430

I have a copy of the query code with the changed server naming in /v8 so everything is identical between those two queries except for the switch to new replicas. The new replicas return 'system: false' for the first revision (ie, not an edit made automatically by the Dashboard) while the second correctly returns 'system; true' for that first revision.

Event Timeline

Here's a query:

SELECT g.page_id, g.page_title, g.page_namespace, c.rev_id, c.rev_timestamp, a.actor_name AS rev_user_text, a.actor_user AS rev_user, case when ct.ct_tag_id IS NULL then 'false' else 'true' end as system, case when c.rev_parent_id = 0 then 'true' else 'false' end as new_article, CAST(c.rev_len AS SIGNED) - CAST(IFNULL(p.rev_len, 0) AS SIGNED) AS byte_change FROM revision_userindex c JOIN actor a ON a.actor_id = c.rev_actor LEFT JOIN revision_userindex p ON p.rev_id = c.rev_parent_id INNER JOIN page g ON g.page_id = c.rev_page LEFT JOIN change_tag_def ctd ON ctd.ctd_name IN ('OAuth CID: 252','OAuth CID: 212','OAuth CID: 542') LEFT JOIN change_tag ct ON ct.ct_rev_id = c.rev_id AND ct.ct_tag_id = ctd.ctd_id WHERE g.page_namespace IN (0,1,2,3,4,5,10,11,118,119) AND a.actor_name IN ('Petra Sen') AND c.rev_timestamp BETWEEN '20160920003430' AND '20160922003430'

I made https://public.paws.wmcloud.org/User:BDavis_(WMF)/T278983%20investigation.ipynb to show this diff easily.

0 {('system', 'true'), ('system', 'false')}
1 {('system', 'true'), ('system', 'false')}
2 {('system', 'true'), ('system', 'false')}
3 set()
4 set()
5 set()
6 {('system', 'true'), ('system', 'false')}
7 {('system', 'true'), ('system', 'false')}
8 {('system', 'true'), ('system', 'false')}

Both databases return the same 9 rows, but they differ in 6 of the 9 based on the case when ct.ct_tag_id IS NULL then 'false' else 'true' end as system, computation. ct here is the change_tag table which is reached via 2 LEFT JOIN operations.

Another notebook: https://public.paws.wmcloud.org/User:JHernandez_(WMF)/Comparing%20replicas%20queries%20for%20T278983.ipynb

I've added ORDER BY g.page_id, ct.ct_tag_id ASC; to the query to try to compare them visually, and instead of the case when ct.ct_tag_id just select ct.ct_tag_id.

From what I can tell the results are the same, they were just sorted differently in the two responses until the ORDER BYs were added.

There is something strange in the query design, because it is returning the same result 3 times with a different system value, but from what is selected we can't know what it was joining with.

@Ragesoss Is your code relying on the order of the returned queries? The test may be failing because the order of the results is differently. I think the two clusters are running a slightly different MariaDB version so that may be affecting the results ordering.


Edit: I've added diffing to the sorted query like bd808 did in his notebook and the results are definitely the same

No, my code doesn't rely on the order of the results.

From what I can tell the results are the same, they were just sorted differently in the two responses until the ORDER BYs were added.

Agreed. I updated my notebook to add an ORDER BY g.page_id to the query and that removes the difference between the two sets of data. Both queries are returning the same 9 rows, but the order the rows are returned in differs between the two clusters. Since the SQL query does not apply any ordering to the results, this can be easily explained by different MariaDB versions in the two clusters and potentially different ordering of rows in the on-disk data files.

Further clarification: the query returns a row for each change_tag_def.ctd_name value checked. In the particular example query, 3 different change tags are examined: 'OAuth CID: 252', 'OAuth CID: 212', and 'OAuth CID: 542'. The chosen user (Petra Sen) has 3 edits in the chosen namespaces during the chosen time range. All three of these edits were associated with the 'OAuth CID: 542' application; none of the edits was associated with the other 2 OAuth applications.

Oh, I see! I never noticed before that this example query was returning duplicates of the same revision; the intent was to return just one entry for a given revision, and have it report system: true if it was from any of the specified change tags. I'll update my test accordingly. Wiki Education Dashboard has been fetching duplicate revisions for a while now, which might explain why the system indication has been incosistent... it depended on which copy of the revision got written to the database first. (Programs & Events Dashboard just has one change tag specified in its config, so this won't have made an operational difference there.)

Ragesoss claimed this task.

Thanks for getting to the bottom of this!