Page MenuHomePhabricator

Remove reference to fields replaced by the actor table from WMCS views
Closed, ResolvedPublic

Description

T188327: Deploy refactored actor storage is at the point where we can usefully begin planning the removal of the xx_user and xx_user_text fields that are being replaced by the actor table, as we're beginning to stop writing them in production in preparation for dropping them. Before we can drop them in WMF production, we'll have to make the following changes to the WMCS views.

The fields that need changing are:

  • archive: ar_user and ar_user_text.
  • filearchive: fa_user and fa_user_text.
  • image: img_user and img_user_text.
  • ipblocks: ipb_by and ipb_by_text.
  • logging: log_user and log_user_text.
  • oldimage: oi_user and oi_user_text.
  • recentchanges: rc_user and rc_user_text.
  • revision: rev_user and rev_user_text.

In non-compat views these fields should be removed. In compat views, the existing conditional should be simplified to remove the case that reads these fields when the corresponding xx_actor fields are null/0.

Note the "_userindex" tables are not obsolete as they're still needed for queries by xx_actor, although they are perhaps slightly misnamed now.

The end result should be no change in user-visible behavior for the compat views, as everything should already be hitting the actor table cases, while users using the non-compat views will start getting errors about nonexistent fields if they haven't already updated their code to join with actor.

Event Timeline

Change 510595 had a related patch set uploaded (by Anomie; owner: Anomie):
[operations/puppet@production] wiki replicas: Remove reference to old user fields

https://gerrit.wikimedia.org/r/510595

Announced a deployment of 2019-05-27. Hopefully that will work pretty well for everyone.

Hi. I just fixed my tool on toolforge, but... Am I correct to assume that actor_user is null only for anonymous actors (IP)?

So below will only select non-anon actors right? And will also work after the deployment?

SELECT actor_id as id, actor_name FROM actor
	WHERE
		actor_id IN ($idList)
		AND actor_user IS NOT NULL

Hi,
I tried to update one of my tools (intersect-contribs) to use the actor table instead of the revision.rev_user_text field.

Joining the revision and actor tables on revision.rev_actor = actor.actor_id is painfully slow and makes my tool unusable.
I guess that the reason is that there is no index on the revision.rev_actor column (which is nullable, by the way).

Would it be possible to have a view like revision_userindex (which has index on rev_user_text) with an index on the rev_actor column instead?
In this way, this very common join would be way faster.

I tried to update one of my tools (intersect-contribs) to use the actor table instead of the revision.rev_user_text field.

Joining the revision and actor tables on revision.rev_actor = actor.actor_id is painfully slow and makes my tool unusable.

Can you give an example of a slow query?

Sure! In my case, I'm trying to find the the IDs of all the pages edited by both User:Pietrodn and User:Frieda (the database is itwiki_p).

SELECT rev_page, COUNT(rev_id) AS edit_count
FROM revision
JOIN actor ON actor_id = rev_actor
WHERE actor_name IN ('Pietrodn', 'Frieda')
GROUP BY rev_page
HAVING COUNT(DISTINCT actor_id)=2

...but this simpler query, consisting of a plain JOIN, is also slow:

SELECT rev_page
FROM revision
JOIN actor ON actor_id = rev_actor
WHERE actor_name IN ('Pietrodn', 'Frieda')

Use FROM revision_userindex instead of FROM revision.

Still slow. The query executed in 21 minutes. revision_userindex has no index on rev_author either, indeed.

The same query on the indexed rev_user_text field executed in 7 seconds, instead:

SELECT rev_page
FROM revision_userindex
WHERE rev_user_text IN ('Pietrodn', 'Frieda')

This seems to return quickly, on itwiki_p:

SELECT rev_page, COUNT(rev_id) AS edit_count
FROM revision_userindex
JOIN actor ON actor_id = rev_actor
WHERE actor_name IN ('Pietrodn', 'Frieda')
GROUP BY rev_page
HAVING COUNT(DISTINCT actor_id)=2
838 rows in set (6.09 sec)

Weird. Runs fast on SQL console but slow in a web-based tool.
Anyway, I see this is being taken care of. https://phabricator.wikimedia.org/T221339

ahhh, that being in-progress might explain the difference observed. good catch

Announced that this is now going to be scheduled for June 3rd after more feedback and finding at least one issue that, I hope, is fixed over at T221339.

I changed a code of one of my bots to using actor_name instead of log/rev_user_text and bot became work several times longer (more than a hour vs. 5-10 min). This bot makes two requests to DB replicas on Toolforge:

select cast(actor_name as char) user, log_type, log_action, log_namespace, cast(log_title as char) title from logging join actor on actor_id=log_actor where (actor_user in (select ug_user from user_groups where (ug_group = "sysop" or ug_group = "closer")) and log_timestamp >= 20190101000000 and log_timestamp < 20190601000000);
select page_namespace, cast(page_title as char) title, cast(actor_name as char) user from revision join page on rev_page=page_id join actor on actor_id=rev_actor where (actor_user in (select ug_user from user_groups where (ug_group = "sysop" or ug_group = "closer")) and rev_timestamp >= 20190101000000 and rev_timestamp < 20190601000000);

(closer is semi-admin user group in ruwiki)

In T223406#5226490, @MaxBioHazard wrote:

I changed a code of one of my bots to using actor_name instead of log/rev_user_text and bot became work several times longer (more than a hour vs. 5-10 min). This bot makes two requests to DB replicas on Toolforge:

select cast(actor_name as char) user, log_type, log_action, log_namespace, cast(log_title as char) title from logging join actor on actor_id=log_actor where (actor_user in (select ug_user from user_groups where (ug_group = "sysop" or ug_group = "closer")) and log_timestamp >= 20190101000000 and log_timestamp < 20190601000000);
select page_namespace, cast(page_title as char) title, cast(actor_name as char) user from revision join page on rev_page=page_id join actor on actor_id=rev_actor where (actor_user in (select ug_user from user_groups where (ug_group = "sysop" or ug_group = "closer")) and rev_timestamp >= 20190101000000 and rev_timestamp < 20190601000000);

(closer is semi-admin user group in ruwiki)

The actor view uses subqueries for data scrubbing unfortunately. We don't yet have a way to speed that up directly, but if @Anomie has any advice for querying it I can offer to users, I am eager to hear it. We are working on additional speed-ups for that view. Work on that was a bit stalled by some other tasks recently.

Change 510595 merged by Bstorm:
[operations/puppet@production] wiki replicas: Remove reference to old user fields

https://gerrit.wikimedia.org/r/510595

Mentioned in SAL (#wikimedia-operations) [2019-06-03T15:39:09Z] <bstorm_> T223406 labsdb1012 updated views for actor table changes

Change 514063 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: depool labsdb1010 for view updates

https://gerrit.wikimedia.org/r/514063

Change 514063 merged by Bstorm:
[operations/puppet@production] wikireplicas: depool labsdb1010 for view updates

https://gerrit.wikimedia.org/r/514063

Mentioned in SAL (#wikimedia-operations) [2019-06-03T16:30:40Z] <bstorm_> T223406 depooled labsdb1010 for view updates

Change 514185 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: depool labsdb1011 for view updates

https://gerrit.wikimedia.org/r/514185

Mentioned in SAL (#wikimedia-operations) [2019-06-03T22:08:59Z] <bstorm_> T223406 repooled labsdb1010 after completing view updates

Change 514185 merged by Bstorm:
[operations/puppet@production] wikireplicas: depool labsdb1011 for view updates

https://gerrit.wikimedia.org/r/514185

Change 514199 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: depool labsdb1009 for view updates

https://gerrit.wikimedia.org/r/514199

Mentioned in SAL (#wikimedia-operations) [2019-06-04T00:09:26Z] <bstorm_> T223406 repooled labsdb1011 after completing view updates

Change 514199 merged by Bstorm:
[operations/puppet@production] wikireplicas: depool labsdb1009 for view updates

https://gerrit.wikimedia.org/r/514199

Mentioned in SAL (#wikimedia-operations) [2019-06-04T01:10:57Z] <bstorm_> T223406 depooled/repooled labsdb1009 for view updates

This is deployed. Running a quick check on some views shows the fields are gone.

I set up a wiki page to help direct people to if they have issues. Hopefully I can make it a bit better as I see where people are having trouble. I know some folks didn't finish migrating to the new schema since I saw their queries when I depooled things, so I expect some issues. https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_Replicas