
Implement logic from updateMenteeData in hooks
Closed, Declined · Public

Description

We currently have a maintenance script, maintenance/updateMenteeData.php, that gathers data on each mentor's mentees and populates database table(s); this information is then accessed on Special:MentorDashboard. The script runs via a cron job.

It may be more scalable to run the data update operations from various hooks (e.g. after edits), using DeferredUpdates or jobs. Another benefit of doing the data updates this way is that the mentor dashboard would show something much closer to real-time information.

We could retain updateMenteeData.php for backfilling data or for compiling data that doesn't make sense to compute from existing hooks.
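As a rough illustration of the hook-based approach described above, an update for a single mentee could be pushed into a deferred update rather than waiting for the next cron run. This is only a sketch: `MenteeDataStore` and its `refreshMentee()` method are assumed names, not part of the current codebase; only `DeferredUpdates::addCallableUpdate()` and `MediaWikiServices` are existing MediaWiki APIs.

```php
use MediaWiki\MediaWikiServices;

// Sketch: from inside a hook handler, schedule a refresh of one mentee's
// row(s) after the response is sent, instead of recomputing all mentees.
// MenteeDataStore is a hypothetical service for this illustration.
DeferredUpdates::addCallableUpdate( static function () use ( $menteeUserId ) {
	$store = MediaWikiServices::getInstance()->get( 'MenteeDataStore' );
	$store->refreshMentee( $menteeUserId );
} );
```

A job queued via JobQueueGroup would be the alternative when the update is too expensive for a post-send deferred update.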

Event Timeline

Restricted Application added a subscriber: Aklapper.
kostajh added subscribers: mewoph, Tgr.

(This is just my opinion; if others prefer to not go this route, let's discuss!)

Moving to needs discussion per @kostajh's comment

Having an alternative representation of state that's kept in sync by a stream of events, instead of being periodically recalculated and replaced completely, is a very maintenance-intensive approach IMO. I would look for cheaper ways (in the sense of developer time) of scaling first, unless we consider near-real-time updates important.

is a very maintenance-intensive approach IMO.

Could you clarify why you view this as maintenance-intensive?


I guess this is somewhat moot since the script is already built, but looking at what we have:

$mainData = [
	'username' => $this->getUsernames( $userIds ),
	'reverted' => $this->getRevertedEditsForUsers( $userIds ),
	'questions' => $this->getQuestionsAskedForUsers( $userIds ),
	'editcount' => $this->getEditCountsForUsers( $userIds ),
	'registration' => $this->getRegistrationTimestampForUsers( $userIds ),
	'last_edit' => $this->getLastEditTimestampForUsers( $userIds ),
	'blocks' => $this->getBlocksForUsers( $userIds ),
];

$this->getRevertedEditsForUsers( $userIds ),

Use onPageSaveComplete and check whether $editResult->isRevert() is true. If so, update the reverted-edit count in the table for that mentee.

$this->getQuestionsAskedForUsers( $userIds ),

Update the count for the user in QuestionPoster.php.

$this->getEditCountsForUsers( $userIds ),

Increment the count in onPageSaveComplete.

$this->getRegistrationTimestampForUsers( $userIds ),

Set this in the onLocalUserCreated hook.

$this->getLastEditTimestampForUsers( $userIds ),

Update this timestamp in onPageSaveComplete.

$this->getBlocksForUsers( $userIds )

Not sure offhand, I imagine there is a hook for this as well.
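The per-field mappings above could be sketched roughly as the following hook handlers. This is illustrative only: MenteeOverviewStore and its methods are assumed names, not existing code, and BlockIpComplete is offered as one plausible candidate for the blocks case, not a confirmed choice.

```php
// Sketch: assumed MenteeOverviewStore service; only the hook names and the
// onPageSaveComplete / onLocalUserCreated signatures are existing MediaWiki APIs.
class MenteeDataHooks {

	public function onPageSaveComplete(
		$wikiPage, $user, $summary, $flags, $revisionRecord, $editResult
	) {
		$store = MenteeOverviewStore::singleton(); // assumed
		$store->incrementEditCount( $user );
		$store->setLastEditTimestamp( $user, $revisionRecord->getTimestamp() );
		if ( $editResult->isRevert() ) {
			// The revert is performed by $user; the 'reverted' count belongs
			// to the author(s) of the reverted revisions, so the handler
			// would need to map those revisions back to their mentees.
		}
	}

	public function onLocalUserCreated( $user, $autocreated ) {
		MenteeOverviewStore::singleton()->setRegistrationTimestamp(
			$user, $user->getRegistration()
		);
	}

	// One possible hook for the blocks case; untested assumption.
	public function onBlockIpComplete( $block, $performer, $priorBlock ) {
		MenteeOverviewStore::singleton()->recordBlock( $block );
	}
}
```

Note the wrinkle in the revert case: $editResult->isRevert() fires on the reverting edit, so attributing the revert to the right mentee is the non-trivial part of that handler.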

Overall, it seems to me more maintainable (a hook offers a contract, whereas the SQL queries rely on internal knowledge of other tables) and less hassle for extensibility to invert the flow of information, and to retain the update script for filling in information that can't be gathered via hooks (like inactivity, although that could probably also be triggered by a deduplicated job that fires every N days after a visit to Special:MentorDashboard).

The script would have had to be written anyway, because we need to deploy to wikis which already have mentees but don't yet have any data, need to be able to recover from errors, and to recalculate after business rule changes. And when a discrepancy is found in the data, a full recalculation script is easy to debug (you just tweak, re-run and see what happens), while an event-trigger-based approach leaves you guessing what exactly might have gone wrong and when (and whether it's even a bug in the code or some infrastructure failure, especially if we do some of these updates in deferreds/jobs).

MediaWiki takes the trigger approach for site statistics (like number of articles) for performance reasons; they get out of sync all the time and need to be recalculated periodically, it is a bit of a pain. We took the trigger approach for the Add Link index, where we don't have other options, and that hasn't been fun either. Triggers are certainly viable as a last resort, but I don't think they should be preferred.

The script would have had to be written anyway, because we need to deploy to wikis which already have mentees but don't yet have any data, need to be able to recover from errors, and to recalculate after business rule changes. And when a discrepancy is found in the data, a full recalculation script is easy to debug (you just tweak, re-run and see what happens), while an event-trigger-based approach leaves you guessing what exactly might have gone wrong and when (and whether it's even a bug in the code or some infrastructure failure, especially if we do some of these updates in deferreds/jobs).

Yes, we would need a script to be able to backfill data, it's true. I guess my main concerns with the longer-term viability of this approach are 1) runtimes (T290609: Make mentee overview module's updateMenteeData.php scale better), 2) querying DB tables directly seems brittle and prone to problems over time, 3) how does this scale when there are 10, 100, 1000 times more mentors/mentees being processed?

MediaWiki takes the trigger approach for site statistics (like number of articles) for performance reasons; they get out of sync all the time and need to be recalculated periodically, it is a bit of a pain. We took the trigger approach for the Add Link index, where we don't have other options, and that hasn't been fun either. Triggers are certainly viable as a last resort, but I don't think they should be preferred.

True, but that is also a hybrid case that doesn't really compare to the proposal here, because refreshLinkRecommendations.php runs as a periodic script (like updateMenteeData.php) but it also uses triggers that involve interactions with delayed updates to the search index, which adds additional complexity.

Scaling is a fair point, and if the update script cannot be made significantly faster, triggers might be the only way to go. I just think we should exhaust other options before going that way.

Wrt brittleness, DB queries are not protected by stable interface policies. OTOH they can only be broken by schema changes, which are rare and easy to identify; and they are easier to debug. (And the queries need to be debugged / kept working anyway, because of backfilling.) On the whole, I think that favors queries (assuming the performance issues can be worked out).

There's also the matter of storage space. There are lots of mentees (technically every user is a mentee), so we might want to limit storage by discarding inactive users. A query approach can easily handle user inactivation and reactivation. With triggers, if you discard data when the user goes inactive, you cannot recover it when the user becomes active again.

Thanks everyone for discussing this idea here. I think that for now, this task should be closed as declined. The update script is now performing much better than it used to (see blog post for a summary), and as a result, the updates are calculated every three hours (rather than only daily). That's already a large improvement. Even though real-time updates would be potentially useful, they sound less important now than they did with daily updates, especially with the possibility of T293454: Mentor dashboard: M1 mentee overview: Add "last updated" indicator and ability to manually refresh mentee data being done instead.

To summarize, I think the effort associated with re-implementing the updates via hooks (i.e. a significant portion of the module; it is more or less what took the most time to get right) is not sufficiently compensated by the benefits of the work. For that reason, I'm boldly closing the task as declined.