Page MenuHomePhabricator

Read queries detected on the x1 master by Echo extension
Closed, ResolvedPublic

Description

SELECT /* EchoNotificationMapper::fetchByUserInternal */ has been seen running on the master despite x1-master-eqiad having weight 0. That is either a bug, or there is lag on the slave (not seen at the time), or it has to be justified, as reads to the master will be very slow and located on a separate datacenter in active-active configuration. Also, in the event of a master being down, the service should continue being in read only mode as much as reasonable (not possible if read queries are sent to the master (I am not familiar with the extension, some reads can be justified on the master).

Event Timeline

jcrespo created this task.May 4 2017, 4:21 PM
Restricted Application added a project: Collaboration-Team-Triage. · View Herald TranscriptMay 4 2017, 4:21 PM
jcrespo updated the task description. (Show Details)May 4 2017, 4:22 PM

This could be a regression on T94448.

Catrope added a subscriber: Catrope.May 4 2017, 5:52 PM

This could be a regression on T94448.

I don't think so: the logstash query that Aaron mentioned at T94448#1632068 still produces no results. That suggests that these master reads are done on POST, not GET.

That said, I'll still investigate why we are reading from the master. There might be a good reason for it, or there might not be.

Most likely these queries come from NotifUser::markRead() (and markUnread), which mark a notification as read/unread and then recompute notification counts and something else (*) from the master because they just wrote there. Echo's strategy here is to recompute these things from the master right after a change happens, and then put them in cache where they will hopefully survive until the next change (and if not, they'll be recomputed from a slave). I think that might not be Aaron's favorite strategy for multi-DC reasons (and it's not quite my favorite either), but it's what we have right now and it doesn't involve reading from the master on GET.

I also looked at other code paths in Echo (grepped for DB_MASTER) and I could only find one code path that would potentially read from the master in a scenario where we didn't just write something, and that's when trying to load an event by ID: if we don't find the event in the replica DB, we'll query the master for it. I have some doubts as to whether this is a good idea.

(*) Whether the number of unread user talk page-related notifications you have left is >0 or not

Change 352066 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/Echo@master] EventMapper: Don't retry failed lookup queries on the master on GET

https://gerrit.wikimedia.org/r/352066

I also looked at other code paths in Echo (grepped for DB_MASTER) and I could only find one code path that would potentially read from the master in a scenario where we didn't just write something, and that's when trying to load an event by ID: if we don't find the event in the replica DB, we'll query the master for it. I have some doubts as to whether this is a good idea.

It isn't, so I changed this to only try the master if we've already written. This approach is also used in a number of similar places in MW core.

Change 352066 merged by jenkins-bot:
[mediawiki/extensions/Echo@master] EventMapper: Don't retry failed lookup queries on the master on GET

https://gerrit.wikimedia.org/r/352066

jcrespo added a comment.EditedMay 5 2017, 6:48 AM

I may be wrong here (you are the experts), but couldn't it (in a "just wrote" recomputing case) wait for the slave to catch up, (I know at low level it does master_gtid_wait/master_pos_wait, not sure what is the right mediawiki function), and then compute it on the slave? I saw the select quite frequent on literally 30 seconds of running show processlist on the master, and echo wasn't even my target (was looking closely at Cognate).

I may be wrong here (you are the experts), but couldn't it (in a "just wrote" recomputing case) wait for the slave to catch up, (I know at low level it does master_gtid_wait/master_pos_wait, not sure what is the right mediawiki function), and then compute it on the slave? I saw the select quite frequent on literally 30 seconds of running show processlist on the master, and echo wasn't even my target (was looking closely at Cognate).

Yeah that's more or less what I hinted at when I said this isn't Aaron's favorite strategy. If we were building this from scratch, we'd have it invalidate the cache on write and populate it on read. The latter might populate from a slave, but that's worked around by some clever freshness logic which makes the invalidation last for 5s (I think?), which is assumed to be more than slave lag will reasonably be. That's my understanding of the currently preferred strategy anyway, and we may well be forced to migrate Echo to this to better support multi-DC.

I'm going to close this task since I believe we've addressed what can be addressed in the short term; I've created T164860: Update Echo's caching strategy for multi-dc compatibility for the longer-term work of changing Echo's caching model, since it's my understanding that that will be a blocker for multi-DC.

Catrope closed this task as Resolved.May 9 2017, 7:18 PM