
Add a link engineering: Create event for event gate to update search index after obtaining link recommendations
Closed, Resolved (Public)

Description

Please see https://wikitech.wikimedia.org/wiki/Add_Link and the background reading linked there for details.

Summary: we will have a maintenance script running via cron that is going to iterate over a list of articles and fetch link recommendations from an external service. If that service returns recommendations, then we want to update the ElasticSearch document with metadata that says "yes this article has link recommendations". If the service does not return recommendations, then we need to update the ES document to say "no it does not have link recommendations". The updates are done by sending an event to eventgate with the relevant data needed by the search pipeline to process.

We were thinking this field would be called "hasrecommendations" and that it would accept an array of strings; the only value we'd set now is "links". (Later this fiscal year we will be working with image recommendations, so we should consider a search keyword that allows us to do queries like "hasrecommendations:images|links".)
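For illustration, a query using such a keyword could be issued through the MediaWiki search API roughly like this. This is a sketch only: it uses the singular hasrecommendation:link form that shows up later in this task, and cswiki purely as an example wiki.

```python
import requests

# Sketch: query the MediaWiki search API for pages that carry link
# recommendations, assuming a CirrusSearch keyword like "hasrecommendation:link".
# The wiki, keyword form, and user agent here are placeholders.
resp = requests.get(
    "https://cs.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "list": "search",
        "srsearch": "hasrecommendation:link",
        "srlimit": 10,
        "format": "json",
    },
    headers={"User-Agent": "add-link-example/0.1"},
    timeout=10,
)
resp.raise_for_status()
for hit in resp.json()["query"]["search"]:
    print(hit["title"])
```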

In discussion with the Discovery-ARCHIVED team, it was suggested that we store it alongside other ML predicted classes.

To clarify, we need to decide on:

  • the name of the field to store the data in
  • the format of the metadata to use in the event that is sent to EventGate

Related Objects

Event Timeline


There is also the possibility of the cron maintenance script approach (rather than the ORES-like pipeline you've described), described in T261408: Add a link engineering: Maintenance script for retrieving, caching, and updating search index, where we would issue a null edit after retrieving link recommendations; then we could implement the SearchDataForIndex hook and retrieve the recommendations from our SQL table.

I know @Tgr is not enthusiastic about the null edit idea but it seems to me that it would be better than trying to implement a caching layer in the Link Recommendation Service.

I suspect our difference is about where linkrecommendations gets called. I re-read a few times and I think we are thinking of different processes.

In the ORES-like pipeline you describe above, you would have some code that calls linkrecommendationservice.wikimedia.org (or whatever) with the article ID and the wiki ID. Then several seconds later you would get back a response like:

Perhaps I'm misinterpreting the "you" here, but nothing in the search pipeline would know that linkrecommendationservice exists. Some code in the production network, probably mediawiki but it's up to your team, needs to generate an eventgate event that says approximately "page 1234 on site abcd has link recommendations". From there the search data pipelines will update the appropriate page. Who/what calls the link recommendation service, and what happens to those predictions other than generating the event, seem out of scope.

Basically, I'm saying instead of calling some function that does a null edit, call a function that sends an eventgate event for that page id.
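To make "send an eventgate event for that page id" concrete, a minimal sketch of doing that as a plain HTTP POST might look like the following. The endpoint URL, stream name, schema path and field names are placeholders made up for illustration, not the schema that ends up being agreed on later in this task.

```python
import datetime
import requests

# Rough sketch only: the endpoint URL, stream name, schema path and field names
# below are placeholders, not the schema that was eventually adopted in this task.
EVENTGATE_URL = "https://eventgate.example.internal/v1/events"

def send_link_recommendation_event(wiki_id: str, page_id: int, has_recommendations: bool) -> None:
    """Notify the search pipeline that a page does / does not have link recommendations."""
    event = {
        "$schema": "/mediawiki/revision/recommendation/example/1.0.0",  # placeholder
        "meta": {
            "stream": "mediawiki.revision-recommendation-example",      # placeholder
            "dt": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "domain": wiki_id,
        },
        "page_id": page_id,
        "has_link_recommendations": has_recommendations,                # placeholder field
    }
    # EventGate accepts a JSON array of events per POST.
    response = requests.post(EVENTGATE_URL, json=[event], timeout=10)
    response.raise_for_status()

send_link_recommendation_event("abcd", 1234, True)
```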

I concur with the idea of generating an event (from the maint script described in the task description) to instruct the search pipeline what to update (the data to store is in the event itself); this gives a chance to join multiple events for the same page together and perform a single update.
The drawback is the added latency: the latency of the cron schedule for the maint script plus the latency of the search pipeline. The latency of the search pipeline is, I think, something we want by design; the latency induced by the cron schedule of the maint script could (in theory, IIUC) be worked out in a more event-driven fashion.
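As a rough illustration of the "join multiple events for the same page together" idea above, a consumer-side sketch follows. The event shapes and field names are made up; they only echo the kinds of external data mentioned later in this thread, such as ORES topics and page popularity.

```python
from collections import defaultdict

# Sketch of joining several per-page events into a single search index update.
# Event shapes and field names here are illustrative, not the real pipeline's.
events = [
    {"wiki": "cswiki", "page_id": 1234, "fields": {"has_link_recommendation": True}},
    {"wiki": "cswiki", "page_id": 1234, "fields": {"ores_topics": ["history"]}},
    {"wiki": "cswiki", "page_id": 99,   "fields": {"popularity_score": 0.002}},
]

merged = defaultdict(dict)
for event in events:
    # later events win on conflicting fields; one update per (wiki, page) pair
    merged[(event["wiki"], event["page_id"])].update(event["fields"])

for (wiki, page_id), fields in merged.items():
    print(f"single update for {wiki}:{page_id} -> {fields}")
```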

Thanks @EBernhardson & @dcausse for explaining, and for your patience. OK, I think I mostly understand now although I am still confused about one thing, see below.

Some code in the production network, probably mediawiki but it's up to your team, needs to generate an eventgate event that says approximately "page 1234 on site abcd has link recommendations". From there the search data pipelines will update the appropriate page
Basically, i'm saying instead of calling some function that does a null edit, call a function that sends an eventgate event for that page id.

I think this makes sense. If I understand correctly: both for our maintenance script that is running on cron, as well as in the context of page update and page deletion hooks, we would use the EventBus extension to create an event and then send it. (As an aside, I don't see any prior art of other extensions doing this, but maybe I am missing something.)

What I don't understand: you've both referenced the idea that we only want to make a single update for a page. Won't one of the jobs in CirrusSearch have already updated the search index for the article (for the page edit / delete example) by the time the event gate workflow finishes? To try to restate that:

User makes an edit to an article:

  1. GrowthExperiments via the PageSaveCompleteHook calls the link recommendation service, sees if there are link recommendations, then crafts an event for the revision (Yes we have link recommendations for this revision, or no we do not) and sends it via EventBus. Then the search data pipelines update the document for that page.
  2. CirrusSearch has a separate process for writing updates in near real-time to the search index as described in https://wikitech.wikimedia.org/wiki/Search#Realtime_updates

Isn't the consequence of points 1 and 2 above that we end up with two updates to the search index for a single revision?

Or... are you saying that, if we use EventBus to send an event with information about whether or not there are link recommendations, your code in the search pipeline will then examine hundreds or thousands of these events in one go (potentially hourly), and then make a single write to Elasticsearch with updates for the hundreds or thousands of articles that we have updated information for?

I think this makes sense. If I understand correctly: both for our maintenance script that is running on cron, as well as in the context of page update and page deletion hooks, we would use the EventBus extension to create an event and then send it. (As an aside, I don't see any prior art of other extensions doing this, but maybe I am missing something.)

Sending an event to eventgate is a relatively easy operation: a dedicated Logger channel can be configured to point to eventbus and the stream configured through wgEventStreams, so that simply logging the right context from your extension will produce an event. You could check how api requests are logged (see \ApiMain::logRequest). Note that I don't know if this is the preferred way to do this in MW though.

Isn't the consequence of points 1 and 2 above that we end up with two updates to the search index for a single revision?

yes we have a "near realtime" update process for page content updates (point 2) and another one for "external data" updates (e.g. point 1). We're talking about generalizing "external data" updates so that we avoid updating the doc as many times as we have "external data" fields to update. Example of "external data" updates are:

  • ores topics
  • page popularity computed from pageviews (does not depend on edit events)
  • recommendation flags

Not all these cases involve a page update, but when a page update is also the source event of an "external data" update then yes we'll end up sending 2 updates to elastic.

Ideally, if all the "external data" were fetched sequentially and synchronously during the "near realtime" process inside MW, then only one update would be necessary (for the kinds of updates that only depend on the page), but I think this option was already ruled out for the ORES topic models as it does not seem to scale.

kostajh renamed this task from Add a link engineering: Update search index after obtaining link recommendations to Add a link engineering: Create event for event gate to update search index after obtaining link recommendations.Sep 1 2020, 10:06 AM
kostajh updated the task description. (Show Details)

It looks like we all agree on the implementation. On the Search Platform side, we will open another task to start consuming this event when it is available.

Moving forward with an event, some format needs to be decided. Very similar data is currently consumed via mediawiki_revision_score events. Reading through the current spec, only the descriptions seem overly specific to ORES. The event itself is a reasonably generic representation of predicted classes. We could fit predicted links here, with the links as the prediction, and probability per link. But maybe that's improper, it feels a bit odd.

Perhaps also worth at least thinking about: the prediction here is a set of links, but the only piece necessary to solve this ticket is a boolean about predictions existing. That reduction in information can happen either before or after the event; I suspect after the event is more useful for potential future consumers.

@Ottomata You might have some useful thoughts here. @calbon You might also have some thoughts on modeling ml predicted classes and values in events.

OO lots of stuff in this ticket; haven't grokked it all. But I think we could re-use the mediawiki/revision/score schema if it makes sense. The schema could be used for a new stream, perhaps mediawiki.revision-link-recommendation, or something like it? I'm not familiar with the data model needed, but it also may make sense to make a new schema if this one doesn't fit.

Moving forward with an event, some format needs to be decided. Very similar data is currently consumed via mediawiki_revision_score events. [...]

FYI mediawiki/event-schemas is a deprecated repository. Instead use https://gerrit.wikimedia.org/r/plugins/gitiles/schemas/event/primary/ and/or https://gerrit.wikimedia.org/r/plugins/gitiles/schemas/event/secondary/.

We would use the EventBus extension to create an event and then send it. (As an aside, I don't see any prior art of other extensions doing this, but maybe I am missing something.)

You can use the EventBus extension to send events, but I'm not sure if that is quite right here. If your events can be emitted as part of either a MW request or a MW job, then that might make sense. The EventLogging extension is currently being updated with a PHP API to send events to EventGate.

Or, you can just POST the events to EventGate yourself. I guess if your code is already in MediaWiki land it might make sense to use EventBus or EventLogging.

You could check how api requests are logged (see \ApiMain::logRequest). Note that I don't know if this is the preferred way to do this in MW though.

This has been very inflexible for us to upgrade things in the past, so I don't think this is the preferred way.

Happy to jump in a meeting to discuss this more sometime (though I'll be off Sept. 28 - Oct 9).

I am not certain what this ticket is about so I'll admit I have no thoughts smart or otherwise. Give me a bit to dig into the ticket discussion and figure it out

Tgr edited projects, added Growth-Team (Sprint 0 (Growth Team)); removed Growth-Team.

As I understand it we'll need two kinds of updates:

  • Producing recommendations, to ensure there is always a healthy pool of tasks for new editors to pick from. This is not particularly time-sensitive (we can just make a large enough pool that won't run out in an hour), so we can send events from a cronjob and the kafka consumer can also process them periodically in batches.
  • Invalidating recommendations, i.e. making sure a search query with hasrecommendations doesn't return an article where the recommendations have just been acted upon by another user. This needs to be instantaneous, or at least a lot more timely than once per hour. Normally we can just rely on the article edit to do this (in general, we want to invalidate the old recommendation on edit, and we don't care about regenerating immediately, as we can just rely on the pool size as above), but a user might reject all the recommendations, in which case there is no edit and we still want to remove that page from the result set. We could do a null edit, but that's wasteful since we have no need for parsing the page, updating links tables, etc.

So ideally we'd need two separate channels, one for adding recommendations to the search index (can be batched, currently it happens entirely independent of edits but in the future might be a follow-up event to edits that has too much delay to be included in the main update) and one for removing recommendations from the index (needs to be processed in near real time, happens independent of edits).

Moving forward with an event, some format needs to be decided. Very similar data is currently consumed via mediawiki_revision_score events. Reading through the current spec, only the descriptions seem overly specific to ORES. The event itself is a reasonably generic representation of predicted classes. We could fit predicted links here, with the links as the prediction, and probability per link. But maybe that's improper, it feels a bit odd.

A prediction is more than a link; it's something along the lines of link target + text to be linked + additional data to make the identification of the text to be linked unique.
I suppose we can stringify it and turn it into a key in the score schema if we try hard enough, but I see little value in it.

We could reduce the prediction to a boolean and that would fit better (e.g. ORES damaging/goodfaith predictions work like that), although as you say it's less flexible for future use cases. It would let us reuse the existing ORES pipeline though, so I think it's the Search team's call whether you would prefer that. On our end it would make very little difference.

Change 642225 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[schemas/event/primary@master] Add recommendation-related schemas

https://gerrit.wikimedia.org/r/642225

After vacillating for a while I went with new events (mediawiki/revision/recommendation/add, mediawiki/revision/recommendation/invalidate) instead of reusing mediawiki.revision.score. Semantically it seemed wrong: the event does not contain any score; it is a notification that a recommendation was generated.

Neither event needs any data now (although that might change), we are only interested in tracking which pages do / do not have recommendations.

@Ottomata I imagine the right way to send an event from MediaWiki is using EventBus? That will require configuring the event stream via $wgEventStreams. What's the preferred approach, should that happen in the extension or in mediawiki-config?

I'm having a hard time describing it, but these events feel a bit too transactional, almost like an RPC. I looked over our schema guidelines and they are at a more concrete level, not detailing the higher-level theory of how to model the events for our use case. I'm thinking, though, that an event should be more of a description of the state of its world, rather than a request to do a specific thing. The specific things to do should be derived from the state of its world. The purpose here would be to make events that are generally useful to consumers not yet conceived.

There is a bit in the guidelines about conventions around state changes. This is certainly a change of state, but it's not clear if that form would be best here. I suppose at the core of it I'm thinking there should be a single event type that describes the state of link recommendations for a page, even if it's partial state, vs multiple events for specific state changes.

These are actual events though: add happens when a maintenance script generates recommendations, invalidate happens when a user acts on a recommendation (which can involve making an edit to the page, but doesn't have to). Is it a naming issue? add could be something like evaluate (to better resemble score), maybe? invalidate could be called something like submit? (accept/reject wouldn't work since a single recommendation contains multiple links, and the user could accept some and reject some in the same action.) We want to invalidate in situations other than the user accepting/rejecting, when the page has changed and the recommendation is not current anymore, but we can probably use the normal MediaWiki index update mechanism for that.

That will require configuring the event stream via $wgEventStreams. What's the preferred approach, should that happen in the extension or in mediawiki-config?

mediawiki-config please! I need to write a How To for EventBus events. For now, here's the EventLogging based one. The steps described there for stream config apply for EventBus too.

Hm, @Tgr, Do you really need multiple schemas here? It looks like both schemas are just fragment/mediawiki/revision/common with a recommendation_type field added. IIUC, you are trying to emit 'revision link recommendation opportunity identified & link submitted' events?

I think what you want is a single schema and different streams. Think of a schema as a datatype of a stream. Type is to Variable as Schema is to Event Stream. So, you can declare multiple streams that use the same schema. For example, see the mediawiki.revision-create and mediawiki.page-create streams in $wgEventStreams. They both set schema_title: mediawiki/revision/create.
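To make the schema-vs-stream point concrete: the stream configuration is essentially structured data, sketched here as a Python literal for readability. The real thing lives in $wgEventStreams in mediawiki-config (PHP), and its exact keys, and the recommendation stream/schema names, may differ from this sketch.

```python
# "One schema, many streams", expressed as a Python literal for illustration.
# The real configuration is $wgEventStreams in mediawiki-config (PHP); its exact
# keys may differ from this sketch.
wg_event_streams = [
    # Two existing streams sharing one schema, as mentioned above:
    {"stream": "mediawiki.revision-create", "schema_title": "mediawiki/revision/create"},
    {"stream": "mediawiki.page-create", "schema_title": "mediawiki/revision/create"},
    # Hypothetical entries for this task, following the same pattern: two streams
    # (however they end up being named) over a single recommendation schema.
    {"stream": "mediawiki.revision-recommendation-link-found",
     "schema_title": "mediawiki/revision/recommendation/change"},
    {"stream": "mediawiki.revision-recommendation-link-submit",
     "schema_title": "mediawiki/revision/recommendation/change"},
]
```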

Naming here does seem hard. Maybe one schema mediawiki/revision/recommendation/change and 2 streams: mediawiki.revision-recommendation-link-(solicit|requested|????) and mediawiki.revision-recommendation-link-(create|submit|add).

Some Qs, just because I'm curious.

  • If someone adds a link, that will result in a new revision, right? Do we want to capture that information in the 'submit' event stream?
  • Once a recommendation has been made, does it have some kind of ID? How do you know once a user submits a link that that submission is associated with that previous recommendation? Can there be more than one recommendation event for a given revision? Perhaps in this mediawiki/revision/recommendation/change schema you can do as Erik suggested and capture state change information? Maybe capturing both a list of outstanding_recommendations and maybe fulfilled_recommendations, and then also the change via a prior_state field with the same fields?

Anyway, I'm only gathering bits and pieces of what you are trying to do. Making data model conventions is hard because the use cases are always so varied. I'd be happy to help brainstorm this with you in a quick meeting sometime if that would help!

IIUC, you are trying to emit 'revision link recommendation opportunity identified & link submitted' events?

Yeah, although "identified" is maybe misleading in the sense that at this point the recommendation service has already been invoked and the response validated as being good enough for the use case (e.g. the number of links it recommended for the page was high enough), and cached in MediaWiki, so this notification is just for the search index. Pragmatically, it's "recommendation cache updated" but that didn't seem like an elegant way to describe an event.

Do you really need multiple schemas here? It looks like both schemas are just fragment/mediawiki/revision/common with a recommendation_type field added.

In the future we might need more fields and those would be almost certainly different. For a newly generated recommendation, we might want to send more information about it that might need to be factored into the search index (e.g. some kind of confidence score, or the number of links recommended). For a user submission, the search logic needs to know whether the submission resulted in an edit (in which case the search index will be updated via the normal mechanism) or not (in which case the update has to be triggered by this event), although we could avoid that by simply not sending any event in the first case. We might also want to record for analytics purposes what decisions the user made, or what new article revision resulted.

If you think we should take a YAGNI approach here and use the same data type until we are sure we'll need something else, I'll update the patch.

Naming here does seem hard. Maybe one schema mediawiki/revision/recommendation/change and 2 streams: mediawiki.revision-recommendation-link-(solicit|requested|????) and mediawiki.revision-recommendation-link-(create|submit|add).

Submit sounds fine; create/add would imply to me that the user accepted at least one link, which is not necessarily the case.

Solicit/request is a bit misleading in that the recommendation was already retrieved and stored by this point. The logic of the cronjob doing this is something like: check whether we have a large enough recommendation pool -> find new candidates based on some criteria about how the pool should be balanced in what kind of articles it contains -> get recommendations from the link recommendation service for the candidates -> check whether it could recommend enough links with high enough confidence; if it could, save the recommendation (the service is deterministic but slow) and message EventGate to update the search index. So maybe the stream name could be something like 'generated' or 'found'?
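That cronjob logic, sketched as code; every helper, threshold and name below is a hypothetical stand-in for the real GrowthExperiments implementation, included only to make the flow above concrete.

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of the cronjob logic described above. Every helper,
# threshold and data structure here is a made-up stand-in, not the real
# GrowthExperiments implementation.

MIN_POOL_SIZE = 500   # desired number of available tasks (placeholder value)
MIN_GOOD_LINKS = 2    # minimum number of confident links worth surfacing (placeholder)
MIN_SCORE = 0.6       # minimum confidence per recommended link (placeholder)

@dataclass
class Recommendation:
    page_id: int
    links: list  # [(anchor_text, target_title, score), ...]

def current_pool_size() -> int:
    return 480  # stub: count of cached, unused link recommendation tasks

def find_candidate_pages(n: int) -> list:
    return random.sample(range(1, 100_000), n)  # stub: pool-balancing candidate selection

def fetch_recommendations(page_id: int) -> Recommendation:
    # stub for the slow but deterministic link recommendation service
    return Recommendation(page_id, [("anchor", "Target", random.random()) for _ in range(4)])

def cache_recommendation(rec: Recommendation) -> None:
    pass  # stub: save to the GrowthExperiments SQL table

def send_search_index_event(page_id: int) -> None:
    print(f"event: page {page_id} now has link recommendations")  # stub: POST to EventGate

def refresh_task_pool() -> None:
    missing = MIN_POOL_SIZE - current_pool_size()
    if missing <= 0:
        return  # the pool is healthy, nothing to do
    for page_id in find_candidate_pages(missing):
        rec = fetch_recommendations(page_id)
        good_links = [link for link in rec.links if link[2] >= MIN_SCORE]
        if len(good_links) < MIN_GOOD_LINKS:
            continue  # not enough confident links; skip this page
        cache_recommendation(rec)         # the service is slow, so store the result
        send_search_index_event(page_id)  # tell the search pipeline via EventGate

refresh_task_pool()
```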

  • If someone adds a link, that will result in a new revision, right? Do we want to capture that information in the 'submit' event stream?

We might, we haven't really thought about that. The search infrastructure won't care (all we need is to reindex that page to ensure it will not be returned for a "pages with recommendations" search); we'll probably need it for analytics (what would have been EventLogging in the past), but 1) maybe that will be easier to do on the client side, 2) is it appropriate for a schema to represent both a primary and a secondary event?

  • Once a recommendation has been made, does it have some kind of ID?

Not for now, although it did come up in the context of discussing a standardized framework for recommendations in MediaWiki (something that's planned in the long term but not for this project). For our purposes a revision is sufficient for identifying a recommendation. Even a title would be, barring some unlikely race condition scenarios; we don't care so much about past recommendations on the same page that we'd need to be able to reference them. It would of course be easy to generate a unique ID.

How do you know once a user submits a link that that submission is associated with that previous recommendation?

The exact mechanics are TBD (we might rely on the edit happening via a somewhat custom interface, which could mark edit requests with a custom HTTP header or some such), but this identification has to happen within MediaWiki (as we need to apply change tags to these edits), so it's unrelated to the event pipeline.

Can there be more than one recommendation event for a given revision?

No, although that comes down to the somewhat arbitrary decision of what we call a "recommendation". Currently it is a bundle of all links the service recommended for that page. We could call each link a separate recommendation, and then there would be several, but it would be a poor fit for our workflows: the recommendation service generates and returns them as a bundle, we want to show them to the user as a single unit with a joint change review dialog after they have accepted/rejected every one of them, and at the end we want to discard all recommendations the user has seen, even if the user never accepted or rejected some of them (this is a feature for inexperienced users, and anything the user has seen but skipped might be too hard or confusing; just getting a new random recommendation from another page is more likely to be effective). In theory it's not hard to imagine that there could be multiple events per revision, even with relatively small changes to the planned workflow.

Maybe capturing both a list of outstanding_recommendations and maybe fulfilled_recommendations, and then also the change via a prior_state field with the same fields?

Recommendations aren't stored anywhere in any kind of permanent and referenceable way, so I'm not sure we could achieve much with that.

What is not clear to me is how you reconcile the decision made by the add_link identification algorithm run by the maint script with the user's decision to refuse a recommendation.
What happens if you re-run the maint script again after the users have refused some recommendations?

Change 643230 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] [WIP] Add EventStream config for link recommendations

https://gerrit.wikimedia.org/r/643230

What is not clear to me is how you reconcile the decision made by the add_link identification algorithm run by the maint script with the user's decision to refuse a recommendation.
What happens if you re-run the maint script again after the users have refused some recommendations?

We might just ignore it initially, as it is not very likely to happen (the recommendation pool is a small random subset of all wiki articles). In the longer term I imagine we'll end up with some kind of blacklist. That's discussed in T266446: Add Link engineering: Provide a mechanism for storing data about which link recommendations were rejected by the user.

So ideally we'd need two separate channels, one for adding recommendations to the search index (can be batched, currently it happens entirely independent of edits but in the future might be a follow-up event to edits that has too much delay to be included in the main update) and one for removing recommendations from the index (needs to be processed in near real time, happens independent of edits).

Can we just have a single schema and stream, with the list (or just a count?) of link recommendations for a revision? If the count is 0 (the user has rejected them), the index can be updated accordingly. We could also add a field indicating whether the change was made by a user rejection or by the link recommendation service. A mediawiki/revision/recommendation-change schema? For future types of recommendations, we could do like mediawiki/revision/score does and use an array of maps, as long as a generic enough model of a recommendation can be shared between the different types.

Can we just have a single schema and stream, with the list (or just a count?) of link recommendations for a revision? [...]

That would mean the logic for deciding when to remove a page from the index of pages with recommendations (e.g. when the number of valid recommendations falls below a certain number) would have to happen in the EventGate consumer, which I don't think is an ideal place. It would be better to keep most of the business logic together, in the MediaWiki extension, with the event making it very clear whether an index update is needed. For example, the current business logic is to discard the page from the task pool even if the user didn't accept or reject a single link but skipped all of them. We might change that in the future and instead discard when the number of unreviewed links falls below a certain threshold (we use such a threshold when generating new tasks). Having to update search pipeline code for such changes doesn't seem ideal.

Also, if we do start to log user actions in the future (ie. which links the user accepted/rejected/skipped), we'll need separate schemas anyway as it doesn't really make sense to have that data on events about new recommendations.

That would mean the logic for deciding when to remove a page from the index of pages with recommendations [...] would have to happen in the EventGate consumer

(Just a clarification: EventGate is the Kafka HTTP producer proxy. But I get what you mean: the consumer process.)
Isn't this true anyway? With different streams, your consumer is going to have to consume both and know the difference and react accordingly.

if we do start to log user actions in the future (ie. which links the user accepted/rejected/skipped), we'll need separate schemas anyway as it doesn't really make sense to have that data on events about new recommendations.

This makes sense.

with the event making it very clear whether an index update is needed

This might be ok, but I think it isn't the best way to do data modeling for events. As Erik said before, it's better to try and model events as representing what happened, and allowing consumers to react, rather than as commands to do something. This allows them to be more useful to other potential use cases, like historical analysis, or updating some other downstream datastore. Ideally, this event shouldn't be about ElasticSearch at all, it should be a representation of something happening that any interested downstream consumer could use to update its state.

I don't mean that the RPC(ish) data model is bad or not allowed or anything. We should just do our best to model this as an event if we can.

Gergo and I just had a meeting to discuss, and I think I was missing some context and now am all filled in! The event schema being designed in https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/642225 is actually a pretty good event model; it's just not adding data that isn't needed for this use case (yet), because future 'recommendation' projects (images, etc.) are still ongoing and it isn't clear if you can use the same data model for all recommendation types.

So IIUC, the schema should be a mediawiki/revision/recommendation/create event, which has a recommendation_type field indicating that it is a link recommendation. If/when more data about the actual recommendation is needed, it can be added to the data model then. For now, all this use case needs is to know that a link recommendation for a revision happened.
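For concreteness, an event under that model might look roughly like the following. recommendation_type is the field named above; the other fields follow the mediawiki/revision/common fragment as described earlier in this task, and the exact schema path, version, stream name and required fields are whatever the merged schema ends up defining.

```python
# Rough illustration of a recommendation-create event under the model described
# above. recommendation_type is the field named in the comment; the other fields
# follow the mediawiki/revision/common fragment, and the exact schema
# path/version and stream name are assumptions.
example_event = {
    "$schema": "/mediawiki/revision/recommendation/create/1.0.0",  # assumed path/version
    "meta": {
        "stream": "mediawiki.revision-recommendation-create",      # assumed stream name
        "dt": "2020-12-02T19:00:00Z",
        "domain": "cs.wikipedia.org",
    },
    "database": "cswiki",
    "page_id": 1234,
    "page_title": "Example",
    "rev_id": 56789,
    "recommendation_type": "link",
}
print(example_event["recommendation_type"])
```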

Gergo said the invalidate event might not be needed after all, but if it is, I think it would be different from this 'recommendation added' one. That event would be more like a 'user invalidated a revision link recommendation' event, and might belong in a different namespace? Not sure. Or, the 'invalidated' event could be one that is modeled more like an RPC command, since it really would only exist to clear the elasticsearch index.

I think if we focus on the recommendation/create event for now, we can make this move forward. @Tgr ping me when you update the schema patchset and I'll check it out.

:)

Gergo said the invalidate event might not be needed after all

I'll add a more detailed writeup of the plans based on the meetings we had today, but the short version is, we want to have some kind of re-validation logic when the article changes, to see if the recommendation still makes sense; initially, and maybe always for link recommendations, that logic will just be to always discard the recommendation. We can just re-use this mechanism for the case when the user submits/rejects a recommendation; and since search index updates after edits happen within MediaWiki and don't require emitting an event to EventGate, it would be the same for invalidating recommendations - we'd just run SearchUpdate, or something like that.

In the future, we might well want to emit an event when the user processes the recommendation, for analytics or other purposes, but the details for that are not yet clear and we can figure it out when we get there.

Change 643566 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/EventBus@master] Add EventBus method for mediawiki/revision/recommendation/create

https://gerrit.wikimedia.org/r/643566

I think if we focus on the recommendation/create event for now, we can make this move forward. @Tgr ping me when you update the schema patchset and I'll check it out.

Thanks for the help! I have updated the patches.

I wrote down the interaction between the MediaWiki extension and Elasticsearch in T268803: Add a link engineering: Search pipeline; please correct me if I got something wrong.

Change 642225 merged by Ppchelko:
[schemas/event/primary@master] Add mediawiki/revision/recommendation-create

https://gerrit.wikimedia.org/r/642225

Change 643230 merged by jenkins-bot:
[operations/mediawiki-config@master] Add EventStream config for link recommendations

https://gerrit.wikimedia.org/r/643230

Mentioned in SAL (#wikimedia-operations) [2020-12-02T19:17:12Z] <tgr@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:643230|Add EventStream config for link recommendations (T261407)]] (duration: 01m 06s)

Change 643566 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Add EventBus method for mediawiki/revision/recommendation-create

https://gerrit.wikimedia.org/r/643566

Everything needed here is done, I think.

Here's how the event pipeline performed for the first few wikis (the graph shows the number of search results for hasrecommendation:link, from T-14 days to T-7 days):

task-pool.png (306×917 px, 29 KB)

The orange line is testwiki, which just doesn't have many articles. The blue line is cswiki, which behaved as expected. The other lines (arwiki, bnwiki, viwiki) show a very short growth period, then a long flat period (about a day), then more growth. The maintenance script that creates tasks writes to the DB and sends the EventGate event at the same time, and while we don't have timeline data for the DB, the number of tasks stored there is 20-30% larger than the number of tasks in the search index for those three wikis (for cswiki it is equal). That can't just be delay; the task count has since leveled off for all wikis and the discrepancy remained. So it seems like the events during that flat period got lost.

I'll look through later today and see what's going on. I think we still have the inputs and outputs for all stages, so it's mostly looking through and finding where things were lost.

The flat spot is explained by the weekly update. Essentially there is a large update that takes ~a day to load into the elasticsearch servers, and these updates end up behind it in the kafka queue. When that update finishes, though, the remaining messages should still be in kafka and get processed by the update daemon (mjolnir).

The general problem, that the timely updates get stuck behind a big bulk update, can probably be solved by defining a second kafka topic for the hourly updates. I'd have to double check, but we might be able to add it to the list of topics subscribed to and it "just works".

For the more specific problem, that the set of delayed updates don't seem to have made it into the search engine, I need to spend more time looking.

It looks like our other problem is that we are uploading files to swift with an object_prefix of yyyymmdd and a path of hour=hh/part-nn.gz. Looking at swift file listings, any given day only has hour=23, suggesting that, at least with how we are using swift today, it is deleting anything already at that prefix and replacing it with the new part. I'll double check, but I don't think the downstream cares what exactly this is, only that a valid url is provided. I suspect we will need to move the hour into the main object_prefix.

In normal operations this wouldn't be as easily noticed; the messages are processed before the next hour has a chance to delete them. But it seems we've been losing hourly updates for about 1 day a week since deploying the hourly updates.
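In other words, with made-up values (I haven't checked the actual upload code, so treat the exact layout as an assumption): when the hour only lives in the object path under a per-day prefix, each hourly upload targets the same prefix and can clobber the previous hour, whereas moving the hour into the prefix keeps the hours apart.

```python
# Made-up values illustrating the prefix problem; the real upload code in
# wikimedia/discovery/analytics may differ in detail.
day, hour, part = "20210601", 7, 0

# Current layout: the hour is only in the object path, so every hourly upload
# goes under the same per-day prefix and can replace the previous hour's objects.
current_prefix = day
current_path = f"hour={hour:02d}/part-{part:02d}.gz"

# Proposed layout: move the hour into the main object_prefix so each hourly
# upload has its own prefix and nothing gets overwritten.
fixed_prefix = f"{day}/hour={hour:02d}"
fixed_path = f"part-{part:02d}.gz"

print(current_prefix, current_path)  # 20210601 hour=07/part-00.gz
print(fixed_prefix, fixed_path)      # 20210601/hour=07 part-00.gz
```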

After we fix this i can have it backfill the hourlies of the last few weeks.

Separately, I need to see why mjolnir didn't complain. Checking the logs for host:search-loader* at the relevant time span doesn't indicate any failures. The only thing that stands out is that they are missing the 'download complete' message.

Change 693205 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] mjolnir bulk daemon: Add topic for hourly updates

https://gerrit.wikimedia.org/r/693205

Change 693231 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Split kafka channel used for weekly and hourly updates

https://gerrit.wikimedia.org/r/693231

Thanks for looking into it! If it's a significant effort to backfill, we can probably do it more easily on our side (we write to the database and the search index at the same time, so we can just go through the DB entries not matching the search index and re-emit the event - that would be T282873: Add Link: Fix production discrepancies between the link recommendation table and the search index).

kostajh triaged this task as High priority.Jun 1 2021, 8:40 PM
kostajh lowered the priority of this task from High to Medium.

Change 697836 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] mjolnir: Stop listening to BC topics

https://gerrit.wikimedia.org/r/697836

Change 693205 merged by Ryan Kemper:

[operations/puppet@production] mjolnir bulk daemon: Add topic for hourly updates

https://gerrit.wikimedia.org/r/693205

Change 698056 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Don't overwrite previous swift uploads

https://gerrit.wikimedia.org/r/698056

Change 693231 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Split kafka channel used for weekly and hourly updates

https://gerrit.wikimedia.org/r/693231

Change 698056 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Don't overwrite previous swift uploads

https://gerrit.wikimedia.org/r/698056

In theory the above should fix the related problems; I will need to monitor on Wednesday, when the weekly job runs, whether things actually work as expected.

Thanks for looking into it! If it's a significant effort to backfill, we can probably do it more easily on our side (we write to the database and the search index at the same time, so we can just go through the DB entries not matching the search index and re-emit the event - that would be T282873: Add Link: Fix production discrepancies between the link recommendation table and the search index).

It's reasonably easy to re-trigger shipping the updates for a given time range; thinking about it, the main problem will be with invalidation. Specifically, I worry that if I re-ship the old updates it will turn back on things that should have been turned off. Alternatively, a reconciliation process like you've mentioned, which can compare and update where state doesn't match, is a solution we've also used in CirrusSearch and sounds like it is more likely to result in a correct output.
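A reconciliation pass of the kind mentioned here could look roughly like this; every helper is a hypothetical stub, and the real comparison would go through the GrowthExperiments link recommendation table and the search index, as tracked in T282873.

```python
# Hypothetical reconciliation sketch: compare the pages that have cached link
# recommendations in the DB with the pages flagged in the search index, and
# re-emit events for the difference. All helper functions are stand-ins.

def pages_with_recommendations_in_db(wiki: str) -> set:
    return {10, 11, 12, 13}  # stub: page ids from the link recommendation table

def pages_flagged_in_search_index(wiki: str) -> set:
    return {10, 11}          # stub: e.g. paging through hasrecommendation:link results

def emit_recommendation_create_event(wiki: str, page_id: int) -> None:
    print(f"re-emit event for {wiki}:{page_id}")  # stub: send via EventGate

def reconcile(wiki: str) -> None:
    in_db = pages_with_recommendations_in_db(wiki)
    in_index = pages_flagged_in_search_index(wiki)
    for page_id in sorted(in_db - in_index):
        # present in the DB but missing from the search index: the event was lost
        emit_recommendation_create_event(wiki, page_id)
    # pages flagged in the index but absent from the DB would need the inverse
    # treatment (invalidation), which is out of scope for this sketch

reconcile("cswiki")
```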

Reviewing the graphs, it seems we've only fixed one of the two problems. As far as I can tell we are no longer losing the hourly updates while the weekly update runs. On the other hand, the hourly updates are still getting blocked behind the weekly update while it runs. A very simple solution would be to run a second copy of the mjolnir daemon, but the puppet end might be less simple: making it run multiple instances, connecting prometheus to the extra ports, etc. The main problem, as far as I can tell, is that we tell the kafka client to only fetch a single record at a time, required as we have some records that can take 30+ minutes to process. From a short look at docs and related solutions, a typical solution involves checking topic lag before polling for new records, and if any priority topics have lag, pausing all the non-priority topics.
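The "check topic lag, then pause the non-priority topics" idea, sketched with kafka-python; this is a minimal sketch under assumed topic names, and the actual MjoLniR implementation (patch below) will differ.

```python
from kafka import KafkaConsumer  # kafka-python

# Minimal sketch of "pause the bulk topics while priority topics have lag".
# Topic names, group id and broker are placeholders; mjolnir's real code differs.
PRIORITY_TOPICS = {"search.hourly-updates"}
BULK_TOPICS = {"search.weekly-updates"}

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="example-bulk-daemon",
    enable_auto_commit=False,
    max_poll_records=1,  # some records take a very long time to process
)
consumer.subscribe(list(PRIORITY_TOPICS | BULK_TOPICS))

def process(record) -> None:
    print("processing", record.topic, record.offset)  # stub for the actual update handling

def priority_lag() -> int:
    """Total unconsumed messages across the prioritized partitions."""
    parts = [tp for tp in consumer.assignment() if tp.topic in PRIORITY_TOPICS]
    if not parts:
        return 0
    end_offsets = consumer.end_offsets(parts)
    return sum(max(0, end_offsets[tp] - consumer.position(tp)) for tp in parts)

while True:
    bulk_parts = [tp for tp in consumer.assignment() if tp.topic in BULK_TOPICS]
    if bulk_parts:
        if priority_lag() > 0:
            consumer.pause(*bulk_parts)   # let timely updates jump ahead of the bulk load
        else:
            consumer.resume(*bulk_parts)
    for tp, records in consumer.poll(timeout_ms=1000).items():
        for record in records:
            process(record)
            consumer.commit()  # commit after each record since processing is slow
```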

Change 699080 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[search/MjoLniR@master] bulk daemon: Pause unprioritized topics when priority topics lag

https://gerrit.wikimedia.org/r/699080

Change 699080 merged by jenkins-bot:

[search/MjoLniR@master] bulk daemon: Pause unprioritized topics when priority topics lag

https://gerrit.wikimedia.org/r/699080

Change 699808 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[search/MjoLniR/deploy@master] bulk_daemon: Deploy prioritized topics

https://gerrit.wikimedia.org/r/699808

Change 699808 merged by Ebernhardson:

[search/MjoLniR/deploy@master] bulk_daemon: Deploy prioritized topics

https://gerrit.wikimedia.org/r/699808

Mentioned in SAL (#wikimedia-operations) [2021-06-14T21:40:01Z] <ebernhardson@deploy1002> Started deploy [search/mjolnir/deploy@baeee47]: T261407 bulk_daemon: Deploy prioritized topics

Mentioned in SAL (#wikimedia-operations) [2021-06-14T21:40:50Z] <ebernhardson@deploy1002> Finished deploy [search/mjolnir/deploy@baeee47]: T261407 bulk_daemon: Deploy prioritized topics (duration: 00m 49s)

Change 699814 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] mjolnir: Provide prioritized topics to bulk daemon

https://gerrit.wikimedia.org/r/699814

Change 697836 merged by Ryan Kemper:

[operations/puppet@production] mjolnir: Stop listening to BC topics

https://gerrit.wikimedia.org/r/697836

Change 699814 merged by Ryan Kemper:

[operations/puppet@production] mjolnir: Provide prioritized topics to bulk daemon

https://gerrit.wikimedia.org/r/699814

Everything should be deployed now. I will be reviewing grafana on Monday to see if everything is working. I expect the number of input files processed per hour to stay more consistent, with hourly files processed as they become available instead of in a big spike at the end.

Change 709546 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[search/MjoLniR@master] bulk_daemon: Sync fetch prioritized topic highwater

https://gerrit.wikimedia.org/r/709546

Prioritization now looks to be working. The consumer group lag chart clearly shows prioritized queues going up and quickly right back down, while the unprioritized job continues to process without blocking the prioritized ones. Lag of updates is increased, sometimes a minute or two, up to a peak of ~30 min. This data is already from 2 or 3 hours ago; it doesn't seem worthwhile to figure out where this occasional lag comes from if the rest of the pipeline isn't trying to be particularly fast anyway. This meets the overall goal of processing hourly data hourly instead of getting backlogged behind the weekly update.