
[Research Engineering Request] Produce regular snapshots of embedding indexes for list-building tool
Open, Needs Triage · Public

Description

Goal

The list-building tool (T348332) depends on two sets of article embeddings (alongside the morelike Search functionality) for generating article recommendations. Currently both sets of embeddings are static and are only updated when the research scientist manually regenerates them and adds them to the tool. Automating this process so that monthly updates are easy to make should be sufficient for most purposes. Note: this task is most similar to T348823, but it is separate from questions around the actual nearest-neighbor lookup approach, which are covered by T348822.

Why

The list-building tool is being prototyped to assist organizers in building article worklists, and potentially also as a way to find editors relevant to a given campaign. If the embedding indexes are not updated, the tool will only be able to recommend articles that existed when its last snapshot was generated (e.g., articles that are at least six months old). This will particularly impact its utility for organizers working on emergent topics and current events.

Engineering required

The updates can be broken down into five stages, with very similar needs for both the reader model and the link-based model. Ideally all five stages would be automated, though in practice the final stage of moving the files to Cloud VPS is a major challenge given the size of the embeddings and may have to remain manual until the tool is moved to an internal server where it is easier to move large files.

Stage 1: Collecting the data

For each model, we need to run a job that gathers the necessary features from Hive tables:

  • Reader model:
    • Data generation code: this job runs through a week of reader data, builds sessions of pages that were read together, and outputs these sessions to flat files (see the first sketch after this list).
  • Link-based model:
    • Data generation code: same as T351118, a simple job extracts all the pagelinks for Wikipedia articles and maps them to their respective QIDs, producing flat files of articles and their respective links (see the second sketch after this list).
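As a rough illustration of the reader-model job, the sketch below builds weekly reading sessions with PySpark. The table name, column names, and partition value are assumptions for illustration, not the actual schema.

```python
# Hedged sketch of the Stage 1 job for the reader model: scan a week of reader
# data and build sessions of pages read together. Table/column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reader-sessions").getOrCreate()

pageviews = spark.sql("""
    SELECT actor_signature, page_qid, dt
    FROM   reader_pageviews_table          -- assumed source table
    WHERE  week = '2023-W46'               -- assumed weekly partition
""")

# One session per reader: the QIDs they viewed, ordered by time, written as a
# space-separated line so the Stage 2 model can treat each session as a "sentence".
sessions = (pageviews
            .groupBy("actor_signature")
            .agg(F.sort_array(F.collect_list(F.struct("dt", "page_qid"))).alias("views"))
            .select(F.concat_ws(" ", F.col("views.page_qid")).alias("session")))

sessions.write.mode("overwrite").text("/tmp/reader_sessions")
```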
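Similarly, a hedged sketch of the link-based job: extract pagelinks, map page IDs to QIDs, and write one line per article. Again, the table and column names are placeholders, not the actual schema.

```python
# Hedged sketch of the Stage 1 job for the link-based model: extract pagelinks
# for Wikipedia articles, map them to QIDs, and write flat files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pagelinks-to-qids").getOrCreate()

links = spark.sql("""
    SELECT src.qid AS source_qid,
           dst.qid AS target_qid
    FROM   pagelinks_table pl                 -- assumed pagelinks table
    JOIN   page_to_qid_table src              -- assumed page -> QID mapping
           ON pl.wiki_db = src.wiki_db AND pl.page_id = src.page_id
    JOIN   page_to_qid_table dst
           ON pl.wiki_db = dst.wiki_db AND pl.target_page_id = dst.page_id
    WHERE  pl.snapshot = '2023-11'            -- assumed monthly partition
""")

# One line per article: the article QID followed by the QIDs it links to.
(links.rdd
      .map(lambda r: (r.source_qid, r.target_qid))
      .groupByKey()
      .map(lambda kv: kv[0] + " " + " ".join(kv[1]))
      .saveAsTextFile("/tmp/pagelinks_qids"))
```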

Stage 2: Train a model for producing the embeddings
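No sub-tasks are written down here yet. As a rough sketch (assuming fastText, which the current reader-model API already uses per Stage 4 below), training on the Stage 1 flat files might look like the following; the file name and hyperparameters are placeholders.

```python
# Hedged sketch: train an unsupervised fastText model on the Stage 1 flat files,
# where each line is a reading session or an article plus its links (QID tokens).
import fasttext

model = fasttext.train_unsupervised(
    "stage1_output.txt",   # placeholder path to the Stage 1 flat files
    model="skipgram",
    dim=50,                # embedding dimensionality (placeholder)
    minCount=1,            # keep rare QIDs so every article can get a vector
)
model.save_model("embedding_model.bin")
```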

Stage 3: Generating the embeddings

  • Reader model:
  • Link-based model:
    • Code: in this case, the items we want embeddings for are documents (the articles) rather than individual words, so we still need to run every article in the Stage 1 data back through the trained model and output its document vector (see the sketch after this list).
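A minimal sketch of this step, assuming a fastText model and the link-based flat-file format from Stage 1, where the first token on each line is the article QID and the remaining tokens are its links; file names are placeholders.

```python
# Hedged sketch: regenerate a document vector for every article from Stage 1
# by passing its link tokens back through the trained model.
import fasttext

model = fasttext.load_model("embedding_model.bin")

with open("pagelinks_qids.txt") as infile, open("article_embeddings.tsv", "w") as out:
    for line in infile:
        tokens = line.strip().split()
        if not tokens:
            continue
        qid = tokens[0]  # assumes the first token on each line is the article QID
        vec = model.get_sentence_vector(" ".join(tokens[1:]))
        out.write(qid + "\t" + " ".join(f"{x:.5f}" for x in vec) + "\n")
```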

Stage 4: Building a nearest-neighbor index

  • Reader model:
    • Again, the current API just uses the fastText model for this, but a long-term solution would use the same approach as the link-based model.
  • Link-based model:
    • Code (scroll down to the "Build Annoy index" header): once the embeddings have been created, building the nearest-neighbor index is relatively simple, though it can take a while; in my experience it can be done effectively on the stat boxes but not on a Cloud VPS instance (see the sketch after this list). The index is a much larger file than the embeddings: currently the embeddings are about 1.8GB while the index is 20GB.
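A minimal sketch of the index build, assuming Annoy and the per-article embedding file from Stage 3; the dimensionality, tree count, and file names are placeholders.

```python
# Hedged sketch: build an Annoy nearest-neighbor index from the Stage 3 embeddings.
from annoy import AnnoyIndex

DIM = 50  # must match the embedding dimensionality from Stage 2 (placeholder)
index = AnnoyIndex(DIM, "angular")
qids = []

with open("article_embeddings.tsv") as infile:
    for i, line in enumerate(infile):
        qid, vec_str = line.rstrip("\n").split("\t")
        index.add_item(i, [float(x) for x in vec_str.split()])
        qids.append(qid)

index.build(25)  # number of trees: more trees -> better recall but a larger index
index.save("article_index.ann")

# Keep the row -> QID mapping alongside the index so results can be mapped back.
with open("article_index_qids.txt", "w") as out:
    out.write("\n".join(qids))
```

A lookup on the finished index is then a single call, e.g. index.get_nns_by_item(i, 10) for the ten nearest neighbors of item i.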

Stage 5: Move the index to the Cloud VPS instance

For both models, the files are large enough that this essentially must be done by scp-ing the files from the relevant stat machine to your local machine, and then from your local machine up to the Cloud VPS instance. See T341582 for another example of web publication not supporting large file downloads. Given that this is still a prototype and this issue will eventually be solved with actual productization, I'm less concerned about automating this final step (though it certainly would be nice).

Event Timeline

We reviewed this task in the November 21st backlog grooming meeting. The decision was to prioritize it. A few more details:

  • @fkaelin will work on breaking this task down into smaller tasks. Our understanding is that a version of stages 3-5 is currently in progress. Fabian, this is your todo.
  • I'm moving this task to the Q3 column for now, given that the engineering unit is at capacity with existing work for the coming two weeks. Work can start on this task earlier in the quarter if capacity opens up.
  • @Miriam confirmed that this is a priority for Isaac, and Isaac will make time to work on it with Fabian and other engineers once it is prioritized. Please coordinate on a start time.