Goal
The list-building tool (T348332) depends on two sets of article embeddings (alongside the morelike Search functionality) for generating article recommendations. Currently both sets of embeddings are static and are only updated when the research scientist manually regenerates them and adds them to the tool. Automating this process so that monthly updates are easy to make should be sufficient for most purposes. Note: this task is most similar to T348823 but is separate from questions around the actual nearest-neighbor lookup approach, which is covered by T348822.
Why
The list-building tool is being prototyped for organizers to assist in building article worklists and potentially also as a way to find editors relevant to a given campaign. If the embedding indexes are not updated, the tool will be limited to recommending articles that are, e.g., at least 6 months old (or however old its last snapshot is). This will particularly impact its utility in supporting organizers working on emergent topics / current events.
Engineering required
The updates can be broken down into five stages, with very similar needs for both the reader model and the link-based model. Ideally all five stages would be automated, though in practice the final stage of moving the files to Cloud VPS is a major challenge given the size of the embeddings and may have to remain manual until the tool is moved to an internal server where it's easier to move large files.
Stage 1: Collecting the data
For each model, we need to run a job that gathers the necessary features from Hive tables:
- Reader model:
- Data generation code: this code runs through a week of reader data, builds sessions of pages that were read together, and outputs these sessions to flat files (a sketch follows this list).
- Link-based model:
- Data generation code: same as T351118, a simple job extracts all the pagelinks for Wikipedia articles and maps them to their Wikidata QIDs. This produces flat files of articles and their links (also sketched below).
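For the reader model, a minimal PySpark sketch of what the session-building job might look like. The table (`wmf.pageview_actor`), column names, date range, and output path are assumptions for illustration, not necessarily what the actual data generation code uses.

```python
"""Sketch: build reading sessions from a week of pageview data.
Table/column names and paths are illustrative assumptions."""
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reader-sessions").getOrCreate()

pageviews = spark.sql("""
    SELECT actor_signature,
           pageview_info['page_title'] AS page_title
    FROM wmf.pageview_actor                      -- assumed source table
    WHERE year = 2024 AND month = 1 AND day BETWEEN 1 AND 7
      AND is_pageview
      AND pageview_info['project'] = 'en.wikipedia'
""")

# Group the pages read by the same actor into one space-separated "session"
# per line, which is the flat-file format the training stage expects.
sessions = (
    pageviews
    .groupBy("actor_signature")
    .agg(F.concat_ws(" ", F.collect_list("page_title")).alias("session"))
    .where(F.size(F.split("session", " ")) > 1)  # drop single-page sessions
    .select("session")
)

sessions.write.mode("overwrite").text("/user/analytics/reader_sessions")
```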
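For the link-based model, a sketch of the pagelinks extraction under assumed table and column names (`wmf_raw.mediawiki_pagelinks`, `wmf.wikidata_item_page_link`); the real job may use different sources or schemas.

```python
"""Sketch: extract pagelinks and map pages to their Wikidata QIDs.
Table/column names are illustrative assumptions."""
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("link-graph").getOrCreate()

links_by_qid = spark.sql("""
    SELECT src_map.item_id AS source_qid,
           dst_map.item_id AS target_qid
    FROM wmf_raw.mediawiki_pagelinks pl          -- assumed link table
    JOIN wmf.wikidata_item_page_link src_map     -- assumed QID mapping
      ON pl.pl_from = src_map.page_id
    JOIN wmf.wikidata_item_page_link dst_map
      ON pl.pl_target_page_id = dst_map.page_id
    WHERE pl.wiki_db = 'enwiki'
      AND src_map.wiki_db = 'enwiki'
      AND dst_map.wiki_db = 'enwiki'
""")

# One line per article: its QID followed by the QIDs it links to.
articles = (
    links_by_qid
    .groupBy("source_qid")
    .agg(F.concat_ws(" ", F.collect_list("target_qid")).alias("outlinks"))
    .select(F.concat_ws(" ", "source_qid", "outlinks").alias("line"))
)

articles.write.mode("overwrite").text("/user/analytics/article_links")
```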
Stage 2: Train a model for producing the embeddings
- Reader model:
- Model training code: this code takes the output sessions from Stage 1 and trains an unsupervised fastText model on them, which learns an embedding for each article based on which other articles it commonly appears near in the reading sessions (a training sketch follows this list).
- Link-based model:
- In practice, I don't retrain the link-based model that often and we should just use the one currently hosted on LiftWing. If we did retrain, however, there is code for getting the groundtruth labels, converting the link data from Stage 1 into fastText format, splitting it into train/val/test sets, and then actually training the model.
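For the reader model, a minimal training sketch using the fastText Python bindings. The hyperparameters and file paths are placeholders, not the values used in the actual training code.

```python
"""Sketch: train an unsupervised fastText model on the Stage 1 sessions.
Hyperparameters and paths are placeholders."""
import fasttext

# Each line of sessions.txt is one reading session: space-separated page
# identifiers. fastText treats each article as a "word" and learns an
# embedding from its co-occurrence with other articles within sessions.
model = fasttext.train_unsupervised(
    "sessions.txt",
    model="skipgram",   # skip-gram tends to handle rare "words" better
    dim=50,             # embedding dimensionality
    minCount=5,         # drop articles seen in fewer than 5 sessions
    ws=5,               # context window within a session
    epoch=5,
)

model.save_model("reader_model.bin")
```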
Stage 3: Generating the embeddings
- Reader model:
- This is trivial -- once the model is trained, the embeddings can just be extracted via get_word_vector for each of the articles (words), as sketched after this list. Currently, the fastText model itself is actually shipped to Cloud VPS because it supports nearest-neighbor lookups natively, but long-term we'll want to incorporate the reader embeddings into the same vector-search framework that the link-based model uses.
- Link-based model:
- Code: in this case, the articles we want embeddings for are documents (not words), so we still need to run every article in the data from Stage 1 back through the trained model and output its document vector (also covered in the sketch below).
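A sketch of both extraction paths, assuming the trained models and Stage 1 outputs are available locally; file names and the TSV output format are placeholders.

```python
"""Sketch: dump article embeddings from both trained models to TSV files.
File names and output format are placeholders."""
import fasttext

# Reader model: each article is a word in the vocabulary, so its
# embedding comes straight from get_word_vector.
reader = fasttext.load_model("reader_model.bin")
with open("reader_embeddings.tsv", "w") as out:
    for article in reader.words:
        vec = reader.get_word_vector(article)
        out.write(article + "\t" + " ".join(f"{v:.5f}" for v in vec) + "\n")

# Link-based model: each article is a document (its space-separated
# outlink QIDs), so we run the Stage 1 data back through the model and
# take the document vector instead.
link_model = fasttext.load_model("link_model.bin")
with open("link_embeddings.tsv", "w") as out:
    for line in open("article_links.txt"):
        qid, _, outlinks = line.strip().partition(" ")
        vec = link_model.get_sentence_vector(outlinks)
        out.write(qid + "\t" + " ".join(f"{v:.5f}" for v in vec) + "\n")
```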
Stage 4: Building a nearest-neighbor index
- Reader model:
- Again, the current API just uses the fastText model for this, but a long-term solution would rely on the same approach as the link-based model.
- Link-based model:
- Code (but scroll down to the Build Annoy index header): once the embeddings have been created, it's relatively simple to build the nearest-neighbor index (sketched below), though it can take a while and in my experience can be done effectively on the stat boxes but not on a Cloud VPS instance. The index is a much larger file than the embeddings: currently the embeddings are about 1.8GB while the index is 20GB.
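A minimal Annoy build sketch, assuming the TSV format from the Stage 3 sketch; the dimensionality, distance metric, and tree count are placeholders rather than the values used in the real index.

```python
"""Sketch: build an Annoy nearest-neighbor index from an embeddings TSV.
Dimensionality, metric, and tree count are placeholders."""
from annoy import AnnoyIndex

DIM = 50
index = AnnoyIndex(DIM, "angular")   # cosine-style distance
qids = []

with open("link_embeddings.tsv") as f:
    for i, line in enumerate(f):
        qid, vector_str = line.rstrip("\n").split("\t")
        index.add_item(i, [float(v) for v in vector_str.split()])
        qids.append(qid)             # keep the row -> QID mapping separately

index.build(100)                     # more trees = better recall, bigger file
index.save("link_index.ann")

# Example lookup: ten nearest neighbors of the first article.
print([qids[j] for j in index.get_nns_by_item(0, 10)])
```

More trees improve lookup quality but grow the on-disk index, which is part of why the index file (20GB) is so much larger than the raw embeddings (1.8GB).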
Stage 5: Move the index to the Cloud VPS instance
For both models, the files are large enough that this essentially must be done by scp-ing the files locally from the relevant stat machine and then from your local machine up to the Cloud VPS instance. See T341582 for another example of web publication not supporting large file downloads. Given that this is still a prototype and this issue will eventually be solved with actual productization, I'm less concerned about automating this final step (though it certainly would be nice).