It appears that T393663 is back -- see https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#%22United_States%22_in_search_box. For example, on enwiki, the United States article is not coming up until you type the whole name (see https://en.wikipedia.org/w/api.php?action=opensearch&format=json&formatversion=2&search=United%20Stat&namespace=0&limit=50).
We can look at the scoring information to get an idea of what is going on: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=cirruscompsuggestbuilddoc&titles=United%20States%7CUnited%20Kingdom&formatversion=2
From here we can see United Kingdom gets an autocomplete score of 7,812,221, while United States gets an autocomplete score of 3,246,753. Poking through the score_explanation, the curious bit for United States is that it's getting a popularity_weighted score of 0 and an incoming_links_weighted score of 0. Both of those fields are maintained by a secondary process with a once-a-week batch update, not through the primary edit workflow.
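For anyone else comparing pages this way, a small helper that pulls just the `*_weighted` components out of a cirruscompsuggestbuilddoc response makes the difference obvious at a glance. This is only a sketch: the exact JSON layout of score_explanation is an assumption here, so the key paths may need adjusting against real API output.

```python
# Sketch: extract the *_weighted entries from a page's score_explanation
# so pages can be compared side by side. The response shape below is
# illustrative, not the guaranteed cirruscompsuggestbuilddoc schema.

def weighted_components(page_doc: dict) -> dict:
    """Return only the *_weighted entries from a page's score_explanation."""
    explanation = page_doc.get("cirruscompsuggestbuilddoc", {}) \
                          .get("score_explanation", {})
    return {k: v for k, v in explanation.items() if k.endswith("_weighted")}

# Stub shaped roughly like the United States doc described above:
# both externally maintained components are sitting at 0.
stub = {
    "cirruscompsuggestbuilddoc": {
        "score": 3246753,
        "score_explanation": {
            "popularity_weighted": 0,
            "incoming_links_weighted": 0,
        },
    }
}

print(weighted_components(stub))
```

With a healthy page both weighted components would be large positive numbers; two zeros on a page like United States is the red flag.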
A few things:
- These fields will be auto-populated by the weekly data load. I'm currently poking through the data to make sure there isn't something weird happening there. Basically, the issue should fix itself.
- It's not clear how these fields went missing. The only method I'm aware of for this to happen is if the page was deleted, then restored or re-created, but I don't find anything suggesting that is the case.
- It's not clear how many other pages are affected.
I checked the data lake; we have been consistently generating popularity_score data for United States (project=en.wikipedia, page_id=3434750). Checking further down the pipeline, I can also find the same page in the bulk update files that we push into elasticsearch. Essentially, I'm reasonably certain the updates are still flowing.
The incoming_links data looks basically the same. The incoming_links count on United States was ~1.3M, but no update was shipped because the most recent search index dump already had the exact same count. As far as I can tell, the next round will see the missing count and ship an update.
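The skip-if-unchanged behaviour described above can be sketched in a few lines. The function and field names here are illustrative, not the real pipeline's; the point is that a page whose dump value is missing (or stale) ships an update, while an identical count is skipped.

```python
# Sketch of the skip-if-unchanged logic: an incoming_links update is only
# shipped when the freshly computed count differs from what the last
# search index dump already holds. Names are illustrative.

def updates_to_ship(computed: dict, last_dump: dict) -> dict:
    """Map page_id -> new incoming_links count for pages that changed."""
    return {
        page_id: count
        for page_id, count in computed.items()
        if last_dump.get(page_id) != count
    }

# United States (3434750): the dump already has the same ~1.3M count, so
# no update ships; page 12345 has a stale count, so it does ship.
computed = {3434750: 1_300_000, 12345: 42}
last_dump = {3434750: 1_300_000, 12345: 40}
print(updates_to_ship(computed, last_dump))  # {12345: 42}
```

This also shows why the next round self-heals: once the field is missing from the dump, `last_dump.get(page_id)` no longer matches the computed count, so an update ships.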
So the expected update pipelines all appear to be working as expected. The question remaining is how did the fields go missing in the first place. I haven't found an answer yet, but will poke around some more.
I've kicked off a re-import of this week's data, which will re-populate the popularity_score across all pages. As mentioned above, this does not include the incoming_links count, only the popularity_score. The weekly data import takes about 30 hours to get indexed, and then we also have to wait for the daily completion suggester rebuild to pick those values up. Optimistically, the visible part of the problem should resolve itself on Thursday.
I've also been pondering how we might detect this kind of issue in the future. Perhaps, during the daily completion suggester rebuild, we could increment a counter every time we notice that we are missing some externally populated fields. We could then monitor that count for a week or two to find the typical range of values and alert whenever we get outside those limits.
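The detection idea above might look something like the sketch below: count documents with missing externally populated fields during each daily rebuild, establish a baseline range over a week or two, and alert when a day's count falls well outside it. The field names and the simple max-times-slack threshold are assumptions for illustration; a real version would presumably feed the counter into the existing metrics/alerting stack.

```python
# Sketch: during the daily completion suggester rebuild, count docs that
# are missing externally populated fields, then compare against the range
# observed over a baseline window. Field names and the thresholding rule
# are illustrative assumptions.

EXTERNAL_FIELDS = ("popularity_score", "incoming_links")  # assumed names

def count_missing(docs) -> int:
    """Number of docs missing (or zeroed, as seen in this incident) at
    least one externally populated field."""
    return sum(
        1 for doc in docs
        if any(doc.get(field) in (None, 0) for field in EXTERNAL_FIELDS)
    )

def is_anomalous(today: int, baseline: list, slack: float = 1.5) -> bool:
    """Alert when today's count exceeds the observed maximum by `slack`x."""
    return today > max(baseline) * slack

baseline = [120, 98, 143, 110, 131, 99, 125]  # a week of rebuild counts
docs = [
    {"popularity_score": 0.1, "incoming_links": 5},
    {"popularity_score": 0, "incoming_links": 0},  # a United States-style doc
]
print(count_missing(docs), is_anomalous(600, baseline))
```

Treating 0 the same as missing is deliberate here, since the incident manifested as zeroed weighted scores rather than absent keys.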
General investigation:
- The search index dump taken on June 22 has both fields populated
- The search index dump taken on June 29 has both fields empty
- The _version count, which tracks the number of times the document has been updated since the index was created (October 2024), is 63 in eqiad and 78 in codfw. United Kingdom is at 475, and Main_Page is at 6,421. Based on the page history this should be far higher, which strengthens the idea that the page was deleted from the search index sometime between June 22 and June 29.
- The clusters each update from the same central stream. Since both clusters show the same issue, it suggests a delete was sent through that stream.
- I pulled the cirrussearch.update_pipeline.update.v1 events for 2025/06/22 - 2025/06/30; these include a PAGE_DELETE event at 2025-06-23T04:17:00Z with a request_id of 4a671902-b8f8-4bea-b18c-b2072758007c.
- That request_id maps to an api.php request with trigger page-edit. The revision created is: https://en.wikipedia.org/w/index.php?title=United_States&oldid=1296923565
- This revision accidentally converted the page into a redirect, then moments later another revision fixed it. Redirects are not represented in the search index as their own page, so the page was deleted from the search index and the redirected-to page was updated.
- When the edit was reverted we reloaded the page data into the search index, but that reload is missing the batch data that only comes once a week. This is "working as designed": it was known when building out the pipeline that these external pieces of data would not exist for new pages, but that was assumed acceptable because, for genuinely new pages, the counts would be very close to 0 anyway.
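The event hunt in the list above can be sketched as a simple filter over the pulled event stream: keep only PAGE_DELETE events for the page in question within the time window. The event field names (event_type, page_id, dt, request_id) are assumptions about the schema, used only to illustrate the filtering.

```python
# Sketch: scan pulled cirrussearch.update_pipeline.update.v1 events for a
# time window and return the PAGE_DELETE events for a given page id.
# Field names are assumed for illustration, not the real event schema.

from datetime import datetime, timezone

def page_deletes(events, page_id, start, end):
    """Return PAGE_DELETE events for page_id with start <= dt < end."""
    return [
        e for e in events
        if e["event_type"] == "PAGE_DELETE"
        and e["page_id"] == page_id
        and start <= datetime.fromisoformat(e["dt"].replace("Z", "+00:00")) < end
    ]

# The two events relevant to this incident, stubbed out.
events = [
    {"event_type": "PAGE_DELETE", "page_id": 3434750,
     "dt": "2025-06-23T04:17:00Z",
     "request_id": "4a671902-b8f8-4bea-b18c-b2072758007c"},
    {"event_type": "REV_BASED_UPDATE", "page_id": 3434750,
     "dt": "2025-06-23T04:20:00Z"},
]
start = datetime(2025, 6, 22, tzinfo=timezone.utc)
end = datetime(2025, 6, 30, tzinfo=timezone.utc)
print(page_deletes(events, 3434750, start, end))
```

The `replace("Z", "+00:00")` keeps the timestamp parsing working on Python versions older than 3.11, which accepts the trailing Z directly.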
So we know why this happened, and the system is currently working as designed. Perhaps we need to revisit the design to ask how this can be avoided in the future. But given that the system repairs itself on the timescale of a week or two, that it has worked like this for years, and that it hasn't been an issue in the past, I'm wary of investing significant time into that fix unless we see this occurring more often.
There is some potential to address the problem in some circumstances, particularly where the issue is quickly reverted, but it would be a very special-cased solution. Essentially, when the update pipeline receives a request to update a page, it holds that request for about 5 minutes waiting for other events (async ML predictions, etc.) to come in for the same page. Currently the system makes no attempt to merge delete requests; perhaps we could detect that we have events to both delete and un-delete the same page id and skip the deletion.
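That special-cased merge could look roughly like this: while events for a page sit in the buffering window, a PAGE_DELETE that is followed by an event re-creating the same page id cancels out and is never emitted. The re-create event type names here are illustrative assumptions, not the pipeline's actual vocabulary.

```python
# Sketch of the proposed delete/un-delete merge: within one buffered
# window of events, drop a PAGE_DELETE whenever a later event re-creates
# the same page id. Event type names other than PAGE_DELETE are assumed.

RECREATE_TYPES = {"PAGE_CREATE", "PAGE_UNDELETE"}  # assumed type names

def coalesce(buffered_events):
    """Drop a PAGE_DELETE when a later buffered event re-creates the page."""
    out = []
    for i, event in enumerate(buffered_events):
        if event["event_type"] == "PAGE_DELETE" and any(
            later["page_id"] == event["page_id"]
            and later["event_type"] in RECREATE_TYPES
            for later in buffered_events[i + 1:]
        ):
            continue  # delete cancelled by a subsequent re-create
        out.append(event)
    return out

# A window like the June 23 incident: redirect-conversion delete followed
# moments later by the fix that re-created the page.
window = [
    {"event_type": "PAGE_DELETE", "page_id": 3434750},
    {"event_type": "PAGE_CREATE", "page_id": 3434750},
]
print(coalesce(window))  # only the PAGE_CREATE survives
```

With this merge, the search index document would never be removed, so the weekly batch fields would survive a fast revert; a delete with no matching re-create in the window still goes through unchanged.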
This seems to be resolved. The newly reconstituted "United States" page is showing up in the top ten suggestions for all substrings of "united states".
Thanks for digging into it and finding the root cause, @EBernhardson!