
Add Link: refreshLinkRecommendations.php does not write to the search index on beta
Closed, Resolved (Public)

Description

tgr@deployment-deploy01:~$ mwscript extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php cswiki --verbose --topic media > ~/log.txt
tgr@deployment-deploy01:~$ ack 'success, updating index' log.txt | sort
    checking candidate Assassin's_Creed_(film)... success, updating index
    checking candidate Baltic_Song_Contest_2016... success, updating index
    checking candidate Digital_Mobile_Radio... success, updating index
    checking candidate Ed_Sheeran... success, updating index
    checking candidate František_Fuka... success, updating index
    checking candidate Hendrik_Duryn... success, updating index
    checking candidate Jiří_Helekal... success, updating index
    checking candidate Kateřina_Chroboková... success, updating index
    checking candidate Křížek_(znak)... success, updating index
    checking candidate LA4... success, updating index
    checking candidate Lukáš_Ladra... success, updating index
    checking candidate M.I.A.... success, updating index
    checking candidate Předstírání... success, updating index
    checking candidate Quédate_Conmigo... success, updating index
    checking candidate Rudolf_Janíček... success, updating index
    checking candidate Sasser... success, updating index
    checking candidate William_Regal... success, updating index

tgr@deployment-deploy01:~$ sql cswiki
MariaDB [cswiki]> select convert(cast(page_title as binary) using utf8) from page join growthexperiments_link_recommendations on page_id = gelr_page order by page_title;
+------------------------------------------------+
| convert(cast(page_title as binary) using utf8) |
+------------------------------------------------+
| Assassin's_Creed_(film)                        |
| Baltic_Song_Contest_2016                       |
| Digital_Mobile_Radio                           |
| Ed_Sheeran                                     |
| Emma_Hewitt                                    |
| František_Fuka                                 |
| Hans_Weiss                                     |
| Hendrik_Duryn                                  |
| Jan_Raszka                                     |
| Jaroslav_Malina_(antropolog)                   |
| Jiří_Helekal                                   |
| Kateřina_Chroboková                            |
| Křížek_(znak)                                  |
| LA4                                            |
| Lukáš_Ladra                                    |
| M.I.A.                                         |
| Postavy_seriálu_Ulice                          |
| Předstírání                                    |
| Quédate_Conmigo                                |
| Robert_Artur_Pierug                            |
| Rudolf_Janíček                                 |
| Sasser                                         |
| Satyricon                                      |
| Uwe_Büschken                                   |
| Vladimír_Novák_(judista)                       |
| William_Regal                                  |
+------------------------------------------------+
26 rows in set (0.01 sec)


tgr@deployment-deploy01:~$ curl -s 'https://cs.wikipedia.beta.wmflabs.org/w/api.php?format=json&formatversion=2&action=query&list=search&srlimit=1000&srsearch=hasrecommendation:link' | jq --raw-output '.query.search[].title' | sort
ARMAN
Austin a Ally (1. řada)
Emma Hewitt
Felicia Dunaf
Hans Weiss
I2P
Ich (album)
Jan Raszka
Jaroslav Malina (antropolog)
Letecké palivo
Lokomotiva ČS8
Marcelo Burlon
Michail Ivanovič Glinka
Postavy seriálu Ulice
R5 (hudební skupina)
Ricardo Chavira
Robert Artur Pierug
Satyricon
Sentinelština
Serious games
Seznam postav seriálu Včelka Mája
Tepna Náchod
Tomasz Adamek
Uwe Büschken
Vladimír Novák (judista)
Zemětřesení v Myanmaru 2016
Zemní plyn na Ukrajině

The DB matches the log (some entries were already there before the script ran), but the search index shows an entirely different set of pages. UpdateWeightedTags.php seems to work correctly (eventually, though it takes something like a quarter of an hour), so this is probably not a search infrastructure issue, since both scripts use the same CirrusSearch method internally.
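
The comparison above was done by eye. A minimal Python sketch of the same check follows, assuming the two title lists have already been collected (for example from the SQL query and the API call above). The helper names `normalize_title` and `compare_titles` are hypothetical, and the sample data is only a handful of titles copied from the outputs above, not the full lists. The DB stores titles with underscores while the search API returns them with spaces, so titles are normalized before comparing.

```python
def normalize_title(title: str) -> str:
    """Convert a DB-style title (underscores) to display form (spaces)."""
    return title.replace("_", " ")


def compare_titles(db_titles, index_titles):
    """Return (missing_from_index, unexpected_in_index) as sorted lists."""
    db = {normalize_title(t) for t in db_titles}
    index = {normalize_title(t) for t in index_titles}
    return sorted(db - index), sorted(index - db)


# A few titles copied from the outputs above, not the full lists.
db_sample = ["Ed_Sheeran", "Emma_Hewitt", "Hans_Weiss", "Sasser"]
index_sample = ["ARMAN", "Emma Hewitt", "Hans Weiss", "I2P"]

missing, unexpected = compare_titles(db_sample, index_sample)
print("in DB but not in search index:", missing)
print("in search index but not in DB:", unexpected)
```

With the full lists, every title the script reported as "success, updating index" would show up in the first list, which is the symptom described in this task.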

Event Timeline

@Tgr have you tried rebuilding the search index from scratch? (Or was that done recently? I can't remember.) Can we view the logs of the Elasticsearch application on the beta cluster?

Tagging Discovery-Search team for visibility though we'll see if we can figure this out ourselves.

ORES uses the same search index fields and importOresTopics.php works (and setting hasrecommendation flags with UpdateWeightedTags.php works also) so this can't be an index issue, and in general seems unlikely to be a search infrastructure issue: those three scripts use the same mechanism for setting tags.
(I found it surprising how slowly setting tags works: it takes something like half an hour. But that's unlikely to be related to this issue, although it does complicate debugging.)

This was caused by a bug in CirrusSearchIndexUpdater where the revision ID was used instead of the page ID. Since I imported a bunch of single-revision pages in my local setup, I ended up with a test site where that still produced reasonable-looking results.
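
To illustrate why this kind of bug can go unnoticed locally, here is a toy Python model. It is not the GrowthExperiments or CirrusSearch code: the `Page` class, the `tag_pages` function and the sample data are all hypothetical. It also assumes that on a freshly imported wiki with one revision per page the revision IDs happen to coincide with the page IDs, which the comment above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    page_id: int
    latest_rev_id: int
    title: str


def tag_pages(pages, targets, use_rev_id):
    """Return the titles that end up tagged in an index keyed by page ID.

    use_rev_id=True mimics the bug: the revision ID is sent as the document ID.
    """
    by_page_id = {p.page_id: p.title for p in pages}
    tagged = set()
    for p in pages:
        if p.title not in targets:
            continue
        doc_id = p.latest_rev_id if use_rev_id else p.page_id
        # In this toy model an ID that matches no page is silently dropped.
        if doc_id in by_page_id:
            tagged.add(by_page_id[doc_id])
    return tagged


# Hypothetical fresh import: one revision per page, so the IDs coincide.
fresh = [Page(1, 1, "A"), Page(2, 2, "B"), Page(3, 3, "C")]
# Hypothetical edited wiki: revision IDs have drifted away from page IDs.
edited = [Page(1, 3, "A"), Page(2, 7, "B"), Page(3, 2, "C")]

print(sorted(tag_pages(fresh, {"A", "B"}, use_rev_id=True)))
print(sorted(tag_pages(edited, {"A", "B"}, use_rev_id=True)))
print(sorted(tag_pages(edited, {"A", "B"}, use_rev_id=False)))
```

In this model the buggy path tags the right pages on the fresh import, but on the edited wiki it tags an unrelated page (or nothing), which would be consistent with the search index on beta listing pages that have no recommendation in the DB.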

Change 673766 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Fix CirrusSearchIndexUpdater

https://gerrit.wikimedia.org/r/673766

Change 673766 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Fix CirrusSearchIndexUpdater

https://gerrit.wikimedia.org/r/673766

This now seems to be working correctly (although with a 30-40 minute delay whose cause I don't understand, but that doesn't cause any problems and actually makes beta behave more like production with respect to index updates).