
Investigate what would be required to include countries in ORES and accessible via a search keyword
Open, Needs Triage, Public

Description

ORES topics currently exist for regions, e.g. "central-africa", "east-asia", "western-europe", etc. We use those for topic mappings in the GrowthExperiments-NewcomerTasks feature in GrowthExperiments (see T240517: [EPIC] Growth: Newcomer tasks 1.1.1 (ORES topics) for background).

This task is for researching what is needed to include countries in the list of available ORES articletopics. In theory those would then also be queryable via the articletopic search keyword, so this task is also for researching what is needed to make the search integration work.
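For reference, a minimal sketch of what querying by topic looks like from a client today, assuming the standard MediaWiki search API (`action=query`, `list=search`) and the `articletopic:` keyword with `|` acting as OR between topics. The helper name and the host are illustrative only, and no request is sent.

```python
from urllib.parse import urlencode


def build_articletopic_search_url(wiki_host, topics, limit=10):
    """Build a MediaWiki search API URL that filters on articletopic.

    `topics` uses the existing region-style names from this task,
    e.g. "central-africa". Country names would only work once they
    exist as topics.
    """
    # Assumption: several topics joined with "|" are treated as OR.
    keyword = "articletopic:" + "|".join(topics)
    params = {
        "action": "query",
        "list": "search",
        "srsearch": keyword,
        "srlimit": limit,
        "format": "json",
    }
    return f"https://{wiki_host}/w/api.php?{urlencode(params)}"


url = build_articletopic_search_url(
    "es.wikipedia.org", ["central-africa", "east-asia"]
)
print(url)
```

If countries were added as topics, the only client-side change in this sketch would be passing a country name in `topics`.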

This task originated out of researching options for getting articles about Argentina in front of newcomers for a campaign (T301030: Account creation: getting articles about Argentina, Chile, and Mexico into the Suggested Edits module). In T301030, we are figuring out options for doing this that do not involve Discovery-Search, Machine-Learning-Team or ORES; those options will all involve one-off hacks of one kind or another. This task is for figuring out a longer-term, more sustainable approach to getting countries-as-topics into ORES and search.

It's valuable to have input from various stakeholders (Growth-Team, Discovery-Search, Research and Machine-Learning-Team) about what is needed to do this, and it is not time-sensitive to collect that input.

The output from this task should be some outline of what we need to do to make this happen and what the next steps might be.

Event Timeline

kostajh added a subscriber: Gehel.

From chatting with @Gehel it sounds like there is not a lot that the Discovery-Search team would need to do once "argentina" exists as an ORES topic; although from looking at https://wikitech.wikimedia.org/wiki/Search/articletopic it seems like a re-processing of wikis would be needed.

We are hoping to be able to query this term in production for an event on March 11 that is for eswiki (T301030), so specifically we would need eswiki to be updated by the beginning of March. Other wikis could be updated later.

@Isaac as I understand your comment in T301030#7690294, your work there wasn't specifically done with the objective of including the output in ORES articletopic. But I'm wondering if it is something that could be stitched into the existing ORES articletopic data – @calbon does that seem possible or realistic to achieve in the next 2-3 weeks?

We talked about this a bit within search platform today. We think one straightforward way to go about this, taking into consideration the time constraints, will be to choose a new unique weighted_tag prefix, load data into it, and query it from a custom keyword.

  • It doesn't seem likely that ORES would be updated to emit these predictions alongside its typical predictions within the time allotted while still leaving time to deploy and verify all the other steps.
  • weighted_tags, the machinery that provides the articletopic keyword, is fairly easy to reuse for multiple purposes. For one-time loading of data the UpdateWeightedTags.php maintenance script in CirrusSearch can be used.
  • A one-off keyword will need to be implemented that queries the new tag. This is mostly boilerplate code to wire it into the parsing process, shouldn't be anything novel.

The hard problems, mostly because there is no particular rule and we have to live with the results:

  • Choose an appropriate prefix for the one-off weighted_tag
  • Choose an appropriate search keyword (what is typed into the search box to make it happen)
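To make the "load data into a new prefix" step concrete, here is a rough sketch of turning model scores into weighted-tag strings. It assumes the `prefix/tag|weight` encoding with integer weights scaled to 0-1000, which is my understanding of the weighted_tags field and should be checked against the CirrusSearch docs. The prefix, the threshold and the scores are all made up for illustration.

```python
# Hypothetical prefix; choosing the real one is one of the open questions above.
HYPOTHETICAL_PREFIX = "example.country"


def encode_weighted_tags(prefix, scores, threshold=0.5):
    """Turn {tag: probability} into 'prefix/tag|weight' strings.

    Assumption: weights are integers obtained by scaling a 0-1
    probability to 0-1000. Predictions below `threshold` are dropped.
    """
    tags = []
    for tag, probability in sorted(scores.items()):
        if probability < threshold:
            continue
        weight = round(probability * 1000)
        tags.append(f"{prefix}/{tag}|{weight}")
    return tags


# Made-up scores for a single article.
example_scores = {"argentina": 0.92, "chile": 0.31}
encoded = encode_weighted_tags(HYPOTHETICAL_PREFIX, example_scores)
print(encoded)
```

Strings of this kind would be the per-page input to a one-time load. The sketch does not cover how UpdateWeightedTags.php expects that input to be supplied.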

I like what @EBernhardson proposed and specifically agree that it doesn't seem likely that ORES would be updated to emit these predictions alongside its typical predictions within the time allotted while still leaving time to deploy and verify all the other steps.

To get a handle on the keyword space, I added below the list of countries that are currently potential labels output by this model (you can ignore the sub-continent region column for now). My initial suggestion would be to go with either country: or region: as the prefix. The former ("country") is more specific, but I sometimes try to avoid it because the model is a bit more inclusive than just UN-recognized countries, even though that's the starting place. Given that this will largely be used by power users and tool developers, however, I'm fine with either and don't think we need the perfect word (which doesn't exist).

For the actual keywords, I assume we'll just have to design a config to map the countries to short forms. Two options I can think of:

  • Use the country codes. Most of these are standard, though my list includes ~10 regions without country codes. For example, Wales does not have a code (it's considered part of the UK), but I break it out in the model because there's a Welsh Wikipedia, so it makes lots of sense that folks might want to focus on Wales. Easy enough fix -- e.g., we just choose "WAL" -- but of course this might not match user expectations, and other codes like CXR for the Australian Indian Ocean Territories just aren't readily guessable unless you know them.
  • Use shortened country names -- e.g., wales for Wales, drc for Democratic Republic of Congo, etc. but aim for simplicity rather than adhering to a specific standard. This will take more work and I don't know if it'll be much better than country codes.
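A small sketch of what such a mapping config could look like for the first option (country codes plus hand-picked overrides). The lowercasing, the function name and the override table are assumptions for illustration. "wal" follows the "WAL" example above, and regions with neither a code nor an override fail loudly rather than getting a guessed keyword.

```python
# Hypothetical overrides for regions in the list that have no country code.
KEYWORD_OVERRIDES = {
    "Wales": "wal",
}


def keyword_for(canonical_name, country_code):
    """Return the search keyword value for one row of the region table.

    Assumption: keywords are lowercase, matching existing topics such
    as "central-africa".
    """
    if country_code:
        return country_code.lower()
    if canonical_name in KEYWORD_OVERRIDES:
        return KEYWORD_OVERRIDES[canonical_name]
    # No code and no override: someone has to pick a keyword by hand.
    raise KeyError(f"no keyword configured for {canonical_name!r}")


print(keyword_for("Argentina", "ARG"))
print(keyword_for("Wales", ""))
```

The second option (shortened names such as wales or drc) would use the same structure, with the override table covering every entry instead of only the ones without codes.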

Though it's not a specific priority of this sprint, I would like to highlight that switching in the links-based language-agnostic topic classification model for the current English-text-based topic classification model would be beneficial for greatly increasing topic coverage / reducing geographic bias in Spanish Wikipedia and beyond without requiring new keywords / changing any of the existing Growth infrastructure. I'll happily help coordinate this work in another ticket if ML Platform has bandwidth for it.

Region Labels

                                   Canonical Name       Sub-continent Region Country Code
0                  British Indian Ocean Territory             Eastern Africa          IOT
1                                         Burundi             Eastern Africa          BDI
2                                         Comoros             Eastern Africa          COM
3                                        Djibouti             Eastern Africa          DJI
4                                         Eritrea             Eastern Africa          ERI
5                                        Ethiopia             Eastern Africa          ETH
6                     French Southern Territories             Eastern Africa          ATF
7                                           Kenya             Eastern Africa          KEN
8                                      Madagascar             Eastern Africa          MDG
9                                          Malawi             Eastern Africa          MWI
10                                      Mauritius             Eastern Africa          MUS
11                                        Mayotte             Eastern Africa          MYT
12                                     Mozambique             Eastern Africa          MOZ
13                                        Réunion             Eastern Africa          REU
14                                         Rwanda             Eastern Africa          RWA
15                                     Seychelles             Eastern Africa          SYC
16                                        Somalia             Eastern Africa          SOM
17                                    South Sudan             Eastern Africa          SSD
18                                       Tanzania             Eastern Africa          TZA
19                                         Uganda             Eastern Africa          UGA
20                                         Zambia             Eastern Africa          ZMB
21                                       Zimbabwe             Eastern Africa          ZWE
22                                         Angola              Middle Africa          ANG
23                                       Cameroon              Middle Africa          CMR
24                       Central African Republic              Middle Africa          CAF
25                                           Chad              Middle Africa          TCD
26               Democratic Republic of the Congo              Middle Africa          COD
27                              Equatorial Guinea              Middle Africa          GNQ
28                                          Gabon              Middle Africa          GAB
29                          Republic of the Congo              Middle Africa          COG
30                          São Tomé and Príncipe              Middle Africa          STP
31                                        Algeria            Northern Africa          DZA
32                                          Egypt            Northern Africa          EGY
33                                          Libya            Northern Africa          LBY
34                                        Morocco            Northern Africa          MAR
35                                          Sudan            Northern Africa          SDN
36                                        Tunisia            Northern Africa          TUN
37                                 Western Sahara            Northern Africa          ESH
38                                       Botswana            Southern Africa          BWA
39                                       Eswatini            Southern Africa          SWZ
40                                        Lesotho            Southern Africa          LSO
41                                        Namibia            Southern Africa          NAM
42                                   South Africa            Southern Africa          ZAF
43                                          Benin             Western Africa          BEN
44                                   Burkina Faso             Western Africa          BFA
45                                     Cape Verde             Western Africa          CPV
46                                  Cote d'Ivoire             Western Africa          CIV
47                                          Ghana             Western Africa          GHA
48                                         Guinea             Western Africa          GIN
49                                  Guinea-Bissau             Western Africa          GNB
50                                        Liberia             Western Africa          LBR
51                                           Mali             Western Africa          MLI
52                                     Mauritania             Western Africa          MRT
53                                          Niger             Western Africa          NER
54                                        Nigeria             Western Africa          NGA
55   Saint Helena, Ascension and Tristan da Cunha             Western Africa          SHN
56                                        Senegal             Western Africa          SEN
57                                   Sierra Leone             Western Africa          SLE
58                                     The Gambia             Western Africa          GMB
59                                           Togo             Western Africa          TGO
60                                     Antarctica                 Antarctica          ATA
61                                  Bouvet Island                 Antarctica          BVT
62   South Georgia and the South Sandwich Islands                 Antarctica          SGS
63                                     Kazakhstan               Central Asia          KAZ
64                                     Kyrgyzstan               Central Asia          KGZ
65                                     Tajikistan               Central Asia          TJK
66                                   Turkmenistan               Central Asia          TKM
67                                     Uzbekistan               Central Asia          UZB
68                                          China               Eastern Asia          CHN
69                                      Hong Kong               Eastern Asia          HKG
70                                          Japan               Eastern Asia          JPN
71                                          Macau               Eastern Asia          MAC
72                                       Mongolia               Eastern Asia          MNG
73                                    North Korea               Eastern Asia          PRK
74                                    South Korea               Eastern Asia          KOR
75                                         Taiwan               Eastern Asia          TWN
76                                         Brunei         South-eastern Asia          BRN
77                                       Cambodia         South-eastern Asia          KHM
78                                      Indonesia         South-eastern Asia          IDN
79                                           Laos         South-eastern Asia          LAO
80                                       Malaysia         South-eastern Asia          MYS
81                                        Myanmar         South-eastern Asia          MMR
82                                    Philippines         South-eastern Asia          PHL
83                                      Singapore         South-eastern Asia          SGP
84                                       Thailand         South-eastern Asia          THA
85                                    Timor-Leste         South-eastern Asia          TLS
86                                        Vietnam         South-eastern Asia          VNM
87                                    Afghanistan              Southern Asia          AFG
88                                     Bangladesh              Southern Asia          BGD
89                                         Bhutan              Southern Asia          BTN
90                                          India              Southern Asia          IND
91                                           Iran              Southern Asia          IRN
92                                       Maldives              Southern Asia          MDV
93                                          Nepal              Southern Asia          NPL
94                                       Pakistan              Southern Asia          PAK
95                                      Sri Lanka              Southern Asia          LKA
96                                        Armenia               Western Asia          ARM
97                                     Azerbaijan               Western Asia          AZE
98                                        Bahrain               Western Asia          BHR
99                                         Cyprus               Western Asia          CYP
100                                       Georgia               Western Asia          GEO
101                                          Iraq               Western Asia          IRQ
102                                        Israel               Western Asia          ISR
103                                        Jordan               Western Asia          JOR
104                                        Kuwait               Western Asia          KWT
105                                       Lebanon               Western Asia          LBN
106                                          Oman               Western Asia          OMN
107                                     Palestine               Western Asia          PSE
108                                         Qatar               Western Asia          QAT
109                                  Saudi Arabia               Western Asia          SAU
110                                         Syria               Western Asia          SYR
111                                        Turkey               Western Asia          TUR
112                          United Arab Emirates               Western Asia          ARE
113                                         Yemen               Western Asia          YEM
114                                       Belarus             Eastern Europe          BLR
115                                      Bulgaria             Eastern Europe          BGR
116                                Czech Republic             Eastern Europe          CZE
117                                       Hungary             Eastern Europe          HUN
118                                       Moldova             Eastern Europe          MDA
119                                        Poland             Eastern Europe          POL
120                                       Romania             Eastern Europe          ROU
121                                        Russia             Eastern Europe          RUS
122                                      Slovakia             Eastern Europe          SVK
123                                       Ukraine             Eastern Europe          UKR
124                                 Åland Islands            Northern Europe          ALA
125                                       Denmark            Northern Europe          DNK
126                                       England            Northern Europe
127                                       Estonia            Northern Europe          EST
128                                 Faroe Islands            Northern Europe          FRO
129                                       Finland            Northern Europe          FIN
130                                      Guernsey            Northern Europe          GGY
131                                       Iceland            Northern Europe          ISL
132                                       Ireland            Northern Europe          IRL
133                                   Isle of Man            Northern Europe          IMN
134                                        Jersey            Northern Europe          JEY
135                                        Latvia            Northern Europe          LVA
136                                     Lithuania            Northern Europe          LTU
137                              Northern Ireland            Northern Europe
138                                        Norway            Northern Europe          NOR
139                                      Scotland            Northern Europe
140                                        Sweden            Northern Europe          SWE
141                                         Wales            Northern Europe
142                                United Kingdom            Northern Europe          GBR
143                                       Albania            Southern Europe          ALB
144                                       Andorra            Southern Europe          AND
145                        Bosnia and Herzegovina            Southern Europe          BIH
146                                       Croatia            Southern Europe          HRV
147                                     Gibraltar            Southern Europe          GIB
148                                        Greece            Southern Europe          GRC
149                                         Italy            Southern Europe          ITA
150                                        Kosovo            Southern Europe
151                                         Malta            Southern Europe          MLT
152                                    Montenegro            Southern Europe          MNE
153                               North Macedonia            Southern Europe          MKD
154                                      Portugal            Southern Europe          PRT
155                                    San Marino            Southern Europe          SMR
156                                        Serbia            Southern Europe          SRB
157                                      Slovenia            Southern Europe          SVN
158                                         Spain            Southern Europe          ESP
159                                  Vatican City            Southern Europe          VAT
160                                       Madeira            Southern Europe          
161                                        Azores            Southern Europe          
162                                       Austria             Western Europe          AUT
163                                       Belgium             Western Europe          BEL
164                                        France             Western Europe          FRA
165                                       Germany             Western Europe          DEU
166                                 Liechtenstein             Western Europe          LIE
167                                    Luxembourg             Western Europe          LUX
168                                        Monaco             Western Europe          MCO
169                                   Netherlands             Western Europe          NLD
170                                   Switzerland             Western Europe          CHE
171                                      Anguilla                  Caribbean          AIA
172                           Antigua and Barbuda                  Caribbean          ATG
173                                         Aruba                  Caribbean          ABW
174                                      Barbados                  Caribbean          BRB
175                                        Belize                  Caribbean          BLZ
176             Bonaire, Saint Eustatius and Saba                  Caribbean
177                                Cayman Islands                  Caribbean          CYM
178                                          Cuba                  Caribbean          CUB
179                                       Curaçao                  Caribbean
180                                      Dominica                  Caribbean          DMA
181                            Dominican Republic                  Caribbean          DOM
182                                       Grenada                  Caribbean          GRD
183                                    Guadeloupe                  Caribbean          GLP
184                                         Haiti                  Caribbean          HTI
185                                       Jamaica                  Caribbean          JAM
186                                    Martinique                  Caribbean          MTQ
187                                    Montserrat                  Caribbean          MSR
188                                   Puerto Rico                  Caribbean          PRI
189                              Saint Barthélemy                  Caribbean          BLM
190                         Saint Kitts and Nevis                  Caribbean          KNA
191                                   Saint Lucia                  Caribbean          LCA
192                                  Saint Martin                  Caribbean          MAF
193              Saint Vincent and the Grenadines                  Caribbean          VCT
194                                  Sint Maarten                  Caribbean
195                                   The Bahamas                  Caribbean          BHS
196                           Trinidad and Tobago                  Caribbean          TTO
197                      Turks and Caicos Islands                  Caribbean          TCA
198                  United States Virgin Islands                  Caribbean          VIR
199                        British Virgin Islands                  Caribbean          VGB
200                                    Costa Rica            Central America          CRI
201                                   El Salvador            Central America          SLV
202                                     Guatemala            Central America          GTM
203                                      Honduras            Central America          HND
204                                        Mexico            Central America          MEX
205                                     Nicaragua            Central America          NIC
206                                        Panama            Central America          PAN
207                                     Argentina              South America          ARG
208                                       Bolivia              South America          BOL
209                                        Brazil              South America          BRA
210                                         Chile              South America          CHL
211                                      Colombia              South America          COL
212                                       Ecuador              South America          ECU
213                              Falkland Islands              South America          FLK
214                                 French Guiana              South America          GUF
215                                        Guyana              South America          GUY
216                                      Paraguay              South America          PRY
217                                          Peru              South America          PER
218                                      Suriname              South America          SUR
219                                       Uruguay              South America          URY
220                                     Venezuela              South America          VEN
221                                       Bermuda           Northern America          BMU
222                                        Canada           Northern America          CAN
223                                     Greenland           Northern America          GRL
224                     Saint Pierre and Miquelon           Northern America          SPM
225                      United States of America           Northern America          USA
226                                     Australia  Australia and New Zealand          AUS
227           Australian Indian Ocean Territories  Australia and New Zealand          CXR
228                                   New Zealand  Australia and New Zealand          NZL
229                                Norfolk Island  Australia and New Zealand          NFK
230                                          Fiji                  Melanesia          FJI
231                                 New Caledonia                  Melanesia          NCL
232                              Papua New Guinea                  Melanesia          PNG
233                               Solomon Islands                  Melanesia          SLB
234                                       Vanuatu                  Melanesia          VUT
235                Federated States of Micronesia                 Micronesia          FSM
236                                          Guam                 Micronesia          GUM
237                                      Kiribati                 Micronesia          KIR
238                              Marshall Islands                 Micronesia          MHL
239                                         Nauru                 Micronesia          NRU
240                      Northern Mariana Islands                 Micronesia          MNP
241                                         Palau                 Micronesia          PLW
242                                American Samoa                  Polynesia          ASM
243                                  Cook Islands                  Polynesia          COK
244                              French Polynesia                  Polynesia          PYF
245                                          Niue                  Polynesia          NIU
246                              Pitcairn Islands                  Polynesia          PCN
247                                         Samoa                  Polynesia          WSM
248                                       Tokelau                  Polynesia          TKL
249                                         Tonga                  Polynesia          TON
250                                        Tuvalu                  Polynesia          TUV
251          United States Minor Outlying Islands                  Polynesia          UMI
252                             Wallis and Futuna                  Polynesia          WLF
kostajh raised the priority of this task from High to Needs Triage. Feb 14 2022, 8:50 PM

(Changing the priority since this task has multiple teams on it, and while researching this is a high-priority item for our team, we don't expect it to be for other teams.)

kostajh renamed this task from Include Argentina (and any other individual countries that are ready to include) in ORES topic modeling to Investigate what would be required to include countries in ORES articletopic modeling. Feb 14 2022, 8:53 PM
kostajh updated the task description.

We only need Argentina for the March 11 deadline, not all countries, right?

A one-off keyword will need to be implemented that queries the new tag. This is mostly boilerplate code to wire it into the parsing process, shouldn't be anything novel.

The simple but hacky solution would be to subclass ArticleTopicFeature, add Argentina to the topic list, special-case it so that the custom tag is used instead of classification.ores.articletopic, then use the CirrusSearchAddQueryFeatures hook to replace ArticleTopicFeature with the subclass. This could be done in GrowthExperiments and discarded after the end of the campaign.
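The special-casing described here boils down to choosing a tag prefix per topic. The real change would be PHP in a subclass of ArticleTopicFeature, so the following is only a language-neutral sketch of the dispatch logic, written in Python. `classification.ores.articletopic` is the existing prefix named above, while the custom prefix and the helper names are hypothetical.

```python
ORES_PREFIX = "classification.ores.articletopic"  # existing prefix, per this thread
CUSTOM_PREFIX = "example.oneoff.topic"  # hypothetical one-off prefix

# Topics that are served from the custom tag instead of ORES.
CUSTOM_TOPICS = {"argentina"}


def tag_prefix_for(topic):
    """Pick the weighted_tags prefix a given topic should be queried under."""
    return CUSTOM_PREFIX if topic in CUSTOM_TOPICS else ORES_PREFIX


def group_topics_by_prefix(topics):
    """Group requested topics by prefix so each prefix gets one sub-query."""
    grouped = {}
    for topic in topics:
        grouped.setdefault(tag_prefix_for(topic), []).append(topic)
    return grouped


print(group_topics_by_prefix(["argentina", "east-asia"]))
```

Discarding the hack after the campaign would then amount to emptying `CUSTOM_TOPICS` (or dropping the subclass) and removing the one-off tag data.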

kostajh renamed this task from Investigate what would be required to include countries in ORES articletopic modeling to Investigate what would be required to include countries in ORES and accessible via a search keyword. Feb 15 2022, 11:20 AM
kostajh updated the task description.
kostajh updated the task description.

We talked about this a bit within search platform today. We think one straightforward way to go about this, taking into consideration the time constraints, will be to choose a new unique weighted_tag prefix, load data into it, and query it from a custom keyword.

Thank you for these notes, @EBernhardson.

  • It doesn't seem likely that ORES would be updated to emit these predictions alongside its typical predictions within the time allotted while still leaving time to deploy and verify all the other steps.

Ack.

  • weighted_tags, the machinery that provides the articletopic keyword, is fairly easy to reuse for multiple purposes. For one-time loading of data the UpdateWeightedTags.php maintenance script in CirrusSearch can be used.

That is good to know, thanks.

  • A one-off keyword will need to be implemented that queries the new tag. This is mostly boilerplate code to wire it into the parsing process, shouldn't be anything novel.

Right.

The hard problems, mostly because there is no particular rule and we have to live with the results:

  • Choose an appropriate prefix for the one-off weighted_tag
  • Choose an appropriate search keyword (what is typed into the search box to make it happen)

If we went this route, we would likely revert our code in a couple of months (in favor of something more sustainable). Is there any teardown work that has to happen to remove the custom weighted tag / search keyword?

We only need Argentina for the March 11 deadline, not all countries, right?

That's right. But I made this task to have a discussion about what a longer-term, more general solution would involve. I've updated the task description to reflect that.

A one-off keyword will need to be implemented that queries the new tag. This is mostly boilerplate code to wire it into the parsing process, shouldn't be anything novel.

The simple but hacky solution would be to subclass ArticleTopicFeature, add Argentina to the topic list, special-case it so that the custom tag is used instead of classification.ores.articletopic, then use the CirrusSearchAddQueryFeatures hook to replace ArticleTopicFeature with the subclass. This could be done in GrowthExperiments and discarded after the end of the campaign.

That could work, yeah. I've added some comments in T301030#7710499 and updated the task description there, so we can keep this task about a more general, long term solution, and T301030 can be focused on a one-off approach for this specific GLAM campaign. Sorry for the confusion with the unclear tasks.

Since this is about a long-term solution, should we move it out of the Growth sprint board and table for later?

Since this is about a long-term solution, should we move it out of the Growth sprint board and table for later?

Yes, that makes sense.

@EBernhardson I want to revive this task: I have started to make progress on a model for assigning countries to articles (WE 2.1.1) and am in the early stages of asking that this model be hosted on LiftWing (T371897). I think it's a good point to restart the conversation about how to incorporate these predictions into the Search index and when that might be possible.

Some changes since 2022 when this task was created:

  • We are now looking at a long-term solution as opposed to supporting a one-off event.
  • The countries vocabulary is fixed (250 values): https://gitlab.wikimedia.org/repos/movement-insights/canonical-data/-/blob/main/country
  • I lean pretty strongly towards incorporating these new countries into the existing articletopic: tag. Even though I think the country and other topic predictions will be coming from separate models on LiftWing, they are functionally the same as the other articletopic keywords from the user-perspective and so I see no reason to maintain them under separate tags. If this causes technical headaches though, we can always adjust.

What I assume this would entail:

  • From the technical standpoint, I think the documentation on LiftWing -> Search Index is a little outdated but my understanding is that this would require two main things:
    • Updating the existing articletopic stream or creating a new stream to query LiftWing and bring the predictions to Search (I've flagged this to ML Platform too as I think usually they own the streams).
    • Updating the mapping between the more verbose country names that are output by the model -- e.g., American Samoa or possibly even something like Oceania.Polynesia.American Samoa -- and whatever we decide is the proper tag value on Search -- e.g., as or asm or american-samoa.
  • A one-time additional piece of work could also occur if we decide to do a full refresh of the articletopic Search index. The existing articletopic model has monthly bulk dumps of predictions prepared now (T351118#9983135) and I could easily prepare a dump of all the country predictions from this new model (example). The goal here would be two-fold:
    • Flush out any really old predictions. When we switched from the old ORES model to the new language-agnostic articletopic model in July 2023 (T328276), we didn't flush the index and have let edits slowly refresh it. That's generally fine but it would be nice to fully refresh the indices for articles that are edited very infrequently (and so still have the old ORES predictions in place).
    • And more importantly, fill up the index with the new country tags. This is important for usability (getting the results you expect), and I also intend for the model to actually make use of the country topics in the Search index for making predictions. Essentially, if an article X links to a bunch of other articles about Japan, for example, it's reasonable to assume that X is also about Japan. Because the Search index will have these predictions "cached", it's a useful place to gather info about all the links in an article (example).
  • Finally, while the goal of the new country model is to replace the old regional articletopic tags -- i.e. western-europe and everything else under Geography -- there will almost certainly be a period of time where our tooling (I think mainly Growth's Newcomer Tasks) is still using the old vocabulary. For that reason, I think we want to leave those tags in place until Growth can update their UIs and we feel comfortable deprecating them. Then we could either flush out the old regional tags or at least stop adding new ones.
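To make the mapping step above concrete, here is a minimal sketch of one way to derive a tag value such as american-samoa from a model label. It assumes labels are either a bare country name or a dot-separated hierarchy ending in the country name, and that no country name itself contains a dot; the function name `country_tag` is hypothetical and nothing here reflects a decided tag format.

```python
import re
import unicodedata


def country_tag(model_label: str) -> str:
    """Turn 'Oceania.Polynesia.American Samoa' into 'american-samoa'.

    Assumption: the country name is the last dot-separated component.
    """
    # Keep only the final component of a dotted hierarchy.
    name = model_label.rsplit(".", 1)[-1]
    # Decompose accented characters and drop the non-ASCII combining marks.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase, then collapse each run of non-alphanumerics into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
```

If ISO-style codes (as, asm) were chosen instead, a slug function would not be enough and an explicit lookup table built from the canonical country list would be needed.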

Streaming ingestion and query time concerns:

  • In terms of ingestion, we are currently defining a new generic method for shipping weighted tags (ML predictions, etc.) to the search engine over Kafka in T366253. This should simplify ongoing ingestion, but doesn't handle the bulk ingestion use case.
    • We will be migrating topic predictions to this new model, so they will all be the same.
  • Incorporating these new tags into the articletopic: keyword shouldn't be a problem. On our side each model is ingested under a separate tag prefix. At query time we can have the articletopic keyword query multiple prefixes and it should be invisible from the end-user perspective.
  • The exact form of the tags can vary a bit. What happens today is that there is a constant mapping in the PHP query-time search code from 'books' => 'Culture.Media.Books' and so on. This can be expanded to include whatever mapping is decided on for the region prediction model. This is currently a 1-to-1 mapping, but could potentially be changed if required.
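As an illustration of the query-time mapping described above, here is a rough Python sketch (the real code is PHP in CirrusSearch). Only the 'books' entry and the classification.ores.articletopic prefix come from this discussion; the country prefix, the american-samoa entry, the `prefix/tag` string form, and the `|` separator for multiple values are assumptions made for the example.

```python
# Prefix used by the existing topic model (mentioned earlier in this task).
TOPIC_PREFIX = "classification.ores.articletopic"
# Hypothetical prefix for the country model; the real name is undecided.
COUNTRY_PREFIX = "classification.example.articlecountry"

# Keyword typed by the user -> (tag prefix, tag stored in the index).
KEYWORD_TO_TAG = {
    "books": (TOPIC_PREFIX, "Culture.Media.Books"),
    "american-samoa": (COUNTRY_PREFIX, "American Samoa"),  # illustrative entry
}


def resolve(keyword_value: str) -> list[str]:
    """Resolve a value like 'books|american-samoa' into prefixed tags.

    Unknown keywords are skipped rather than raising an error.
    """
    tags = []
    for keyword in keyword_value.split("|"):
        entry = KEYWORD_TO_TAG.get(keyword.strip().lower())
        if entry is not None:
            prefix, tag = entry
            tags.append(f"{prefix}/{tag}")
    return tags
```

Because each entry carries its own prefix, the user-facing keyword stays the same no matter which model produced the tag.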

One-time bulk updates:

  • Since this is limited to articles and doesn't include the other namespaces, I'm not expecting the total volume of updates to be a significant concern. We might have to rate limit it in such a way that it takes a week or two to load all of them in though, depending on the number of pages to be updated.
  • In terms of flushing out old predictions, the most complete way to do this would be to define a new tag prefix for the topic models to ensure we only have new data on the defined prefix. We would want to do the streaming ingestion to both the old and the new prefix, then start a bulk update against the new tag prefix. Once the bulk update is complete we can switch the query-time code to use the new tag prefix and set up something to purge the old tag prefix out of the index over time.
  • We don't have a specific plan yet for how exactly the bulk update would work. The current idea would be to emit predictions from the bulk dump to eventgate with rate limiting and ingest them on the normal update pipeline. To reduce the total number of updates applied to the search engine we would want to produce the bulk updates for the same pages at approximately the same time for both the topic and region models. Plausibly this would be something like joining the two datasets in Spark, writing out a few hundred files, and then stepping through the files, emitting one at a time with some rate limit (hundreds of files allows it to not have to start over from the beginning if something goes wrong over a week or two of rate-limited event generation).
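The "step through a few hundred files with a rate limit" idea could look roughly like the sketch below. It is not a plan for the actual pipeline: the JSON-lines file format, the checkpoint file, and the `send` callback (standing in for whatever posts an event to eventgate) are all assumptions, and the rate limit is just a fixed pause after each event.

```python
import json
import time
from pathlib import Path
from typing import Callable, Iterable


def emit_bulk(
    files: Iterable[Path],
    checkpoint: Path,
    send: Callable[[dict], None],
    events_per_second: float,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Emit every event in `files`, skipping files already recorded as done."""
    # Names of files that were fully emitted by a previous run, if any.
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    for path in sorted(files):
        if path.name in done:
            continue  # finished earlier; do not resend
        with path.open() as handle:
            for line in handle:
                if not line.strip():
                    continue
                send(json.loads(line))
                # Crude rate limit: fixed pause after each event.
                sleep(1.0 / events_per_second)
        # Record progress only once the whole file has been emitted.
        done.add(path.name)
        checkpoint.write_text(json.dumps(sorted(done)))
```

If a run dies partway through a file, that file is re-emitted from its start on the next run, so this relies on updates being safe to apply twice. Smaller files mean less repeated work after a failure.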

Shifting queries from topic to region modeling:

  • This shouldn't be much of a problem and I think it can be done entirely in the PHP query-time code. We could initially issue a boolean query for either the region or the topic model in cases where both models are represented, and turn off the topic model queries at some point in the future.
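A rough sketch of that transitional boolean query, written as a Python dict in Elasticsearch query DSL form. It assumes the tags are matched with plain term clauses on a weighted_tags field; the function name, the flag, and the example tag strings are hypothetical, and the real query built by CirrusSearch may differ.

```python
def transitional_query(
    old_topic_tags: list[str],
    new_region_tags: list[str],
    include_old: bool = True,
) -> dict:
    """Match pages carrying any new region tag or, while enabled, any old topic tag."""
    clauses = [{"term": {"weighted_tags": tag}} for tag in new_region_tags]
    if include_old:
        # Transitional: also accept predictions from the older topic model.
        clauses += [{"term": {"weighted_tags": tag}} for tag in old_topic_tags]
    # At least one of the clauses must match.
    return {"bool": {"should": clauses, "minimum_should_match": 1}}
```

Turning off the topic model queries later would then amount to calling this with `include_old=False`, with no change to what the user types.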

@EBernhardson thanks for all these notes! Sounds like there are no major blockers then, and we can work out the specific details of tags/flushing/etc. when the model is up on LiftWing. Bulk update definitely falls in the nice-to-have category at the moment, so if that's not ready when the stream turns on, that's probably okay. But it should also be easy for me to prepare the files in Spark if we go that route.