Account creation: getting articles about Argentina, Chile, and Mexico into the Suggested Edits module
Closed, Resolved (Public)

Description

We will need to set up the actual articles to make available under the special "Argentina" / "Chile" / "Mexico" topics for the suggested edits feed. There are a few possible ways we might do this:

  • Via a list of page IDs supplied by Research. Those then need to be wired up to the "GLAM Argentina / Chile / Mexico" topic CTA.
    • option 1: use pageids query with a static list of page IDs (T301030#7710499)
    • option 2: implement a new search keyword and a unique weighted tag prefix (T301671#7708303)
  • By adding the word "Argentina" to the search criteria
    • option 1: naively add the word "argentina" to the free text search. Probably won't work that well once random sorting is included.
    • option 2: use the static list of page titles to create a campaign page, and use a terms lookup query on that. See T301030#7709803.

The event organizers are going to tell us which methodology they prefer based on the coverage and accuracy of each. Once they give us feedback, we'll know how to proceed.

Notes https://docs.google.com/document/d/1AasF148HBJWJc7tUv54uCb10h0QVHz7KdAu9OOcTOhU/edit#heading=h.a2alt3bl27yw

Event Timeline

@Isaac is it possible to include Argentina in the list of articletopic topics https://www.mediawiki.org/wiki/ORES/Articletopic as is done for existing topics? That would be nice for us, because we wouldn't have to change any of our code, we'd just use a search query for articletopic:Argentina.

Can we think of a solution that would scale? I imagine some wikiprojects (or any other group of users) being happy creating their own set of topics. Could be, for instance for Wikipedia Asian Month or similar initiatives.

is it possible to include Argentina in the list of articletopic topics https://www.mediawiki.org/wiki/ORES/Articletopic as is done for existing topics? That would be nice for us, because we wouldn't have to change any of our code, we'd just use a search query for articletopic:Argentina.

@kostajh I was actually going to ask about that myself. I was curious about how the current articletopic keywords are populated and found this wikitech page on the data pipeline. I imagine it would not be desirable to set up the full EventBus etc. streaming pipeline at this moment but the good news is that all the country tags are already on HDFS so I'm hoping it would be relatively simple to write a DAG for uploading them in bulk into elasticsearch like this one. I'm happy to help with getting the right schema / data etc. on the HDFS side but I wouldn't be of much help for the orchestration / load-into-elastic side. I would request that if we do it, we try to do it completely though -- i.e. all wikis + all country tags, which ends up being 35,183,097 article-country pairs across 302 wikis and 253 unique country tags. The current iteration of the model is just some simple repackaging of Wikidata information so there's nothing fancy going on here beyond processing of the Wikidata dumps on HDFS on a monthly basis. I am in the process of extending it out so the data could potentially ~double in size at some point in the future if that goes well.

Can we think of a solution that would scale? I imagine some wikiprojects (or any other group of users) being happy creating their own set of topics.

I don't want to expand the scope of this ticket too much, but I am also interested in this broader discussion of the role of these tag pipelines. For example: how do they connect with ML infrastructure? Are they our best practice or just an existing solution? How much else can we add in this way?

To @Trizek-WMF's point, we could imagine connecting these tags to the PageAssessments extension data or the "on focus list of Wikimedia project" property in Wikidata for allowing more ad-hoc, community-generated WikiProject tags to be uploaded. From what I understand, having the full country tags in-place would already be a big step towards supporting more community initiatives but it'd be great to see broader support. I also would like us to add gender data extracted from Wikidata and have some other tags in mind too :)

kostajh triaged this task as Medium priority. Feb 7 2022, 6:56 PM
kostajh raised the priority of this task from Medium to High. Feb 7 2022, 6:58 PM
kostajh added a subscriber: MPhamWMF.

Hi Discovery-ARCHIVED and @MPhamWMF, is this something that your team could possibly help us with in the next few weeks? Unfortunately we don't have a lot of lead time. Ideally the data would be loaded into search by, let's say, March 1 at the latest, because the event we want to have this data for is happening on March 11.

This is an interesting discussion but yes, I think we should move it to a separate, new task, because there are a couple of different questions and stakeholders here.

Through adding the word "Argentina" to the search criteria.

As @Tgr mentioned, this may not be so useful because we randomize the search results (to avoid users seeing the same task at the same time). So, whoever assesses the results for adding "Argentina" to the search keyword field should make sure the ?sort=random query parameter is set, as it is in this URL:

https://es.wikipedia.org/w/index.php?sort=random&search=hasrecommendation%3Aimage+-hastemplatecollection%3Ainfobox+Argentina&title=Especial:Buscar&profile=advanced&fulltext=1&ns0=1

To assess the results, you'd want to look at the first page of results, then reload the page to get another randomized set, and so on.
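
The reload-and-inspect procedure can also be scripted against the search API. The following is a minimal sketch, assuming the API's `srsort=random` option behaves like `?sort=random` on Special:Search. The helper name `random_sample_url` is hypothetical, and the code only builds the request URL; it does not fetch anything.

```python
from urllib.parse import urlencode

API = "https://es.wikipedia.org/w/api.php"


def random_sample_url(query: str, limit: int = 20) -> str:
    """Build a search API URL that asks for one randomly sorted page of results."""
    params = {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": query,
        "srsort": "random",   # assumption: API counterpart of ?sort=random
        "srnamespace": 0,     # main namespace only, like ns0=1 in the URL above
        "srlimit": limit,
    }
    return f"{API}?{urlencode(params)}"


url = random_sample_url(
    "hasrecommendation:image -hastemplatecollection:infobox Argentina"
)
print(url)
```

Requesting this URL several times should give several independent samples to eyeball, which mirrors reloading the Special:Search page.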

A quick rough look at this: I'm seeing almost all noise, where there's a passing mention of Argentina in an article on a very broad topic. So it's probably not a great fit for this edit-a-thon unless we also have some way of controlling the "amount of evidence", i.e. applying a stronger filter for content relevant to the Argentina keyword prior to the random sort. I also tried using the more specific deep category search with Argentina as the base category, though I couldn't get that to work, I think because the category tree for that is too big. Still, the keyword approach might be useful for more specific topics like "skateboarding", which I think was an example raised by Marshall as a good fit for this sort of thing.

Instead of searching for a word, we could search for a category or a template, which would probably yield better results. We could also ask organizers to create a page with the list of articles (which is a common approach to campaigns anyway) and then use reverse link search to limit the search to those articles. IIRC links are already present in the search index; we would have to write a search keyword but that isn't too hard.

@EBernhardson is that feasible? I was thinking of a search keyword linkedfrom:Foo which would turn into a terms lookup query using the outgoing_link field of Foo as path.
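
For readers unfamiliar with terms lookup queries, here is a rough sketch of the Elasticsearch query body that a `linkedfrom:Foo` keyword might produce. The `index`/`id`/`path` layout is the standard terms lookup shape, but the index name, the document ID of the list page, and the field matched on the target articles (`match_field`) are all hypothetical placeholders; the thread does not specify them.

```python
import json


def linked_from_query(index: str, source_doc_id: str, match_field: str) -> dict:
    """Build a terms lookup query: match documents whose `match_field` value
    appears in the `outgoing_link` field of the document `source_doc_id`."""
    return {
        "terms": {
            match_field: {
                "index": index,           # hypothetical: index holding the list page
                "id": source_doc_id,      # hypothetical: document ID of page "Foo"
                "path": "outgoing_link",  # field named in the comment above
            }
        }
    }


query = linked_from_query("eswiki_content", "12345", "title_key")
print(json.dumps(query, indent=2))
```

The appeal of this form is that the list of link targets is read from the index at query time, so it does not count against the size of the query itself.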

The algorithm @Isaac worked on tags about 58,000 articles on eswiki as belonging to the topic of "Argentina". If we look at those page IDs and check which ones have image recommendations and don't have the "infobox" template collection, then we're left with ~120 articles. The organizers of the GLAM event have indicated that is enough for their events (cc @MMiller_WMF @GFontenelle_WMF to confirm that, please).

Given the low number of matches, one way we could implement this is to create a page like MediaWiki:GLAMArgentina.json on eswiki with the list of the page IDs. Clicking the "GLAM Argentina" topic button in the suggested edits module (T301028) would get translated into a search query for hasrecommendation:image -hastemplatecollection:infobox pageids:{list-of-page-ids-from-MediaWiki:GLAMArgentina.json}.

That way we don't have to involve the Search team in getting this ready in a pretty compressed timeline, nor do we need to write new search keywords, or adjust the search index in any way.

Using randomized search order to avoid task duplication should also work out with this approach.
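
As an illustration of the JSON-page idea, the query string could be assembled roughly as follows. This is only a sketch under stated assumptions: the page content is taken to be a flat JSON array of integers, the keyword is written `pageids` as in this comment, and the `|` separator between IDs is an assumption rather than confirmed syntax. `campaign_query` is a hypothetical helper name.

```python
import json

BASE_FILTER = "hasrecommendation:image -hastemplatecollection:infobox"


def campaign_query(config_json: str, keyword: str = "pageids") -> str:
    """Turn the JSON page content into a search query restricted to those pages."""
    page_ids = json.loads(config_json)
    if not all(isinstance(p, int) for p in page_ids):
        raise ValueError("expected a flat JSON array of integer page IDs")
    # De-duplicate and sort so the same list always yields the same query.
    id_list = "|".join(str(p) for p in sorted(set(page_ids)))
    return f"{BASE_FILTER} {keyword}:{id_list}"


print(campaign_query("[30, 10, 20, 10]"))
```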

@Tgr @mewoph @Sgs what do you think?

kostajh renamed this task from Account creation: GLAM event topic list to Account creation GLAM: getting articles about Argentina into the Suggested Edits module. Feb 15 2022, 11:15 AM
kostajh removed a project: Discovery-Search.
kostajh updated the task description.

Given the low number of matches, one way we could implement this is to create a page like MediaWiki:GLAMArgentina.json on eswiki with the list of the page IDs.

Or we could use a wikitext page with a list of links which is friendlier to organizers. It would be slightly more work as we couldn't reuse as much of the community configuration system (which assumes JSON pages) but only slightly.

I think my two preferred approaches are:

  • using a custom weighted tag:
    • get list of articles from Research or campaign organizers
    • run UpdateWeightedTags.php to add a classification.campaign.Argentina (or similar) tag to those articles
    • subclass ArticleTopicFeature with GrowthArticleTopicFeature, in the subclass handle an additional argentina value for articletopic: and map it to the custom tag name
    • add CirrusSearchAddQueryFeatures to HomepageHooks, use that hook to replace ArticleTopicFeature with GrowthArticleTopicFeature (this is the most hacky part as it relies on hooks running in a specific order but there are tricks to make that work)
    • add argentina to our topic list by hardcoding it into NewcomerTasksConfigurationLoader
    • at the end of the campaign run UpdateWeightedTags.php again to delete the tags
  • using a list:
    • ask Research or organizers to make an on-wiki linked list of the target articles
    • write some ConfigurationLoader-style code for loading the list page, parsing it, getting the list of linked titles from the ParserOutput, turning titles to page ids, caching the whole thing with invalidation on edit
    • create a PageListTopic subclass of Topic
    • make NewcomerTasksConfigurationLoader return a PageListTopic using the page above
    • have SearchStrategy resolve that topic into a pageid: query. The boolean subquery limit is 1000 and we'll need a hundred or so for other purposes, plus the infobox filter, so this approach could probably handle a list of up to 500 articles.
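
The size estimate for the list approach is a clause budget. A small sketch of that arithmetic follows; the limit of 1000 and the reserve of about 100 come from the comment above, but the number of clauses the infobox filter expands to is not stated in this thread, so it is left as a required parameter. The value 400 in the usage line is purely illustrative (it is the value that makes the budget come out at 500).

```python
def page_list_budget(limit: int, reserved: int, infobox_clauses: int) -> int:
    """Clauses left for page IDs after other parts of the query are accounted for."""
    return max(limit - reserved - infobox_clauses, 0)


def fits(page_ids, budget: int) -> bool:
    """True if the de-duplicated list can be sent as a single query."""
    return len(set(page_ids)) <= budget


budget = page_list_budget(limit=1000, reserved=100, infobox_clauses=400)  # 400 is hypothetical
print(budget, fits(range(120), budget))
```

With an estimated ~120 matching articles per campaign, the list would sit well inside such a budget.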

The first is significantly less work. The second is less hacky and could be generalized into something that campaign organizers can use without us having to run maintenance scripts every time, plus can handle last-minute changes to the list (without us having to run maintenance scripts).

MMiller_WMF renamed this task from Account creation GLAM: getting articles about Argentina into the Suggested Edits module to Account creation: getting articles about Argentina into the Suggested Edits module. Feb 15 2022, 6:03 PM

I don't think it's a blocker, but just so we don't forget: we don't actually know the list of page IDs that match all three criteria yet (Argentina + imagerec + noinfobox). Right now we have a list of 58,000 page IDs that match Argentina, and we know there are 40,000 articles in the search index that have imagerecs/noinfobox, but we don't know their intersection because the Search API limits us to seeing just the first 10,000 results. The 120 number was an estimate. I assume someone can dump this out of the search index, but that would need to be done for the "using a list" option that @Tgr detailed, and we probably want to do it soonish in case our estimate turns out to be wrong.

This is the Special:Search query: https://es.wikipedia.org/w/index.php?search=hasrecommendation%3Aimage+-hastemplatecollection%3Ainfobox&title=Especial:Buscar&profile=advanced&fulltext=1&ns0=1&ns100=1&ns104=1
And the same thing, I think, but via the MediaWiki API: https://es.wikipedia.org/w/api.php?action=query&format=json&list=search&srlimit=500&sroffset=0&srsearch=hasrecommendation%3Aimage+-hastemplatecollection%3Ainfobox
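
Once both full lists exist (for example, the Research page ID file plus a dump of the imagerec/noinfobox matches from the search index), the intersection itself is a plain set operation. A minimal sketch, assuming both inputs are text files with one page ID per line; the function names are hypothetical.

```python
from typing import Iterable, List, Set


def parse_page_ids(lines: Iterable[str]) -> Set[int]:
    """Read one page ID per line, ignoring blank lines and surrounding whitespace."""
    stripped = (line.strip() for line in lines)
    return {int(line) for line in stripped if line}


def candidate_pages(topic_ids: Set[int], task_ids: Set[int]) -> List[int]:
    """Pages that are both on the topic list and eligible for the task type."""
    return sorted(topic_ids & task_ids)


topic = parse_page_ids(["1\n", "2\n", "\n", "3"])
tasks = parse_page_ids(["3", "1", "9"])
print(candidate_pages(topic, tasks))
```

The length of the result is what would confirm or correct the ~120 estimate.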

Thanks for thinking that through @Isaac.

I think I am convinced to go with the custom weighted tag as described by @Tgr in T301030#7711615 (and similarly described by Erik from Search in T301671#7708303). Unless there are any objections, let's start with that and see how it looks in practice?

Here is the file with the page IDs:

@EBernhardson is that feasible? I was thinking of a search keyword linkedfrom:Foo which would turn into a terms lookup query using the outgoing_link field of Foo as path.

I don't see any problems there. We do the same query internally to count the number of links pointing at a particular page and it runs all the time.

Change 763109 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] Add custom ORES topic clone for Argentina campaign

https://gerrit.wikimedia.org/r/763109

I've +2'ed this patch; the next step would be to run updateWeightedTags after the patch is in production. The earliest time to do that would be on Thursday after group2 deployment or Friday next week (24/25 Feb). We might want to consider backporting the patches so that we can confirm the functionality works as intended in production earlier than that.

Change 763109 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Add custom ORES topic clone for Argentina campaign

https://gerrit.wikimedia.org/r/763109

Reposting my comment on T301028#7714619 – would it be feasible and desirable to provide both options listed in this task as separate campaign filters? The more specific list would be the default selected one, but the additional Argentina keyword-generated list could be added as a second chip in the "Campaign topics" category of filters. This could be valuable if folks at the events blitz through the more relevant but smaller list.

Hi @RHo, to confirm for the March campaign, is it the case that only "GLAM Argentina" would be shown and users can’t pick their campaign (only whether to filter based on the campaign group they’re in)?

Screen Shot 2022-02-15 at 1.48.10 PM.png (118×522 px, 28 KB)

Hi @mewoph - my understanding is that participants in the March event(s) will have this filter pre-selected when they create an account, but that they can go ahead and change the filters as normal with any other newcomer. @MMiller_WMF and @kostajh please correct if this is not the case.

One consideration is that this would offer flexibility to have multiple campaign topic filters for the one event, with say GLAM Argentina being the default selected topic, along with additional wider/different filtered sets of topics available for selection – eg., another broader keyword filter could be called something like Argentina, and another one for GLAM LatAm.

kostajh renamed this task from Account creation: getting articles about Argentina into the Suggested Edits module to Account creation: getting articles about Argentina, Chile, and Mexico into the Suggested Edits module. Feb 16 2022, 1:36 PM
kostajh updated the task description.

@Tgr, sorry I missed in the specifications that we will also be tagging articles for Chile and Mexico.

sorry I missed in the specifications that we will also be tagging articles for Chile and Mexico.

@kostajh will this require additional documents with the page IDs for these countries (akin to F34952679)? If so, I'll extract them and upload here. If these events are later and you'd like me to wait to generate the page lists though, that's also fine.

Yes, we will need page IDs for articles that the algorithm says are about Chile and Mexico. Please add them here once you have that. The events are coming up in early April, so maybe we would want to re-run and re-tag closer to the event date, but having the data now would be good for loading into the search index.

Change 763887 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/CirrusSearch@master] UpdateWeightedTags: Add batch mode

https://gerrit.wikimedia.org/r/763887

Change 763930 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/CirrusSearch@master] DataSender::sendUpdateWeightedTags(): Allow omitting tags

https://gerrit.wikimedia.org/r/763930

Change 763931 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/CirrusSearch@master] Support batch updates of weighted tags

https://gerrit.wikimedia.org/r/763931

Sounds good. See below:

Chile (28,994 pageids):

Mexico (43,507 pageids):

And for completeness, relinking to Argentina (57,686 pageids):

Change 763887 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] UpdateWeightedTags: Add batch mode

https://gerrit.wikimedia.org/r/763887

Change 763930 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] DataSender::sendUpdateWeightedTags(): Allow omitting tags

https://gerrit.wikimedia.org/r/763930

Commands for uploading the data to the search index:

mwscript extensions/CirrusSearch/maintenance/UpdateWeightedTags.php eswiki --pageid-list chile_eswiki_pageids.txt --tagType classification.oneoff.T301028 --tagName chile --verbose | tee T301028-chile.log
mwscript extensions/CirrusSearch/maintenance/UpdateWeightedTags.php eswiki --pageid-list mexico_eswiki_pageids.txt --tagType classification.oneoff.T301028 --tagName mexico --verbose | tee T301028-mexico.log
mwscript extensions/CirrusSearch/maintenance/UpdateWeightedTags.php eswiki --pageid-list argentina_eswiki_pageids.txt --tagType classification.oneoff.T301028 --tagName argentina --verbose | tee T301028-argentina.log

using a manual copy of the b387a91 version of UpdateWeightedTags.php to spare a backport.
(This will probably take a long time. It does not use the DB and can be killed safely if it causes problems: it logs progress, so it can be resumed after deleting the already-processed IDs from the txt file.)
After the GLAM events are over, cleanup will be via

mwscript extensions/CirrusSearch/maintenance/UpdateWeightedTags.php eswiki --pageid-list chile_eswiki_pageids.txt --tagType classification.oneoff.T301028 --reset --verbose | tee T301028-chile.log
mwscript extensions/CirrusSearch/maintenance/UpdateWeightedTags.php eswiki --pageid-list mexico_eswiki_pageids.txt --tagType classification.oneoff.T301028 --reset --verbose | tee T301028-mexico.log
mwscript extensions/CirrusSearch/maintenance/UpdateWeightedTags.php eswiki --pageid-list argentina_eswiki_pageids.txt --tagType classification.oneoff.T301028 --reset --verbose | tee T301028-argentina.log
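
For the resume note above, trimming the ID file could look roughly like this. It is a sketch under two assumptions that the thread does not confirm: that the script works through the file in order, and that the last processed page ID can be read off the log. `remaining_ids` is a hypothetical helper, not part of CirrusSearch.

```python
from typing import List


def remaining_ids(all_ids: List[int], last_done: int) -> List[int]:
    """Return the IDs that come after `last_done` in file order.

    Raises ValueError if `last_done` is not in the list, so a typo in the
    ID copied from the log does not silently restart from the beginning.
    """
    position = all_ids.index(last_done)
    return all_ids[position + 1:]


ids = [5, 3, 9, 7]
print(remaining_ids(ids, 3))
```

The returned list would then be written back to the txt file, one ID per line, before rerunning the same mwscript command.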

Change 765362 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] GrowthArticleTopicFeature: Accept any topic

https://gerrit.wikimedia.org/r/765362

Mentioned in SAL (#wikimedia-operations) [2022-02-24T06:36:09Z] <tgr_> T301030#7734236 running UpdateWeightedTags.php on eswiki

The script is done - took much less time than I expected, and the jobqueue stats didn't budge. In hindsight not really surprising, it inserted 100K jobs over two hours or so, and the baseline Cirrus index write job insertion rate is 5K/min.
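
The rough arithmetic behind that observation, using only the rounded figures quoted above (100K jobs, about two hours, 5K/min baseline):

```python
from typing import Tuple


def added_load(jobs: int, minutes: float, baseline_per_min: float) -> Tuple[float, float]:
    """Return (extra jobs per minute, that rate as a fraction of the baseline)."""
    rate = jobs / minutes
    return rate, rate / baseline_per_min


rate, share = added_load(jobs=100_000, minutes=120, baseline_per_min=5_000)
print(f"{rate:.0f} jobs/min, {share:.0%} of baseline")
```

That works out to roughly 830 extra jobs per minute, about a sixth of the normal insertion rate, which fits with the job queue stats barely moving.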

Moving to code review for now. One patch is still in progress but that one is not important.

Change 765362 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] GrowthArticleTopicFeature: Accept any topic

https://gerrit.wikimedia.org/r/765362

I will follow up on the pending patch later, but for the purposes of the upcoming campaign, this is QA-able:

(but will only work after the next train lands)