
Newcomer tasks: investigate discrepancy between topics in module and API
Closed, Resolved (Public)

Description

In T242400: Newcomer tasks: ambassadors test morelike, @Dyolf77_WMF noticed that he was seeing somewhat worse topic matches coming out of the suggested edits module in beta than he saw over the API in production.

During some tests of morelike on ar betawiki using the SE module, I found that the results are about 1 to 3 points lower than in the production tests made as explained in the task. For example, when selecting the topic arts, the SE module suggests only 2 to 5 relevant articles (tested many times). cc @MMiller_WMF

Probably some default search option gets set differently when you use CirrusSearch via the API vs. a direct PHP call. If you switch the gtdebug flag on in the API, it will return some score debug URLs; we can try looking at those, although the way those URLs are currently built for local search is a bit fragile.

This is important to look into so that we are offering strong topic matches during the period in which we're using morelike.

Event Timeline

@Tgr -- I made this separate task so that you and @Dyolf77_WMF can look into this. The ambassadors will also be testing the module in production this weekend in T243026, so we might hear more on this front.

Tgr added a comment. Jan 16 2020, 11:51 PM

Here are two example queries (topic=arts, 10 results): production, beta

Comparing those:

$ test_growthtasks() { curl -s "https://$1/w/api.php?action=query&format=json&generator=growthtasks&utf8=1&ggttopics=arts&ggtlimit=10&ggtdebug=1" | jq --raw-output '.query.pages[].title' | sort; }; 
$ test_growthtasks ar.wikipedia.org
ابن البيطار
ابن غطوس
الوطن العربي
تذهيب الكتب
جلال أمين صالح
خط الرقعة
خط كوفي
غزة
لغات أمازيغية
محمد حسني (خطاط)
$ test_growthtasks ar.wikipedia.beta.wmflabs.org
ابن البيطار
ابن غطوس
الوطن العربي
تذهيب الكتب
جلال أمين صالح
خط الرقعة
خط كوفي
غزة
لغات أمازيغية
محمد حسني (خطاط)
$ diff <(test_growthtasks ar.wikipedia.org) <(test_growthtasks ar.wikipedia.beta.wmflabs.org)

there's no difference.

I don't remember exactly when this was reported so maybe we changed something since then.

@Dyolf77_WMF do you have an example from the API sandbox where beta and production give different results for the same query?

@Tgr I used the SE module in beta to look at the suggestions, not the API, and found that the results were quite different from production (API). Today, when I went back to beta to check again, the SE module didn't show topic selection. So I tried the API in beta, and the results seem as good as from the production API.
Now, after testing in production as in T243026, I found the results are not as good as in the API test. For example, when selecting Arts, the list you found above is good (9/10), but the SE module shows weaker results (4/10).
In other words, the results from the API and from the SE module are different.

Tgr added a comment. Jan 19 2020, 2:36 AM

The module uses more random results (roughly identical to using the API with ggtlimit=250 instead of 10, i.e. top 250 results instead of top 10), to avoid all users who select the same topics getting the same tasks.
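The effect of that wider pool can be sketched in shell. This is only an illustration under stated assumptions, not the module's actual code: `sample_from_top` and `ranked_results` are hypothetical helpers, the numbers 1 to 1000 stand in for a relevance-ranked result list (1 = best match), and `shuf` is assumed to be available (GNU coreutils).

```shell
# Hypothetical helper: reads a relevance-ranked list (best first) on stdin,
# keeps the top $1 entries, and prints $2 of them chosen at random.
sample_from_top() {
  local pool_size="$1" sample_size="$2"
  head -n "$pool_size" | shuf -n "$sample_size"
}

# Stand-in for a ranked result list: rank 1 is the best topic match.
ranked_results() { seq 1 1000; }

# API-style query with a limit of 10: always the ten best-ranked results.
ranked_results | head -n 10

# Module-style selection: ten random picks from the top 250, so most picks
# rank well below 10 and the set changes from one call to the next.
ranked_results | sample_from_top 250 10
```

With ten picks spread over 250 ranks, most suggestions come from well outside the top 10, which would be consistent with the weaker topic matches reported for the module. Against the live API, a comparable check would presumably be to request ggtlimit=250 and pipe the titles through `shuf -n 10`.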

Thanks for the explanation @Tgr

The module uses more random results (roughly identical to using the API with ggtlimit=250 instead of 10, i.e. top 250 results instead of top 10), to avoid all users who select the same topics getting the same tasks.

I think this comment resolves the task, but moving over to QA for @Etonkovidova and @MMiller_WMF to look as well.

The module uses more random results (roughly identical to using the API with ggtlimit=250 instead of 10, i.e. top 250 results instead of top 10), to avoid all users who select the same topics getting the same tasks.

I re-checked with a couple of topics. Although evaluating relevance is of course subjective, when the number of articles in the returned set is limited, the suggested articles from betalabs and from production match entirely.

MMiller_WMF closed this task as Resolved. Mon, Feb 3, 11:47 PM
MMiller_WMF claimed this task.

Yes, the explanation that the module picks from the top 250 results does explain the discrepancy. Our work on this task is finished.