
Calculate ChatGPT plugin response topics: July 11-12
Closed, Resolved, Public


Get the distribution (%) of response topics, based on the ORES topics of the returned articles, using a sample of logged plugin responses.
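
As a rough illustration of the calculation described above, the sketch below counts topic labels across all returned articles and converts the counts to percentages. The record layout (`articles`, `topics`), the function name, and the topic labels are assumptions made up for this example; they are not the actual log schema or the code used in the analysis.

```python
from collections import Counter


def topic_distribution(responses):
    """Return {topic: percent of all topic labels}, most common first.

    `responses` is a hypothetical list of logged plugin responses, each with
    an "articles" list whose items carry a "topics" list of topic labels.
    """
    counts = Counter(
        topic
        for response in responses
        for article in response["articles"]
        for topic in article["topics"]
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {topic: round(100 * n / total, 1) for topic, n in counts.most_common()}


# Made-up sample: two responses, three articles, four topic labels in total.
sample = [
    {"articles": [{"topics": ["Culture.Media"]},
                  {"topics": ["Geography.Regions.Asia"]}]},
    {"articles": [{"topics": ["Culture.Media", "STEM.Physics"]}]},
]
distribution = topic_distribution(sample)
```

Note that an article with several topic labels contributes once per label here; counting only the top label per article would be an equally plausible choice.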

examples of how to call the API:

See: GitHub code search and GLOW topic data-handling and analysis

See slides

Event Timeline

Iflorez edited projects, added Product-Analytics (Kanban); removed Product-Analytics.
Frostly renamed this task from Calculate Chat gpt plug-in response topics to Calculate ChatGPT plugin response topics. Jul 26 2023, 12:28 PM

@MPhamWMF are you interested in looking at particular topics?
If not, I propose:

  • Analysis of breakdown of topics in aggregate: high-level & drill down
  • Analysis of high-level first article topics only: high-level & drill down

High level topic example (this is more accurate):

Drill down topic example:

@Iflorez, no particular topics of interest right now. I would mostly like to see a general overview, so what you propose sounds like a great start. I would just add that it would be nice to also split on language if possible, though that may only make sense for languages with enough examples (like Japanese).

Preliminary data on topics available in this DRAFT deck.

Requests from @MPhamWMF:

  • Can we add English to the language break out on slide 6 as well? It’s clear that English queries must be highly skewed towards Culture, since that’s overall highest rank, but the other major languages other than French have Geography as the highest ranked topic
  • on slide 6, would it be possible to get graphs with the subtopics as well? i.e. Geography.Regions.Asia? I sorta assume that Japan and China might be driving Asia related queries, but that might not be the case
  • I’d also be interested in running this over a day’s worth of regular on wiki cirrus search results (maybe top 4) as well, to compare against a baseline
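
The language split and the high-level versus drill-down views requested above could be sketched roughly as follows. The `(language, topic)` row shape, the function names, and the sample labels are assumptions for illustration only; the sketch simply treats the part of a dotted topic label before the first dot as the high-level topic.

```python
from collections import Counter, defaultdict


def high_level(topic):
    """Return the part of a dotted topic label before the first dot."""
    return topic.split(".", 1)[0]


def topics_by_language(rows, drill_down=False):
    """Count topics per language.

    `rows` is a hypothetical iterable of (language_code, topic_label) pairs.
    With drill_down=False, labels are rolled up to their high-level topic.
    """
    per_language = defaultdict(Counter)
    for language, topic in rows:
        key = topic if drill_down else high_level(topic)
        per_language[language][key] += 1
    return per_language


# Made-up rows for illustration.
rows = [
    ("ja", "Geography.Regions.Asia"),
    ("ja", "Geography.Regions.Asia"),
    ("ja", "Culture.Media"),
    ("en", "Culture.Media"),
    ("en", "Culture.Sports"),
]
rolled_up = topics_by_language(rows)
detailed = topics_by_language(rows, drill_down=True)
```

Keeping the roll-up as a flag means the same rows can feed both the overview chart and the subtopic charts without a second pass over the logs.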

Feedback from @Isaac:
Main suggestion: drop the main topic slide, as the higher-confidence topics aren't necessarily the more salient topics so much as the easier-to-detect ones.

See more of this conversation on Slack


  • make improvements to the base query per the search results validation work done by our plugin. These include:
    • page.ns == 0
    • page.missing is None
    • page.pageprops is None
    • page.fullurl is not None
  • add percentages to the bars
  • update code to replace the manual sub-topic charts for the top three languages with a loop
  • compare this to non-ChatGPT-plugin searches for en only
  • rerun code and output results for the last two weeks of July
  • rerun code and output results for the first two weeks of August
    • compare this to data from the logs for the same dates
  • review base data and the chart on slide five.
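
The four validation conditions in the list above can be expressed as a single predicate. This is a minimal sketch that assumes each page is a plain dict shaped like an API response entry; the function name and sample pages are hypothetical, and the plugin's own implementation may differ.

```python
def is_valid_result(page):
    """Apply the four validation checks listed above to one page record.

    `page` is assumed to be a dict; absent keys are treated as None.
    """
    return (
        page.get("ns") == 0                     # main (article) namespace only
        and page.get("missing") is None         # page exists
        and page.get("pageprops") is None       # no page props attached
        and page.get("fullurl") is not None     # a URL was returned
    )


# Made-up page records for illustration.
pages = [
    {"ns": 0, "title": "Tokyo", "fullurl": "https://en.wikipedia.org/wiki/Tokyo"},
    {"ns": 0, "title": "Nonexistent page", "missing": ""},
    {"ns": 14, "title": "Category:Tokyo",
     "fullurl": "https://en.wikipedia.org/wiki/Category:Tokyo"},
    {"ns": 0, "title": "Mercury", "pageprops": {"disambiguation": ""},
     "fullurl": "https://en.wikipedia.org/wiki/Mercury"},
]
valid_titles = [page["title"] for page in pages if is_valid_result(page)]
```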

Side notes as I make improvements to the cirrussearch_request query:

  • API query could resolve for redirects
  • API query could resolve for missing pages - you could change the api request to use the search request as a generator and have the api tell you if the pages exist in a single request.
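
A minimal sketch of the generator idea from the side notes above: one MediaWiki Action API request that runs the search as a generator, resolves redirects, and returns page info and page props for each hit. The helper name, the default limit of 4, and the choice of `prop` values are assumptions for illustration; the sketch only builds the URL and sends no request.

```python
from urllib.parse import urlencode


def build_search_generator_url(query, lang="en", limit=4):
    """Build an Action API URL that uses search as a generator.

    Hypothetical helper: the parameter choices below are one plausible
    setup, not the query the plugin or the analysis actually uses.
    """
    params = {
        "action": "query",
        "format": "json",
        "generator": "search",     # run the search and feed hits to `prop`
        "gsrsearch": query,
        "gsrlimit": limit,
        "gsrnamespace": 0,         # article namespace only
        "prop": "info|pageprops",  # page info and page props per hit
        "inprop": "url",           # adds fullurl to each page
        "redirects": 1,            # resolve redirects in generator output
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


url = build_search_generator_url("Mount Fuji", lang="ja")
```

With this shape, the fields checked during validation (`ns`, `missing`, `pageprops`, `fullurl`) would arrive in the same response as the search hits.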

Explanation: The existence check happens in


the actual filtering only happens in


which is only used for the web ui.

Rerunning without updating the code, given T345119

  • for only 1 week → first 7 days of September
  • update code to remove the manual top three languages sub topic charts
  • compare this to non-ChatGPT-plugin searches for en and ja only
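
Replacing the manual per-language charts with a loop might look roughly like the sketch below, which picks the top N languages by volume and builds one percentage table per language. The row shape, function names, and sample data are assumptions for illustration; each resulting table could then be passed to whatever charting code is already in use.

```python
from collections import Counter


def top_languages(rows, n=3):
    """Return the n most frequent language codes in (language, topic) rows."""
    counts = Counter(language for language, _ in rows)
    return [language for language, _ in counts.most_common(n)]


def subtopic_tables(rows, n=3):
    """Build one {topic: percent} table per top-n language."""
    tables = {}
    for language in top_languages(rows, n):
        counts = Counter(topic for lang, topic in rows if lang == language)
        total = sum(counts.values())
        tables[language] = {
            topic: round(100 * c / total, 1) for topic, c in counts.most_common()
        }
    return tables


# Made-up rows for illustration: en x4, ja x3, fr x2, de x1.
rows = (
    [("en", "Culture.Media")] * 3 + [("en", "STEM.Physics")]
    + [("ja", "Geography.Regions.Asia")] * 2 + [("ja", "Culture.Media")]
    + [("fr", "Culture.Media")] * 2
    + [("de", "STEM.Physics")]
)
tables = subtopic_tables(rows)
```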

import wmfdata as wmf  # assumption: `wmf` refers to the wmfdata package

spark_session = wmf.spark.create_session(type='yarn-large')

# NOTE: the referer patterns in the LIKE '' clauses are empty in this excerpt
# and are left as-is; with empty patterns the 'web' and 'mobile_web' branches
# are indistinguishable.
web_query = '''
WITH search_results AS (
    SELECT  search_id, database,
            element_at(params, 'action') AS params_action,
            CASE WHEN http.request_headers.referer LIKE '' THEN 'web'
                WHEN http.request_headers.referer LIKE '' THEN 'mobile_web'
                WHEN http.request_headers.`user-agent` LIKE 'WikipediaApp%' THEN 'app'
                ELSE 'other'
            END AS platform
    FROM event.mediawiki_cirrussearch_request
    WHERE year        == 2023                                     AND
          month       == 9                                        AND
          day         IN (1,2,3,4,5,6,7)                          AND
          --hour        == 1                                        AND
          database    == 'jawiki'                                 AND
          hits IS NOT NULL                                        AND
          params IS NOT NULL                                      AND
          element_at(params, 'action') IS NOT NULL                AND
          element_at(params, 'action') IN ('query', 'opensearch') AND
          elasticsearch_requests IS NOT NULL                      AND
          http.method == 'GET'                                    AND
          (http.request_headers.referer LIKE ''
          OR http.request_headers.referer LIKE ''
          OR http.request_headers.`user-agent` LIKE 'WikipediaApp%')
),

languages AS (
    SELECT language_code, database_code
    FROM canonical_data.wikis
)

-- NOTE: page_title, page_id and source are not produced by search_results as
-- shown; the step that derives them appears to be missing from this excerpt.
SELECT /*+ COALESCE(128) */
        search_id, language_code, page_title, page_id, params_action,
        source, platform
FROM search_results
LEFT JOIN languages
ON database_code = database
'''

Iflorez renamed this task from Calculate ChatGPT plugin response topics to Calculate ChatGPT plugin response topics: July 11-12. Nov 14 2023, 5:29 PM
Iflorez updated the task description.