
Calculate ChatGPT plugin response topics: July 11-12
Closed, Resolved; Public

Description

Get the distribution (%) of response topics, based on the ORES topics of the returned articles, using a sample of logged plugin responses.

Examples of how to call the API: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/blob/master/topic-classification/querying_for_topics.ipynb

See: GitHub code search and GLOW topic data-handling and analysis

See slides
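
A minimal sketch of the percentage calculation described above, assuming each sampled response has already been reduced to the list of topic labels of its returned articles (for example via the topic API linked above). The function name `topic_distribution` and the sample labels are illustrative, not taken from the analysis code.

```python
from collections import Counter
from typing import Dict, Iterable, List


def topic_distribution(article_topics: Iterable[List[str]]) -> Dict[str, float]:
    """Return each topic's share (%) of all topic labels in the sample."""
    # Count every label once per article it was assigned to.
    counts = Counter(topic for topics in article_topics for topic in topics)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the result from most to least frequent topic.
    return {topic: 100.0 * n / total for topic, n in counts.most_common()}


# Hypothetical sample: one list of topic labels per returned article.
sample = [
    ["Culture.Biography.Biography*", "Geography.Regions.Europe.Europe*"],
    ["Geography.Regions.Asia.Asia*"],
    ["Culture.Biography.Biography*"],
]
dist = topic_distribution(sample)
```

Here the shares are computed over topic labels, so an article with several topics contributes to several buckets. Dividing by the number of articles instead would give "percent of articles with this topic", and those shares can sum to more than 100%.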

Event Timeline

Iflorez edited projects, added Product-Analytics (Kanban); removed Product-Analytics.
Frostly renamed this task from Calculate Chat gpt plug-in response topics to Calculate ChatGPT plugin response topics. Jul 26 2023, 12:28 PM

@MPhamWMF are you interested in looking at particular topics?
If not, I propose:

  • Analysis of breakdown of topics in aggregate: high-level & drill down
  • Analysis of high-level first article topics only: high-level & drill down

High level topic example (this is more accurate):
Geography
Culture

Drill down topic example:
Geography[continent]
Culture.Biography
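
A rough sketch of how drill-down labels could be rolled up to high-level topics, assuming the high-level topic is the first dot-separated segment of the label. The helper names `high_level_topic` and `roll_up` are hypothetical.

```python
from collections import Counter
from typing import Iterable


def high_level_topic(label: str) -> str:
    """Return the first dot-separated segment, e.g. 'Culture.Biography' -> 'Culture'."""
    return label.split(".", 1)[0]


def roll_up(labels: Iterable[str]) -> Counter:
    """Count drill-down labels by their high-level topic."""
    return Counter(high_level_topic(label) for label in labels)


rolled = roll_up(["Culture.Biography", "Culture.Media", "Geography.Regions"])
```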

@Iflorez, no particular topics of interest right now. I would mostly like to see a general overview, so what you propose sounds like a great start. It would also be nice to split on language if possible, though that may only make sense for languages with enough examples (like Japanese).

Update:
Preliminary data on topics available in this DRAFT deck.

Requests from @MPhamWMF:

  • Can we add English to the language breakout on slide 6 as well? It’s clear that English queries must be highly skewed towards Culture, since that’s the highest-ranked topic overall, but the major languages other than French have Geography as the highest-ranked topic.
  • On slide 6, would it be possible to get graphs with the subtopics as well, e.g. Geography.Regions.Asia? I sort of assume that Japan and China might be driving Asia-related queries, but that might not be the case.
  • I’d also be interested in running this over a day’s worth of regular on-wiki Cirrus search results (maybe the top 4) as well, to compare against a baseline.

Feedback from @Isaac:
The main suggestion would be to drop the main topic slide, as the higher-confidence topics aren't necessarily the more salient topics so much as the easier-to-detect topics.

See more of this conversation on Slack

TODOs:

  • make improvements to the base query per the search results validation work done by our plugin. These include:
    • page.ns == 0
    • page.missing is None
    • page.pageprops is None
    • page.fullurl is not None
  • add percentages to the bars
  • update code to remove the manual top-three-languages subtopic charts and set this up using a loop instead
  • compare this to non-ChatGPT-plugin searches for en only
  • rerun code and output results for the last two weeks of July
  • rerun code and output results for the first two weeks of August
    • compare this to data from the logs for the same dates
  • review base data and the chart on slide five.
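
The four validation conditions listed in the TODOs above could be combined into a single predicate along these lines. This is only a sketch: the `Page` dataclass is a hypothetical stand-in for whatever page object the plugin's validation code uses, and only the four attribute checks come from the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Hypothetical stand-in for one page entry of an API response."""
    ns: int
    fullurl: Optional[str] = None
    missing: Optional[str] = None    # set (even to "") when the page does not exist
    pageprops: Optional[dict] = None


def is_valid_result(page: Page) -> bool:
    """Apply the four checks from the TODO list."""
    return (
        page.ns == 0                     # main (article) namespace only
        and page.missing is None         # page exists
        and page.pageprops is None       # no page properties set
        and page.fullurl is not None     # a canonical URL was returned
    )


pages = [
    Page(ns=0, fullurl="https://en.wikipedia.org/wiki/Tokyo"),
    Page(ns=1, fullurl="https://en.wikipedia.org/wiki/Talk:Tokyo"),
    Page(ns=0, missing=""),
]
valid = [p for p in pages if is_valid_result(p)]
```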

Side notes as I make improvements to the cirrussearch_request query:

  • API query could resolve for redirects
  • API query could resolve for missing pages - you could change the api request to use the search request as a generator and have the api tell you if the pages exist in a single request.

Explanation: The existence check happens in

CirrusSearch\Search\CirrusSearchResult::isMissingRevision

the actual filtering only happens in

MediaWiki\Search\SearchWidgets\FullSearchResultWidget

which is only used for the web ui.
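
A sketch of the generator approach mentioned in the side notes, assuming the standard MediaWiki Action API parameters (`generator=search`, `gsrsearch`, `prop=info|pageprops`, `inprop=url`, `redirects`). It only builds the request URL. The function name, the default limit of 4, and the choice of properties are assumptions, not the query actually used.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_search_generator_url(query: str, lang: str = "en", limit: int = 4) -> str:
    """Build one API request that searches and returns page info for the hits."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "search",     # use the search results as the page set
        "gsrsearch": query,
        "gsrlimit": limit,
        "gsrnamespace": 0,         # main namespace only
        "prop": "info|pageprops",  # page info (incl. existence) and page props
        "inprop": "url",           # adds fullurl to each page
        "redirects": 1,            # resolve redirects in the generated pages
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)


url = build_search_generator_url("Mount Fuji", lang="ja")
parsed = parse_qs(urlparse(url).query)
```

The response pages could then be checked for the `missing` and `fullurl` fields instead of relying on the web UI's filtering.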

Rerunning without updating the code, given T345119:

  • for only 1 week → first 7 days of September
  • update code to remove the manual top three languages sub topic charts
  • compare this to non-ChatGPT-plugin searches for en and ja only

import wmfdata as wmf  # assumed alias for the wmfdata-python package

spark_session = wmf.spark.create_session(type='yarn-large')

web_query = '''
WITH search_results AS (
    select  search_id, database,
            hits.page_title,
            hits.page_id,
            element_at(params, 'action') AS params_action,
            source,    
            -- referer prefixes match the jawiki filter in the WHERE clause below
            CASE WHEN http.request_headers.referer LIKE 'https://ja.wikipedia.org%' THEN 'web'
                WHEN http.request_headers.referer LIKE 'https://ja.m.wikipedia.org%' THEN 'mobile_web'
                WHEN http.request_headers.`user-agent` LIKE 'WikipediaApp%' THEN 'app'
                ELSE 'other' 
            END AS platform
    from event.mediawiki_cirrussearch_request 
    where year        == 2023                                     AND  
          month       == 9                                        AND
          day         IN (1,2,3,4,5,6,7)                                        AND
          --hour        == 1                                        AND 
          database    == 'jawiki'                                 AND
          hits IS NOT NULL                                        AND
          params IS NOT NULL                                      AND  
          element_at(params, 'action') IS NOT NULL                AND
          element_at(params, 'action') IN ('query', 'opensearch') AND
          elasticsearch_requests IS NOT NULL                      AND
          http.method == 'GET'                                    AND
          (http.request_headers.referer LIKE 'https://ja.m.wikipedia.org%' 
          OR http.request_headers.referer LIKE 'https://ja.wikipedia.org%'
          OR http.request_headers.`user-agent` LIKE 'WikipediaApp%')
    ),

    languages AS (
    SELECT language_code, database_code
    FROM canonical_data.wikis
    )

    SELECT /*+COALESCE(128) */
            search_id, language_code, page_title, page_id, params_action,
            source, platform
    FROM search_results
    LEFT JOIN languages
    ON database_code = database

 '''

spark_session.sql(web_query).write.saveAsTable('florez.search_results_for_topics_ja')
Iflorez renamed this task from Calculate ChatGPT plugin response topics to Calculate ChatGPT plugin response topics: July 11-12. Nov 14 2023, 5:29 PM
Iflorez updated the task description.