Calculate ChatGPT plugin response topics: July 11-12
Closed, Resolved | Public

Assigned To

Authored By

	Iflorez
	Jul 25 2023, 11:49 PM

Description

Get the distribution (%) of response topics, based on the ORES topics of the returned articles, using a sample of logged plugin responses.
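
A minimal sketch of the percentage calculation described above, assuming the topics for each returned article have already been fetched. The function name, the input shape, and the sample labels are illustrative assumptions, not the code used for this task:

```python
from collections import Counter


def topic_distribution(article_topics):
    """Return {topic: percent of articles tagged with that topic}.

    article_topics: one list of topic labels per returned article
    (hypothetical input shape). An article can carry several topics,
    so the percentages may sum to more than 100.
    """
    total = len(article_topics)
    if total == 0:
        return {}
    counts = Counter()
    for topics in article_topics:
        counts.update(set(topics))  # count each topic once per article
    return {t: round(100.0 * n / total, 1) for t, n in counts.most_common()}


# Illustrative sample only, not data from the plugin logs.
sample = [
    ["Culture.Biography.Biography*", "Geography.Regions.Europe.Europe*"],
    ["Geography.Regions.Asia.Asia*"],
    ["Culture.Biography.Biography*"],
    ["STEM.STEM*"],
]
dist = topic_distribution(sample)
```

Counting each topic once per article keeps a single article from inflating a topic's share; whether to weight by article or by response is a design choice the task leaves open.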

examples of how to call the API: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/blob/master/topic-classification/querying_for_topics.ipynb

see: GitHub code search and GLOW topic data-handling and analysis

See slides

Related Objects

Mentioned In: T351244: Output ChatGPT plugin title topics: first week of September data
T345119: Fix API querying and related function bugs in ChatGPT plugin
Mentioned Here: T345119: Fix API querying and related function bugs in ChatGPT plugin

Event Timeline

Iflorez created this task. Jul 25 2023, 11:49 PM

Iflorez triaged this task as High priority. Jul 26 2023, 12:00 AM

Iflorez edited projects, added Product-Analytics (Kanban); removed Product-Analytics.

MPhamWMF subscribed. Jul 26 2023, 12:10 AM

Frostly renamed this task from Calculate Chat gpt plug-in response topics to Calculate ChatGPT plugin response topics. Jul 26 2023, 12:28 PM

Iflorez updated the task description. Jul 27 2023, 5:05 PM

@MPhamWMF are you interested in looking at particular topics?
If not, I propose:

Analysis of breakdown of topics in aggregate: high-level & drill down
Analysis of high-level first article topics only: high-level & drill down

High level topic example (this is more accurate):
Geography
Culture

Drill down topic example:
Geography[continent]
Culture.Biography
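
To illustrate the two levels proposed above, here is a small sketch that truncates a dotted topic label to a chosen depth. It assumes labels are dot-separated, as in "Culture.Biography"; the helper name and the sample labels are hypothetical:

```python
def truncate_topic(label, depth=1):
    """Keep the first `depth` dot-separated levels of a topic label.

    depth=1 gives the high-level topic; a larger depth gives a drill-down.
    A trailing '*' (assumed here to mark catch-all labels) is stripped first.
    """
    parts = label.rstrip("*").split(".")
    return ".".join(parts[:depth])


# Illustrative labels only.
labels = ["Culture.Biography.Biography*", "Geography.Regions.Asia.Asia*"]
high_level = sorted({truncate_topic(label, 1) for label in labels})
drill_down = sorted({truncate_topic(label, 2) for label in labels})
```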

@Iflorez, no particular topics of interest right now. I would mostly like to see a general overview, so what you propose sounds like a great start. It would also be nice to split on language if possible, though that may only make sense for languages with enough examples (like Japanese).

Iflorez updated the task description. Jul 28 2023, 6:23 PM

Iflorez moved this task from Doing_Future_Audiences to Radar on the User-Iflorez board. Jul 28 2023, 8:11 PM

Iflorez moved this task from Radar to Doing_Future_Audiences on the User-Iflorez board.

Update:
Preliminary data on topics available in this DRAFT deck.

FYI: Slack feedback

Requests from @MPhamWMF:

Can we add English to the language breakout on slide 6 as well? It’s clear that English queries must be highly skewed towards Culture, since that’s the highest-ranked topic overall, but the other major languages, except French, have Geography as the highest-ranked topic.
On slide 6, would it be possible to get graphs with the subtopics as well, e.g. Geography.Regions.Asia? I assume that Japan and China might be driving Asia-related queries, but that might not be the case.
I’d also be interested in running this over a day’s worth of regular on-wiki CirrusSearch results (maybe the top 4), to compare against a baseline.

Feedback from @Isaac:
main suggestion would be to drop the main topic slide as the higher confidence topics aren't necessarily the more salient topics so much as the easier-to-detect topics

See more of this conversation on Slack

TODOs:

make improvements to the base query per the search results validation work done by our plugin. These include:
page.ns == 0
page.missing is None
page.pageprops is None
page.fullurl is not None

add percentages to the bars
update the code to replace the manually built subtopic charts for the top three languages with a loop
compare this to non-ChatGPT-plugin searches for en only
rerun the code and output results for the last two weeks of July
rerun the code and output results for the first two weeks of August
- compare this to data from the logs for the same dates
review the base data and the chart on slide five.
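
The four page checks listed in the TODOs above could be applied as a simple filter. This is a sketch only: it assumes each page is a dict shaped like an entry in a MediaWiki API `query.pages` response, and the helper name and sample data are hypothetical:

```python
def is_valid_result(page):
    """Apply the four validation checks from the TODO list to one page dict."""
    return (
        page.get("ns") == 0                   # main (article) namespace only
        and page.get("missing") is None       # page exists
        and page.get("pageprops") is None     # no page props returned for the page
        and page.get("fullurl") is not None   # a canonical URL was returned
    )


# Illustrative pages only, not logged plugin data.
pages = [
    {"ns": 0, "title": "Tokyo", "fullurl": "https://en.wikipedia.org/wiki/Tokyo"},
    {"ns": 0, "title": "No such page", "missing": ""},
    {"ns": 14, "title": "Category:Japan",
     "fullurl": "https://en.wikipedia.org/wiki/Category:Japan"},
    {"ns": 0, "title": "Mercury", "pageprops": {"disambiguation": ""},
     "fullurl": "https://en.wikipedia.org/wiki/Mercury"},
]
valid_titles = [p["title"] for p in pages if is_valid_result(p)]
```

Using `.get(...) is None` mirrors the `is None` checks in the list: a key that is absent from the response counts as unset.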

Maryana added a project: Future-Audiences. Aug 9 2023, 9:45 PM

MPhamWMF moved this task from Backlog to In progress on the Future-Audiences board. Aug 11 2023, 2:38 PM

Side notes as I make improvements to the cirrussearch_request query:

API query could resolve for redirects
API query could resolve for missing pages - you could change the api request to use the search request as a generator and have the api tell you if the pages exist in a single request.

Explanation: The existence check happens in

CirrusSearch\Search\CirrusSearchResult::isMissingRevision

the actual filtering only happens in

MediaWiki\Search\SearchWidgets\FullSearchResultWidget

which is only used for the web ui.
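
A hedged sketch of the generator approach suggested above: the search runs as a generator, and `prop=info` reports page details (including whether a page is missing) in the same request. The parameter names are standard MediaWiki Action API parameters; the helper name, the default limit of 4, and the example search text are assumptions for illustration, and no request is actually sent here:

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_generator_params(search_text, limit=4):
    """Build query parameters that use list=search as a generator."""
    return {
        "action": "query",
        "format": "json",
        "generator": "search",     # feed search hits into the prop modules
        "gsrsearch": search_text,
        "gsrlimit": limit,
        "gsrnamespace": 0,         # article namespace only
        "prop": "info|pageprops",  # existence, URL, and page props per hit
        "inprop": "url",           # adds fullurl to the info output
        "redirects": 1,            # resolve redirects in generator output
    }


params = build_search_generator_params("Mount Fuji")
request_url = API_URL + "?" + urlencode(params)
```

The resulting URL can be passed to any HTTP client; the pages in the response can then be checked for `missing`, `pageprops`, and `fullurl` without a second request.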

EBernhardson subscribed. Aug 24 2023, 7:58 PM

Iflorez mentioned this in T345119: Fix API querying and related function bugs in ChatGPT plugin. Aug 29 2023, 12:11 AM

FYI: DRAFT analysis

MPhamWMF updated the task description. Sep 7 2023, 1:42 AM

MJL subscribed. Oct 3 2023, 5:53 PM

Rerunning without updating the code, given T345119:

for only 1 week → the first 7 days of September
update the code to remove the manual subtopic charts for the top three languages
compare this to non-ChatGPT-plugin searches for en and ja only

import wmfdata as wmf  # assumed import; the original snippet uses `wmf` without showing it

spark_session = wmf.spark.create_session(type='yarn-large')

web_query = '''
WITH search_results AS (
    select  search_id, database,
            hits.page_title,
            hits.page_id,
            element_at(params, 'action') AS params_action,
            source,    
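            -- Referer prefixes use ja hosts to match the jawiki filter in the WHERE clause;
            -- with the en hosts, every web request here would have been labelled 'other'
            CASE WHEN http.request_headers.referer LIKE 'https://ja.wikipedia.org%' THEN 'web'
                WHEN http.request_headers.referer LIKE 'https://ja.m.wikipedia.org%' THEN 'mobile_web'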
                WHEN http.request_headers.`user-agent` LIKE 'WikipediaApp%' THEN 'app'
                ELSE 'other' 
            END AS platform
    from event.mediawiki_cirrussearch_request 
    where year        == 2023                                     AND  
          month       == 9                                        AND
          day         IN (1,2,3,4,5,6,7)                                        AND
          --hour        == 1                                        AND 
          database    == 'jawiki'                                 AND
          hits IS NOT NULL                                        AND
          params IS NOT NULL                                      AND  
          element_at(params, 'action') IS NOT NULL                AND
          element_at(params, 'action') IN ('query', 'opensearch') AND
          elasticsearch_requests IS NOT NULL                      AND
          http.method == 'GET'                                    AND
          (http.request_headers.referer LIKE 'https://ja.m.wikipedia.org%' 
          OR http.request_headers.referer LIKE 'https://ja.wikipedia.org%'
          OR http.request_headers.`user-agent` LIKE 'WikipediaApp%')
    ),

    languages AS (
    SELECT language_code, database_code
    FROM canonical_data.wikis
    )

    SELECT /*+COALESCE(128) */
            search_id, language_code, page_title, page_id, params_action,
            source, platform
    FROM search_results
    LEFT JOIN languages
    ON database_code = database

 '''

spark_session.sql(web_query).write.saveAsTable('florez.search_results_for_topics_ja')

Iflorez renamed this task from Calculate ChatGPT plugin response topics to Calculate ChatGPT plugin response topics: July 11-12. Nov 14 2023, 5:29 PM

Iflorez mentioned this in T351244: Output ChatGPT plugin title topics: first week of September data. Nov 14 2023, 5:31 PM

Iflorez closed this task as Resolved. Feb 2 2024, 10:57 PM

Iflorez updated the task description.
