Page MenuHomePhabricator

Explore mapdata API log to estimate prevalence of maps and usage
Closed, ResolvedPublic0 Estimated Story Points

Description

API requests are logged on the analytics server, in the mediawiki_api_request table. Check what we can learn about maps from reading this table.

  • What's the time span of this data?
    • Roughly 90 days
  • Does it cover all wikis?
    • Yes.
  • How much load is on this endpoint?
    • A surprising amount—Nov 1 saw 3.5M requests to the endpoint. This should be the subject of a new investigation: T295602
  • How long does the request take to serve?
    • P50: 27ms, P70: 41ms, P95: 287ms
    • So, usually very fast.
  • Do we see errors?
    • Errors are visible, but the only time this happened was when I made test API calls with both revids and titles parameters
    • This probably means that we're not seeing the API errors for a missing group. Split into a new task: T295604
  • Do requests mostly come from browsers or from kartotherian?
    • Vast majority are kartotherian. There are some from another bot (TBD document this fact). Possibly none from browsers.
  • Any evidence of caching?
    • No API caching, but some kartotherian caching. However, kartotherian will request a page many times per day, for example this article has maybe 800 pageviews, but kartotherian made 8 mapdata requests.
    • This is still an unexpected result, and should be discussed in the above task (T295602). If images are cached for only a short time, it changes our varnish plans.

Event Timeline

Braindump some knowledge about that:

Typical superset-presto-hive query:

select
  count(*) as sample_size,
  count(api_error_codes) as error_count,
--  histogram(coalesce(api_error_codes)) as errors_by_code,
--  count_if(
--    element_at(http.request_headers, 'user-agent') = 'kartotherian (yurik @ wikimedia)'
--  ) as is_kartotherian
--  count_if(is_wmf_domain) as is_wmf_domain,
--  histogram(database) as database
--  histogram(element_at(params, 'titles')) as page_title,
--  approx_percentile(backend_time_ms, 0.50) as backend_time_ms_p50,
--  approx_percentile(backend_time_ms, 0.70) as backend_time_ms_p70,
--  approx_percentile(backend_time_ms, 0.95) as backend_time_ms_p95
-- TODO: params.mpdgroups underscore vs wikivoyage group names
from event.mediawiki_api_request
  where year=2021
  and month=11
  and day=1
--  and hour=1
  and element_at(params, 'action') = 'query'
  and element_at(params, 'prop') = 'mapdata'
--  and element_at(params, 'titles') = 'Verrazzano-Narrows Bridge'
;

Sample of distribution across wikis, from Nov 1, from 1-2am UTC (N=131997): P17738

This shows that roughly half of requests are for enwiki. 112 distinct wikis appear in the sample.

Trying to query the page titles but superset sqllab is crashing even on the simple and limited query:

with titles as (
select
  element_at(params, 'titles') as title
from event.mediawiki_api_request
  where year=2021
  and month=11
  and day=1
  and hour=1
  and element_at(params, 'action') = 'query'
  and element_at(params, 'prop') = 'mapdata'
  limit 100
)

SELECT 
  count(distinct title)
from titles;
awight moved this task from Doing to Sprint Backlog on the WMDE-TechWish-Sprint-2021-11-10 board.

Sample of distribution across wikis, from Nov 1, from 1-2am UTC (N=131997): P17738

@awight What is this showing? Is this already enough to point to wikis with frequent Kartographer views? For example ja.wiki seems to really stand out as a high number, though de wikivoyage is really low. So maybe I'm misinterpreting this or shouldn't be looking at this number at all? Sorry if these questions are premature.

Sample of distribution across wikis, from Nov 1, from 1-2am UTC (N=131997): P17738

@awight What is this showing? Is this already enough to point to wikis with frequent Kartographer views? For example ja.wiki seems to really stand out as a high number, though de wikivoyage is really low. So maybe I'm misinterpreting this or shouldn't be looking at this number at all? Sorry if these questions are premature.

I probably shouldn't have dumped the raw data here without more explanation! I would say there are some open questions about how we want to normalize the numbers, by pageview, by wiki article count, etc. This particular set of data is especially tricky because it's measuring backend API requests coming *from* Kartotherian, so it's only very loosely related to how many humans are requesting pages. For example, I looked at one particular page and there were 8 API requests, but roughly 800 pageviews in the same time span. The data so far only really shows us where maps are *not*, and gives us a surprising insight into how the caching works at the moment.

I'm slowly realizing that the only user-relevant information we can get from the API logs is to count unique titles where maps appeared. So we can say * certain wikis have no maps activity, and * for the wikis with activity, N titles of M titles viewed contained a map.

awight updated the task description. (Show Details)

Remaining questions are moved to T295675.

WMDE-Fisch changed the point value for this task from 5 to 0.
awight claimed this task.