Page MenuHomePhabricator

Investigate static map image request log for usable data
Closed, ResolvedPublic0 Estimated Story Points

Description

Since the API log is behind a layer of cache, it isn't giving us all of the information we were hoping for. The following questions would be better answered by checking the static image request log, which will show us every pageview on both dynamic and static map configurations:

All results are extremely rough estimates extrapolated from a sample hour or day around 2021-11-01. The hour had N=2.3M and the day had N=61M maps.wikimedia.org requests, I chose based on how much data each query could handle without timing out or running out of memory.

Must have

  • Can we estimate what percentage of pageviews include maps, and average number of maps per mappy/non-mappy pageview?
    • Table in T295675#7506319 . English Wikipedia has 0.05 maps per pageview, English Wikivoyage has > 1 map per pageview.
  • How many unique page titles are requested?
    • Can only be answered for maps with overlays.
    • I've included a synthetic number to give us a relative idea of the detected "title diversity", proportional to how many unique titles we see per thousand pageviews.
  • How do mappy page titles compare to unique page titles requested on each wiki?
    • Unique page titles viewed outside of maps are not a simple statistic to extract and the outcome is questionably useful. I'll decline this for now.

Nice to have

  • What proportion of map requests include a mapdata overlay?
    • 10% of requests have mapdata overlays.
    • by wiki: see table in T295675#7506319
  • What proportion of requests have a domain or referer (so can be traced to a wiki)?
    • 35% have no referer (due to secured browser settings)
    • 90% have no domain (so have no overlay)
    • 31% have neither referer nor domain
  • How many of the requests are from a known WMF wiki vs a third party?
    • Are we successfully serving those?
    • Declined, there's some kind of problem with external servers making a lot of failing requests to us but should be split into a new investigation.
  • What's the typical request volume?
    • > 700 requests per second.
  • What's the varnish cache hit rate for maps?
    • 75% hit rate
  • How performant is the endpoint?
    • The median is 0.3s, 95th percentile at 3s.
  • How prevalent are mobile app, desktop, and mobile web?
    • 75% desktop view, 25% mobile web view, 0.1% mobile app
  • How prevalent are errors?
    • > 4% of requests error with a 4xx status.

Event Timeline

First pass:

select
  count(*) as sample_size,
  histogram(cache_status),
  histogram(http_status),
  histogram(access_method),
  histogram(agent_type)
from wmf.webrequest
WHERE 
  uri_host = 'maps.wikimedia.org'
  and year = 2021
  and month = 11
  and day = 1
  and hour = 1;

During this hour, there were 2.3M static map image requests.
1.6M images were served from Varnish cache, and 600k images were generated by the backend.
Roughly 100k requests returned 4xx errors. 1.6M requests were served as status 200 (actually sent the data) and 600k requests as status 304 (client cache sufficed).
Almost all 2.3M requests were served to web clients, and only 3k were served to a mobile app.
2.2M requests were from users, and 100k were web spiders.

Performance stats:

select
  count(*) as sample_size,
  cache_status,
  approx_percentile(time_firstbyte, 0.50) as time_firstbyte_p50,
  approx_percentile(time_firstbyte, 0.70) as time_firstbyte_p70,
  approx_percentile(time_firstbyte, 0.95) as time_firstbyte_p95  
from wmf.webrequest
WHERE 
  uri_host = 'maps.wikimedia.org'
  and year = 2021
  and month = 11
  and day = 1
  and hour = 1
group BY 
  cache_status;

-> truncated for clarity:

95751	int-front	0.0001	0.0001	0.0002
1060387	hit-front	0.0002	0.0002	0.0003
542485	hit-local	0.0012	0.0039	1.9999
49256	pass		0.1681	0.2514	3.0624
551351	miss		0.3021	0.4873	3.1874

Summary: cache requests are usually served immediately. The median generated map takes 0.3s, and 95th percentile is about 3s. Cache hit rate is around 75%.

Unique titles requested for each wiki, over a random 24-hour period:

select
  count(*) as sample_size,
  url_extract_parameter(concat('http://bla.org/woo/', uri_query), 'domain') as domain,
  count(distinct url_extract_parameter(concat('http://bla.org/woo/', uri_query), 'title')) as unique_titles
from wmf.webrequest
WHERE 
  uri_host = 'maps.wikimedia.org'
  and year = 2021
  and month = 11
  and day = 1
--  and hour = 1
group BY 
  url_extract_parameter(concat('http://bla.org/woo/', uri_query), 'domain')
order by
  unique_titles desc
;

Unfortunately, the bulk of the requests don't have overlays so parsing the URL doesn't reveal the source wiki.

sample_size	domain	unique_titles
54891029		NULL
402631	uk.wikipedia.org	203431
2564183	en.wikipedia.org	147990
108525	commons.wikimedia.org	83413
966182	ja.wikipedia.org	46199
329747	ar.wikipedia.org	32347
325767	fr.wikipedia.org	31274
245731	zh.wikipedia.org	25638
69184	ca.wikipedia.org	25034
71995	no.wikipedia.org	19339
243425	pt.wikipedia.org	13688
12946	en.turkcewiki.org	12835
70037	tr.wikipedia.org	12786
53509	cs.wikipedia.org	10960
25160	en.wikivoyage.org	10810
146504	it.wikipedia.org	10443
38032	ko.wikipedia.org	9891
8691	zh.100ke.info	8440
50324	he.wikipedia.org	7546
26197	ro.wikipedia.org	6519
99338	ru.wikipedia.org	5311
13511	fi.wikipedia.org	4537
6568	eu.wikipedia.org	4393
19583	bn.wikipedia.org	4069
8238	arz.wikipedia.org	3893
8856	de.wikivoyage.org	3719
4296	nn.wikipedia.org	3334
24167	el.wikipedia.org	3191
30276	th.wikipedia.org	3157
13682	sk.wikipedia.org	3043
37551	fa.wikipedia.org	2898
2752	wiki.bairuo.top	2733
50152	vi.wikipedia.org	2665
2458	en.wikipediam.org	2443
2677	cy.wikipedia.org	2164
10542	bg.wikipedia.org	2076
2127	vec.wikipedia.org	1962
48714	es.wikipedia.org	1855
1828	en-wiki.kfd.me	1813
1850	tt.wikipedia.org	1514
1432	wiwiwiki.kfd.me	1426
1431	wiki.kfd.me	1326
6607	sv.wikipedia.org	1292
2631	fr.wikivoyage.org	1275
7328	tl.wikipedia.org	1241
1266	www.turkcewiki.org	1236
6983	hy.wikipedia.org	1133
1072	wikidaily.org	1070
2410	et.wikipedia.org	937
2664	ur.wikipedia.org	866
...

Digging into the huge number of requests with no domain parameter (in other words, not map image requests):

select
  count(*) as sample_size,
  regexp_like(uri_path, 'osm.*png$') as is_map
from wmf.webrequest
WHERE 
  uri_host = 'maps.wikimedia.org'
  and year = 2021
  and month = 11
  and day = 1
  and hour = 1
  and url_extract_parameter(concat('http://bla.org/woo/', uri_query), 'domain') is null
group BY 
  regexp_like(uri_path, 'osm.*png$')
;

1.8M are map image requests. 241k are other types of request: map marker images, wikidata geoshape queries, etc.

We can infer some information from the "referer" header, but unfortunately lose the ability to count unique titles, and the data is often missing. Here's what we get:

select
  count(*) as sample_size,
  regexp_replace(url_extract_host(referer), '\.m') as normalized_referrer_host
from wmf.webrequest
WHERE 
  uri_host = 'maps.wikimedia.org'
  and year = 2021
  and month = 11
  and day = 1
  and hour = 1
  and regexp_like(uri_path, 'osm.*png$') -- is map image
group BY 
  regexp_replace(url_extract_host(referer), '\.m')
order BY 
  sample_size desc
;
779252	NULL
574277	en.wikipedia.org
259972	ja.wikipedia.org
44298	pt.wikipedia.org
40580	geohack.toolforge.org
40246	maps.wikimedia.org
34267	ru.wikipedia.org
27192	zh.wikipedia.org
26201	fr.wikipedia.org
25347	www.relive.cc
21048	ar.wikipedia.org
20643	commons.wikimedia.org
18128	ko.wikipedia.org
18000	holoviews.org
17778	es.wikipedia.org
13291	www.eup.tw
11591	vi.wikipedia.org
7734	it.wikipedia.org
6288	en.wikivoyage.org
3796	fusohorariomundial.com.br
3755	www.jotur.com.br
3635	tr.wikipedia.org
3573	project-gc.com
3571	uk.wikipedia.org
3561	www.wikidata.org
3215	no.wikipedia.org
3070	wikimap.toolforge.org
2360	th.wikipedia.org
2237	maps.wikilovesmonuments.org
2107	ca.wikipedia.org
1951	localhost
1925	www.google.com
1907	m.wikidata.org
1869	m.taxity.pl
1547	wiwosm.toolforge.org
1483	tl.wikipedia.org
1096	cs.wikipedia.org
1064	sv.wikipedia.org
...

While we're here, let's see how many requests come from mobile web vs. desktop web:

select
  count(*) as sample_size,
  regexp_like(url_extract_host(referer), '\.m\.') as is_mobile
from wmf.webrequest
WHERE 
  uri_host = 'maps.wikimedia.org'
  and year = 2021
  and month = 11
  and day = 1
  and hour = 1
group BY 
  regexp_like(url_extract_host(referer), '\.m\.')
;

1.8M are from desktop web, 480k are from mobile web.

While we're here, let's see how many requests come from mobile web vs. desktop web:

select
  count(*) as sample_size,
  regexp_like(url_extract_host(referer), '\.m\.') as is_mobile
from wmf.webrequest
WHERE 
  uri_host = 'maps.wikimedia.org'
  and year = 2021
  and month = 11
  and day = 1
  and hour = 1
group BY 
  regexp_like(url_extract_host(referer), '\.m\.')
;

1.8M are from desktop web, 480k are from mobile web.

And what are the endpoints being called?

with maps_request as (
select
  regexp_extract(uri_path, '(geo.*|marker|leaflet|osm(?=.*png))', 1) as type
FROM wmf.webrequest
where
  year = 2021
  and month = 11
  and day = 1
  and hour = 1
  and uri_host = 'maps.wikimedia.org'
)

select
  count(*) as sample_size,
  type
from maps_request
group BY 
  type
order by
  sample_size desc
;
2058265	osm map tiles
87554	geoshape
63921	geoline
36513	static marker images
30222	other...
22755	leaflet static classfiles

Roughly 90% of requests are for map tiles, 7% are geoshapes and geolines, and the remainder are static resources.

Putting together some of the above to get a granular look,

with map_request as (
    select
        coalesce(
            url_extract_parameter(concat('http://bla.org/woo/', uri_query), 'domain'),
            nullif(regexp_replace(url_extract_host(referer), '\.m'), ''),
            'unknown'
        ) as normalized_host,
        regexp_like(url_extract_host(referer), '\.m\.') as is_mobile,
        url_extract_parameter(concat('http://bla.org/woo/', uri_query), 'title') as title,
        url_extract_parameter(concat('http://bla.org/woo/', uri_query), 'title') != '' as has_mapdata
    from wmf.webrequest
    where
        uri_host = 'maps.wikimedia.org'
        and regexp_like(uri_path, 'osm.*png$')
        and not regexp_like(http_status, '4..')
        and year = 2021
        and month = 11
        and day = 1
        and hour = 1
),

summary as (
    select
        normalized_host,
        count(*) as pageviews,
        count(distinct title) as detected_titles,
        count_if(is_mobile) as num_mobile,
        count_if(has_mapdata) as num_mapdata
    from map_request
    group by
        normalized_host
),

all_views as (
    select
        domain_name
        sum(view_count) as total_views
    from wmf.projectview_hourly
    join canonical_data.wikis
        on concat(wikis.language_code, '.', wikis.database_group) = project
    where
        year = 2021
        and month = 11
        and day = 1
        and hour = 1
    group by
        domain_name
)

select
    normalized_host,
    pageviews,
    round(cast(pageviews as double) / total_views, 4) as maps_per_pageview,
    round(100.0 * num_mobile / pageviews, 1) as percent_mobile,
    round(100.0 * num_mapdata / pageviews, 1) as percent_overlay,
    -- Rough factor suggesting how many unique titles are seen per pageview.
    round(1000.0 * detected_titles / pageviews) as title_diversity
from summary
left join all_views
    on domain_name = normalized_host
order by
    pageviews desc
limit 70
;
normalized_host	pageviews	maps_per_pageview	percent_mobile	percent_overlay	title_diversity
unknown	725386		0.0	0.0	0
en.wikipedia.org	587484	0.0459	46.5	19.8	66
ja.wikipedia.org	263447	0.1478	22.6	18.1	64
pt.wikipedia.org	45423	0.0777	71.6	28.5	108
geohack.toolforge.org	40580		0.0	0.0	0
maps.wikimedia.org	39980		0.0	0.0	0
ru.wikipedia.org	34578	0.0632	21.1	3.5	19
zh.wikipedia.org	30344	0.0266	35.9	41.7	203
fr.wikipedia.org	27928	0.0401	40.1	19.3	127
ar.wikipedia.org	23773	0.092	76.5	30.5	216
commons.wikimedia.org	21322		30.7	10.6	101
ko.wikipedia.org	18941	0.1045	23.1	13.4	99
es.wikipedia.org	17938	0.0118	68.6	10.6	31
vi.wikipedia.org	11797	0.0441	21.5	23.0	84
it.wikipedia.org	8431	0.023	66.3	21.8	162
uk.wikipedia.org	6819	0.092	25.0	59.7	573
en.wikivoyage.org	6555	1.0711	21.9	12.2	104
www.wikidata.org	4752		0.0	51.2	0
no.wikipedia.org	4518	0.1161	11.7	37.5	355
tr.wikipedia.org	4381	0.0449	45.7	29.1	252
ca.wikipedia.org	3816	0.1144	44.7	56.6	549
wikimap.toolforge.org	3070		0.0	0.0	0
th.wikipedia.org	2554	0.0294	56.4	49.8	284
maps.wikilovesmonuments.org	2237		0.0	0.0	0
localhost	1951		0.0	0.0	0
cs.wikipedia.org	1794	0.0273	27.1	50.3	469
m.wikidata.org	1672		0.0	0.0	0
tl.wikipedia.org	1574	0.1362	53.0	24.6	164
wiwosm.toolforge.org	1547		0.0	0.0	0
bn.wikipedia.org	1315	0.058	72.7	56.4	445
he.wikipedia.org	1168	0.0119	28.2	94.3	872
sv.wikipedia.org	1148	0.0139	7.0	10.2	82
ro.wikipedia.org	980	0.0312	47.0	50.3	489
fa.wikipedia.org	801	0.0075	80.4	39.3	330
nn.wikipedia.org	792	0.093	0.5	14.6	145
simple.wikipedia.org	761	0.0299	23.9	10.2	92
el.wikipedia.org	710	0.0189	63.9	50.0	451
ps.wikipedia.org	670	0.5142	95.8	4.2	42
eu.wikipedia.org	659	0.0554	41.3	59.0	577
de.wikivoyage.org	612	0.2399	19.3	53.9	529
fr.wikivoyage.org	523	0.5903	22.0	10.1	96
bg.wikipedia.org	510	0.0201	88.2	14.5	129
mr.wikipedia.org	492	0.0404	90.7	34.8	201
te.wikipedia.org	491	0.0474	90.8	14.3	118
ml.wikipedia.org	397	0.0333	97.5	14.1	93
arz.wikipedia.org	394	0.0228	15.0	99.5	962
fi.wikipedia.org	320	0.0066	10.9	100.0	953
hr.wikipedia.org	252	0.0127	89.3	19.8	151
ur.wikipedia.org	230	0.0182	63.0	50.0	474
ms.wikipedia.org	221	0.0043	55.2	75.1	534
hi.wikipedia.org	214	0.0035	89.3	42.5	215
sk.wikipedia.org	210	0.0036	58.6	51.9	505
ja.wikivoyage.org	188		0.0	13.3	128
hy.wikipedia.org	187	0.0123	13.9	98.9	781
nl.wikipedia.org	185	0.0014	97.8	3.2	32
vec.wikipedia.org	169	0.0663	0.6	97.6	970
he.wikivoyage.org	150	0.2846	8.0	74.7	720
zh.wikivoyage.org	132	0.0432	71.2	35.6	318
az.wikipedia.org	131	0.0046	54.2	61.8	565
lv.wikipedia.org	110	0.0113	60.0	40.9	400
be.wikipedia.org	96	0.0058	0.0	100.0	969
tt.wikipedia.org	94	0.0133	1.1	90.4	883
cy.wikipedia.org	85	0.0151	3.5	98.8	965
query.wikidata.org	68		0.0	0.0	0
es.wikivoyage.org	67	0.1276	56.7	23.9	194
et.wikipedia.org	67	0.0045	6.0	100.0	985
wikidata.beta.wmflabs.org	64		0.0	0.0	0
mk.wikipedia.org	60	0.0051	51.7	60.0	500
hi.wikivoyage.org	56	0.9825	96.4	10.7	71
zh-yue.wikipedia.org	56	0.0051	25.0	100.0	821
awight updated the task description. (Show Details)
awight set the point value for this task to 3.
awight set Final Story Points to 3.
awight moved this task from Doing to Tech Review on the WMDE-TechWish-Sprint-2021-11-10 board.

Major gap in what we've done so far: dynamic maps cause a large number of tile loads, which would have been picked up in the above metrics as "static map load with no overlays"

We differentiate the map views in these groups:

  1. dynamic map tile: https://maps.wikimedia.org/osm-intl/11/327/791.png
  2. static map without overlay: /img/osm-intl,6,53.383333,-1.466667,300x400.png?revid=2
  3. static map with overlay: img/osm-intl,6,53.383333,-1.466667,300x400.png?lang=en&domain=dev.wiki.local.wmftest.net&title=Kartographer&groups=_a2d0fad8e810596ffa8acec436f972b1f77b6521&revid=2
with map_request as (
    select
        coalesce(
            url_extract_parameter(concat('http://bla.org/woo/', uri_query), 'domain'),
            nullif(regexp_replace(url_extract_host(referer), '\.m'), ''),
            'unknown'
        ) as normalized_host,
        regexp_like(url_extract_host(referer), '\.m\.') as is_mobile,
        not regexp_like(uri_path, ',') as is_dynamic,
        regexp_like(uri_path, ',') and regexp_like(uri_query, 'title') as is_static_overlay,
        regexp_like(uri_path, ',') and not regexp_like(uri_query, 'title') as is_static_no_overlay,
        url_extract_parameter(concat('http://bla.org/woo/', uri_query), 'title') as title
    from wmf.webrequest
    where
        uri_host = 'maps.wikimedia.org'
        and regexp_like(uri_path, 'osm.*png$')
        and not regexp_like(http_status, '4..')
        and year = 2021
        and month = 11
        and day = 1
        and hour = 1
),

summary as (
    select
        normalized_host,
        count(*) as pageviews,
        count(distinct title) as detected_titles,
        count_if(is_mobile) as num_mobile,
        count_if(is_static_no_overlay) as num_static_no_overlay,
        count_if(is_dynamic) as num_dynamic,
        count_if(is_static_overlay) as num_static_overlay
    from map_request
    group by
        normalized_host
),

all_views as (
    select
        domain_name,
        sum(view_count) as total_views
    from wmf.projectview_hourly
    join canonical_data.wikis
        on concat(wikis.language_code, '.', wikis.database_group) = project
    where
        year = 2021
        and month = 11
        and day = 1
        and hour = 1
    group by
        domain_name
)

select
    normalized_host,
    total_views,
    pageviews,
    round(cast(pageviews as double) / total_views, 4) as maps_per_pageview,
    round(100.0 * num_mobile / pageviews, 1) as percent_mobile,
    round(100.0 * num_static_no_overlay / pageviews, 1) as percent_no_overlay,
    round(100.0 * num_dynamic / pageviews, 1) as percent_dynamic,
    round(100.0 * num_static_overlay / pageviews, 1) as percent_static_overlay,
    -- Rough factor suggesting how many unique titles are seen per pageview.
    round(1000.0 * detected_titles / pageviews) as title_diversity
from summary
left join all_views
    on domain_name = normalized_host
order by
    pageviews desc
limit 70
;
;

See results.

WMDE-Fisch changed the point value for this task from 3 to 0.
WMDE-Fisch removed Final Story Points.
awight claimed this task.