Investigate number of existing geoshape usages
Closed, Resolved | Public | 3 Estimated Story Points

Description

We want to find out how often we would need to run an additional SPARQL query if we added automated popups to all existing geoforms (geoshapes, geomasks and geolines). The goal is to see whether enabling this automatically for all of them might lead to performance issues.

Query:

from wmfdata import hive
from wmfdata.utils import num_str

geolines = hive.run("""
select COUNT(*), substr(uri_path, 1, 9)
  from wmf.webrequest
  where
    uri_host = 'maps.wikimedia.org'
    and webrequest_source = 'upload'
    and uri_path rlike '^/(geo.*|v4/marker)'
    and year = 2022
    and month = 7
    and day = 1
  group by substr(uri_path, 1, 9)
  ;""")

Outcome:

Data from 2022-07-01

geoshape   2648771
geoline    1558165
v4/marker  1794439

Event Timeline

Although this could be accomplished by a one-off query of the webrequest table, this task is also a great opportunity to add StatsD metrics showing the prevalence of geoform queries.

We decided against this because it would take some time until we would get data with this approach. We can still add the metrics in a separate ticket later (see T315972).

@dcausse @Gehel We are thinking about automatically fetching titles and descriptions for each geoform using SPARQL queries (T307707). To make sure we are not creating too much load, we looked at the current number of requests for geoforms (see ticket description). In your opinion, would these numbers allow adding SPARQL queries? We are not speaking of reliability here, just the general number of queries per day. Here is an example query:

SELECT ?id ?geo ?title ?description ?image ?article WHERE {
  VALUES ?id {
    wd:Q1431922
  }
  ?id wdt:P625 ?geo;
    schema:description ?description;
    rdfs:label ?title;
    wdt:P18 ?image .
  ?article schema:about ?id;
    schema:inLanguage "en" .
  FILTER (SUBSTR(str(?article), 1, 25) = "https://en.wikipedia.org/")
  FILTER(LANGMATCHES(LANG(?title), "EN"))
  FILTER(LANGMATCHES(LANG(?description), "EN"))
}
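
For illustration only, the example query above could be generated per geoform roughly as follows. This is a minimal sketch, not the code used in T307707: the function name `build_popup_query` and its parameters are hypothetical, and the language-code check is a simplification that does not cover every wiki language code. It shows that the prefix length passed to SUBSTR has to match the length of the wiki URL being compared.

```python
import re


def build_popup_query(item_id: str, lang: str = "en") -> str:
    """Build the popup SPARQL query for one Wikidata item (hypothetical helper)."""
    # Only accept plain item IDs such as "Q1431922" so nothing else can be
    # injected into the query text.
    if not re.fullmatch(r"Q[1-9]\d*", item_id):
        raise ValueError(f"not a Wikidata item ID: {item_id!r}")
    # Simplifying assumption: two- or three-letter lowercase language codes.
    if not re.fullmatch(r"[a-z]{2,3}", lang):
        raise ValueError(f"unsupported language code: {lang!r}")

    # The SUBSTR length must equal the length of the URL prefix it is
    # compared against (25 characters for "https://en.wikipedia.org/").
    site_prefix = f"https://{lang}.wikipedia.org/"

    return f"""SELECT ?id ?geo ?title ?description ?image ?article WHERE {{
  VALUES ?id {{
    wd:{item_id}
  }}
  ?id wdt:P625 ?geo;
    schema:description ?description;
    rdfs:label ?title;
    wdt:P18 ?image .
  ?article schema:about ?id;
    schema:inLanguage "{lang}" .
  FILTER (SUBSTR(str(?article), 1, {len(site_prefix)}) = "{site_prefix}")
  FILTER(LANGMATCHES(LANG(?title), "{lang}"))
  FILTER(LANGMATCHES(LANG(?description), "{lang}"))
}}"""


query = build_popup_query("Q1431922")
print(query)
```

Each call yields one query for one item, so in this sketch every geoform request that triggers a popup would add one request to the query service.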

@lilients_WMDE if I'm getting this correctly, this could mean that the public WDQS service could potentially get an additional ~6M queries per day. At a glance that sounds like a huge percentage of our traffic: we currently serve around 25M queries per day, so this would be roughly a 25% increase.
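
The estimate above can be reproduced from the counts in the ticket description. The sketch below assumes the worst case, where every request (markers included) triggers exactly one additional SPARQL query with no caching, and it takes the ~25M daily figure as a rounded value from the comment rather than a measured one.

```python
# Request counts for 2022-07-01, copied from the "Outcome" section of the ticket.
requests_per_day = {
    "geoshape": 2_648_771,
    "geoline": 1_558_165,
    "v4/marker": 1_794_439,
}

# Assumption: rounded daily query volume of the public query service,
# as quoted in the comment (~25M).
current_queries_per_day = 25_000_000

# Worst case: one extra SPARQL query per request, no caching.
additional_queries = sum(requests_per_day.values())
relative_increase = additional_queries / current_queries_per_day

print(f"additional queries per day: {additional_queries:,}")
print(f"relative increase: {relative_increase:.1%}")
```

Under these assumptions the total is 6,001,375 additional queries per day, or about 24% of current traffic, which matches the "~6M" and "roughly 25%" figures in the comment.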

Thanks for the quick response @dcausse - given that potential increase, we've decided not to go further with auto-generated titles and descriptions.