
Investigation: Which data source to use for GeoPoints
Closed, Resolved | Public | 5 Estimated Story Points

Description

Following up T300042 and based on the POC developed there, we should add a caching mechanism for the query results.

We could use the same approach as Geoshapes, or something else if there is a better option.
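To make the idea concrete, here is a minimal sketch of a time-limited cache for query results. It is not how Geoshapes or Kartotherian caches anything; the class name `QueryResultCache`, the `fetch` callback, and the 24-hour default TTL are all assumptions chosen for illustration.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


class QueryResultCache:
    """Hypothetical in-memory cache: one entry per distinct query text."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 24 * 60 * 60,  # assumed TTL, not a measured value
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch  # runs the query against the backend
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Only leading/trailing whitespace is ignored; any other difference
        # in the query text produces a different cache key.
        return hashlib.sha256(query.strip().encode("utf-8")).hexdigest()

    def get(self, query: str) -> str:
        key = self._key(query)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: the backend is not queried again
        result = self._fetch(query)
        self._entries[key] = (now, result)
        return result
```

A production version would also need size limits and a shared store; this sketch only shows the lookup, expiry, and refill logic.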

Open questions

  • How does SPARQL endpoint caching work? If the same query is made multiple times, is it executed again or not?
    • TBD: Ongoing request to the search team
  • How does geosearch caching work? (invalid for our use case here)
  • Is there caching for the geoshapes endpoint?
    • Looks like there is front caching of shape results for ~24 hours.
    • Shape data is retrieved directly from a local DB.
  • If a geoshapes query returns a wikidata item with no geoshape (maybe among a list of other items which do have shapes), is there a visible error? How is this handled?
  • Research existing traffic: What proportion of SPARQL requests are coming from the geoshapes service?
    • Seems super low, a few requests per million SPARQL requests.
    • What proportion of these queries return successfully?
    • How long do geoshapes queries take when SPARQL is being used (in contrast to QID input)?
      • Might be irrelevant given the low number of requests.
    • Could/Should we improve the current approach to reduce traffic?
      • Might be irrelevant given the low number of requests.
  • How many existing maps are there already using SPARQL queries for geoshapes? (Maybe interesting to find good pilot wikis for geopoints.)
    • Also seems so low that it's probably irrelevant.

Notes: https://docs.google.com/document/d/1HM5uo8onOVUws5zAT6taswRW_R0pSF3cf3IosmUcVkM

Details

Other Assignee
Andrew-WMDE

Event Timeline

lilients_WMDE renamed this task from Use caching mechanism for GeoPoints to Investigation: Use caching mechanism for GeoPoints. Mar 3 2022, 9:23 AM
lilients_WMDE set the point value for this task to 8. Mar 9 2022, 2:35 PM
lilients_WMDE updated the task description.
awight added subscribers: Tarrow, awight.

(I'm looking at the webrequest logs now, starting with this example query from @Tarrow.)

Empty-handed for the moment. I went into Superset and this Presto query comes back with no data. I've experimented with the regexp and it cannot find "kartotherian" anywhere (the ".*" is just out of exasperation).

SELECT time_firstbyte, cache_status, http_status, response_size, http_method, uri_path, uri_query, content_type, dt, hour, user_agent  
from webrequest
where
  regexp_like(user_agent, '.*Maps.*') and
  uri_host='query.wikidata.org'
  and webrequest_source='text'
  and year=2022 and month=3 and day=1
WMDE-Fisch changed the point value for this task from 8 to 5.

Also found no matches for uri_host='wdqs.discovery.wmnet'.

lilients_WMDE updated Other Assignee, added: Andrew-WMDE; removed: awight.

So, here are the counts for WDQS requests coming from Kartotherian vs. other sources. I would say we don't have to be concerned about the load we cause with any new geopoints features. (Caching is still an open question, of course.)

select
  count(*) as total_requests,
  sum(case when user_agent rlike 'kartotherian' then 1 else 0 end) as from_kartotherian
from wmf.webrequest
where
  uri_host='query.wikidata.org'
  and webrequest_source='text'
  and year=2022 and month=3
group by day;
total_requests  from_kartotherian
10020708        0
10935084        4
16188054        42
13376930        40
11559587        0
11205092        0
12956225        0
13969256        0
13340032        0
19021548        0
15284965        0
12968692        0
12982497        0
13393441        0
10390427        1
8516103         0
7841070         0
8576772         0
7612512         0
8044851         0
4621804         0
21 rows selected (1840.988 seconds)
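For reference, the daily rows can be collapsed into a single rate. The sketch below copies the figures from the result table verbatim; `DAILY_COUNTS` and `summarize` are names made up for this example.

```python
from typing import List, Tuple

# (total_requests, from_kartotherian) per day, copied from the query result.
DAILY_COUNTS: List[Tuple[int, int]] = [
    (10020708, 0), (10935084, 4), (16188054, 42), (13376930, 40),
    (11559587, 0), (11205092, 0), (12956225, 0), (13969256, 0),
    (13340032, 0), (19021548, 0), (15284965, 0), (12968692, 0),
    (12982497, 0), (13393441, 0), (10390427, 1), (8516103, 0),
    (7841070, 0), (8576772, 0), (7612512, 0), (8044851, 0),
    (4621804, 0),
]


def summarize(rows: List[Tuple[int, int]]) -> Tuple[int, int, float]:
    """Return (all requests, Kartotherian requests, Kartotherian requests per million)."""
    total = sum(total_requests for total_requests, _ in rows)
    from_kartotherian = sum(count for _, count in rows)
    return total, from_kartotherian, from_kartotherian / total * 1_000_000


if __name__ == "__main__":
    total, from_kartotherian, per_million = summarize(DAILY_COUNTS)
    print(f"{from_kartotherian} of {total} requests ({per_million:.2f} per million)")
```

Over the 21 days listed, this works out to 87 Kartotherian requests out of roughly 243 million, i.e. well under one per million.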

Question: Is this ticket only about the server-side rendering of static map images? We would need this as well for the (enlarged) dynamic map. That should show the same markers, shouldn't it?

thiemowmde renamed this task from Investigation: Use caching mechanism for GeoPoints to Investigation: Which data source to use for GeoPoints. Jul 20 2022, 8:41 AM
thiemowmde moved this task from Ready for pickup to In sprint on the WMDE-GeoInfo-FocusArea board.

The next step is to refine this task according to what we learned in our meeting with the WMF Search Platform team. Create tasks to adapt our code to the new data source.

For future reference, @Andrew-WMDE wrote:

[…] chat about the Wikibase REST API […]. It, along with other APIs, is supposed to replace commonly used queries in order to reduce the load on the WDQS. However, it's still in an early experimental phase without any concrete timeline for when it will be production ready. This means we also don't yet know what SLO to expect when it is deployed. Additionally, for our GeoPoints use case, we would have to implement the appropriate routes ourselves.
Looking back at the notes from our meeting with [WMF], I think we should be able to continue using WDQS, as long as we're happy with an SLO of 95% and we don't cause any cascading failures.
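One common way to avoid contributing to cascading failures is a circuit breaker on the client side: after repeated errors, stop calling the backend for a while and serve a fallback instead. The following is a generic sketch of that pattern, not something the ticket decided on; `CircuitBreaker`, its thresholds, and the fallback value are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical breaker: skips the backend after repeated failures."""

    def __init__(
        self,
        max_failures: int = 3,      # assumed threshold
        reset_after: float = 60.0,  # assumed cool-down in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_failures = max_failures
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: T) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                return fallback  # open: do not touch the backend at all
            # Cool-down elapsed: allow one trial call; a single failure re-opens.
            self._opened_at = None
            self._failures = self._max_failures - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()
            return fallback
        self._failures = 0  # any success closes the breaker fully
        return result
```

For a map, the fallback could be an empty marker list, so that a WDQS outage degrades the map rather than breaking the page.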

WMDE-Fisch subscribed.

We're going for WDQS for now.