As WMF staff, I want to be able to identify pageviews/visits originating with a shared link so that we have a durable (i.e., post-experiment) way of knowing how many readers find wiki projects via sharing and can prioritize future work accordingly.
References:
- webrequest documentation https://wikitech.wikimedia.org/wiki/Data_Platform/Data_Lake/Traffic/Webrequest
- HTTP referer header https://en.wikipedia.org/wiki/en:HTTP_referer
Report
Goal
- use the wprov URL query parameter to indicate that a text-fragment URL comes from the Share Highlight feature
- extract relevant traffic from webrequest by looking for the reserved x_analytics_map.wprov value
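As a minimal sketch of the first goal, the snippet below appends wprov to a text-fragment URL. The helper name `add_wprov` is hypothetical and this is not the Share Highlight implementation. It only illustrates that the parameter has to sit in the query string, before the `#`, so that it is part of the request the server logs.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_wprov(url: str, value: str = "shhu0") -> str:
    """Return url with wprov=value in its query string, fragment untouched.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # Drop any existing wprov so the parameter appears exactly once.
    params = [
        (key, val)
        for key, val in parse_qsl(parts.query, keep_blank_values=True)
        if key != "wprov"
    ]
    params.append(("wprov", value))
    # urlunsplit puts the query before '#', leaving the text fragment as-is.
    return urlunsplit(parts._replace(query=urlencode(params)))


print(add_wprov("https://fr.wikipedia.org/wiki/DBpedia#:~:text=DBpedia%20est%20un%20projet"))
```

Note that `urlencode` re-encodes any pre-existing query parameters, so their escaping may be normalized. The fragment is passed through unchanged.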
Approach
- paste the following 4 text-fragment URLs with ?wprov=shhu0 in an etherpad:
  - https://it.wikipedia.org/wiki/DBpedia?wprov=shhu0#:~:text=DBpedia%20%C3%A8%20un%20progetto%20nato,Open%20Data%20in%20formato%20RDF.
  - https://fr.wikipedia.org/wiki/DBpedia?wprov=shhu0#:~:text=DBpedia%20est%20un%20projet%20universitaire%20et%20communautaire
  - https://en.wikipedia.org/wiki/DBpedia?wprov=shhu0#:~:text=DBpedia%20(from%20%22DB%22%20for%20%22database%22)%20is%20a%20project
  - https://pl.wikipedia.org/wiki/DBpedia?wprov=shhu0#:~:text=DBpedia%20(DB%20%E2%80%93%20baza%20danych)%20%E2%80%93%20projekt%20maj%C4%85cy%20na%20celu%20usystematyzowanie
- click on them
- wait for the relevant hourly refine job to complete
- check webrequest
```python
from wmfdata.spark import create_session

spark = create_session(app_name='shh-referrers')

q = """
SELECT referer, referer_class, uri_host, uri_path, x_analytics_map.wprov AS wprov
FROM wmf.webrequest
WHERE year = 2026 AND month = 4 AND day = 17 AND hour = 14
  AND http_status = 200
  AND x_analytics_map.wprov = 'shhu0'
"""

ddf = spark.sql(q)
ddf.show(truncate=False)
```

```
+-------+-------------+----------------+-------------+-----+
|referer|referer_class|uri_host        |uri_path     |wprov|
+-------+-------------+----------------+-------------+-----+
|-      |none         |fr.wikipedia.org|/wiki/DBpedia|shhu0|
|-      |none         |pl.wikipedia.org|/wiki/DBpedia|shhu0|
|-      |none         |en.wikipedia.org|/wiki/DBpedia|shhu0|
|-      |none         |it.wikipedia.org|/wiki/DBpedia|shhu0|
+-------+-------------+----------------+-------------+-----+
```
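For the durable count the user story asks for, the same filter could be turned into a per-day aggregate. The sketch below only builds the SQL string, using the columns already queried above. The function name `wprov_counts_query` and the grouping by uri_host and referer_class are assumptions, not an agreed metric definition.

```python
import re


def wprov_counts_query(year: int, month: int, day: int, wprov: str = "shhu0") -> str:
    """Build a daily count of requests carrying a given wprov value.

    Hypothetical helper: grouping columns are an illustrative choice.
    """
    # Reject anything but a simple token so the value is safe to inline.
    if not re.fullmatch(r"[A-Za-z0-9_]+", wprov):
        raise ValueError(f"unexpected wprov value: {wprov!r}")
    return f"""
SELECT uri_host, referer_class, COUNT(*) AS requests
FROM wmf.webrequest
WHERE year = {int(year)} AND month = {int(month)} AND day = {int(day)}
  AND http_status = 200
  AND x_analytics_map.wprov = '{wprov}'
GROUP BY uri_host, referer_class
ORDER BY requests DESC
""".strip()


print(wprov_counts_query(2026, 4, 17))
```

With the session from the check above, the result could be inspected via `spark.sql(wprov_counts_query(2026, 4, 17)).show(truncate=False)`.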
Takeaways
- the URL's text fragment doesn't seem to be stored; this is consistent with browsers not sending the fragment (the part after #) to the server
- there's nothing we can do if the source platform doesn't send a referrer: referer is then - and referer_class is none. Etherpad is an example
- wmf_raw.webrequest holds the unrefined data and may be worth a look
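On the first takeaway: browsers keep the fragment on the client side, so only host, path and query reach the server. The sketch below splits one of the test URLs into the part that is sent and the part that is not. The function name and the dictionary keys are illustrative and do not correspond exactly to webrequest columns.

```python
from typing import Dict, Tuple
from urllib.parse import urlsplit


def split_request_and_fragment(url: str) -> Tuple[Dict[str, str], str]:
    """Separate what an HTTP request carries from the client-only fragment."""
    parts = urlsplit(url)
    # Host, path and query are sent to the server; the fragment is not.
    sent = {"host": parts.netloc, "path": parts.path, "query": parts.query}
    return sent, parts.fragment


sent, fragment = split_request_and_fragment(
    "https://fr.wikipedia.org/wiki/DBpedia?wprov=shhu0"
    "#:~:text=DBpedia%20est%20un%20projet%20universitaire%20et%20communautaire"
)
print(sent)      # the only parts a server-side log can contain
print(fragment)  # the text directive, which stays in the browser
```

This is why wprov, carried in the query string, shows up in x_analytics_map while the highlighted text does not.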