As WMF staff, I want to be able to identify pageviews/visits originating with a shared link so that we have a durable (i.e., post-experiment) way of knowing how many readers find wiki projects via sharing and can prioritize future work accordingly.
References:
- webrequest documentation https://wikitech.wikimedia.org/wiki/Data_Platform/Data_Lake/Traffic/Webrequest
- HTTP `Referer` header https://en.wikipedia.org/wiki/en:HTTP_referer
==Report
====Goal
- use the `wprov` URL query parameter to indicate that a text-fragment URL comes from the Share Highlight feature
- extract relevant traffic from webrequest by looking for the reserved `x_analytics_map.wprov` value
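A minimal sketch of how such a tagged URL could be assembled, using only the Python standard library. The helper name `build_share_url` is hypothetical and is not the Share Highlight feature's actual code; it only illustrates where `wprov` and the text fragment sit in the URL. It assumes the text-fragment convention that `-`, `&` and `,` inside the quoted text are percent-encoded.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit


def build_share_url(page_url: str, highlighted_text: str, wprov: str = "shhu0") -> str:
    """Hypothetical helper: add ?wprov=... and a #:~:text=... directive to page_url."""
    parts = urlsplit(page_url)

    # Keep any existing query parameters, then set (or override) wprov.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["wprov"] = wprov

    # Percent-encode the highlighted text. quote() leaves "-" untouched,
    # so it is encoded by hand because "-" has a special meaning in text fragments.
    encoded_text = quote(highlighted_text, safe="").replace("-", "%2D")
    fragment = f":~:text={encoded_text}"

    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), fragment))


print(build_share_url(
    "https://fr.wikipedia.org/wiki/DBpedia",
    "DBpedia est un projet universitaire et communautaire",
))
```

The query parameter comes before `#`, so it is part of the request sent to the server; the text fragment comes after it.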
====Approach
- paste the following 4 text-fragment URLs with `?wprov=shhu0` in an [etherpad](https://etherpad.wikimedia.org/p/share_highlight_URL_referrer_test):
- https://it.wikipedia.org/wiki/DBpedia?wprov=shhu0#:~:text=DBpedia%20%C3%A8%20un%20progetto%20nato,Open%20Data%20in%20formato%20RDF.
- https://fr.wikipedia.org/wiki/DBpedia?wprov=shhu0#:~:text=DBpedia%20est%20un%20projet%20universitaire%20et%20communautaire
- https://en.wikipedia.org/wiki/DBpedia?wprov=shhu0#:~:text=DBpedia%20(from%20%22DB%22%20for%20%22database%22)%20is%20a%20project
- https://pl.wikipedia.org/wiki/DBpedia?wprov=shhu0#:~:text=DBpedia%20(DB%20%E2%80%93%20baza%20danych)%20%E2%80%93%20projekt%20maj%C4%85cy%20na%20celu%20usystematyzowanie
- click on them
- wait for the relevant [refine hourly job](https://airflow.wikimedia.org/dags/refine_webrequest_hourly_text/grid)
- check webrequest
```python
from wmfdata.spark import create_session

spark = create_session(app_name='shh-referrers')

# Read a single hour partition of refined webrequest and keep only
# successful requests tagged with the Share Highlight wprov value.
q = """SELECT
    referer,
    referer_class,
    uri_host,
    uri_path,
    x_analytics_map.wprov AS wprov
FROM
    wmf.webrequest
WHERE
    year = 2026
    AND month = 4
    AND day = 17
    AND hour = 14
    AND http_status = 200
    AND x_analytics_map.wprov = 'shhu0'
"""
ddf = spark.sql(q)
ddf.show(truncate=False)
# Output:
# +-------+-------------+----------------+-------------+-----+
# |referer|referer_class|uri_host        |uri_path     |wprov|
# +-------+-------------+----------------+-------------+-----+
# |-      |none         |fr.wikipedia.org|/wiki/DBpedia|shhu0|
# |-      |none         |pl.wikipedia.org|/wiki/DBpedia|shhu0|
# |-      |none         |en.wikipedia.org|/wiki/DBpedia|shhu0|
# |-      |none         |it.wikipedia.org|/wiki/DBpedia|shhu0|
# +-------+-------------+----------------+-------------+-----+
```
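For the durable counting described in the user story, the same filter could be aggregated over a full day. The sketch below only builds the SQL string; `build_share_count_query` is a hypothetical helper, and it reuses the columns from the query above. It counts requests rather than pageviews, so it is a starting point and not a finished metric.

```python
def build_share_count_query(year: int, month: int, day: int, wprov: str = "shhu0") -> str:
    """Hypothetical helper: daily request counts per wiki and referer class."""
    # Guard against injecting arbitrary SQL through the wprov value.
    if not wprov.isalnum():
        raise ValueError(f"unexpected wprov value: {wprov!r}")

    return f"""
SELECT
    uri_host,
    referer_class,
    COUNT(*) AS requests
FROM wmf.webrequest
WHERE year = {int(year)}
    AND month = {int(month)}
    AND day = {int(day)}
    AND http_status = 200
    AND x_analytics_map.wprov = '{wprov}'
GROUP BY uri_host, referer_class
ORDER BY requests DESC
"""


q_daily = build_share_count_query(2026, 4, 17)
print(q_daily)
# With the session created above: spark.sql(q_daily).show(truncate=False)
```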
====Takeaways
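- the URL's text fragment is not stored: browsers do not include the fragment (everything after `#`) in the HTTP request, so it never reaches webrequest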
- if the source platform does not send a `Referer` header, we cannot identify it: `referer` is `-` and `referer_class` is `none`, as with the Etherpad test above
- `wmf_raw.webrequest` holds the unrefined data and may be worth a look
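The split between what the server receives and what stays in the browser can be checked locally. This is a rough sketch using the standard library; `server_visible_parts` is a hypothetical name, and it models only the general HTTP rule that the fragment is not part of the request, not anything specific to the webrequest pipeline.

```python
from urllib.parse import parse_qs, urlsplit


def server_visible_parts(url: str) -> dict:
    """Hypothetical helper: separate what is sent in the request from the fragment."""
    parts = urlsplit(url)
    # The request target is the path plus the query string; the fragment is excluded.
    request_target = parts.path + (f"?{parts.query}" if parts.query else "")
    wprov = parse_qs(parts.query).get("wprov", [None])[0]
    return {
        "host": parts.netloc,
        "request_target": request_target,
        "wprov": wprov,
        "fragment_not_sent": parts.fragment,
    }


info = server_visible_parts(
    "https://en.wikipedia.org/wiki/DBpedia?wprov=shhu0"
    "#:~:text=DBpedia%20(from%20%22DB%22%20for%20%22database%22)%20is%20a%20project"
)
print(info["request_target"])  # /wiki/DBpedia?wprov=shhu0
print(info["wprov"])           # shhu0
```

Only `host`, `request_target` and the derived `wprov` correspond to data a server can log; the `:~:text=` part is handled entirely by the browser.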