
REQUEST: Add Special:AllEvents to allowlist for campaigns-product pageview tracking
Closed, ResolvedPublic

Description


Can Special:AllEvents be added to the allowlist used in the pageview definition to obtain the benefits of using the pageview pipeline on that page?

Context: Campaigns-Product would like to measure pageviews and referrals on the Event list (T365407). Specifically, they need pageviews of the event list page on the wikis where it is currently deployed, and pageviews from the event list page to individual event pages such as Desafío Uruguay. Right now, there is no data available out of the box for Special:AllEvents (T240676).
We're discussing tracking event list pageviews as part of planning to get the campaigns-product team set up with a Superset dashboard (T365404). Note: in the next fiscal year the team will be working on/with wiki project pages, and they will also need pageview and referral traffic data on those pages (to/from the event list and generally).

Completion checklist

(At any point, just ask for help)

  • Add allevents to this list per this request
  • Add unit tests by adding new lines to the pageview test data; use examples like this to sanity-check the change
  • (somewhat optional) To be safe, do a side-by-side comparison:
    • Build refinery-source locally
    • Understand how webrequest refine runs the pageview UDF
    • Copy the custom jars you just built (needed to run the pageview UDF) to a statXXXX machine
    • Run the new pageview UDF and the old one over an hour of wmf_raw.webrequest, looking for any discrepancies
  • Merge and Deploy Refinery Source (get review from Data Engineering too)
  • Update the Airflow job to point to the new Refinery Source version (add a new artifact and change this reference)
  • Merge and Deploy Airflow
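The core of the change in the first checklist item is a check in the PageviewDefinition that matches Special: pages against an allowlist. The real code is Java in analytics/refinery/source; the Python below is only an illustrative sketch of the shape of that check, and the function name and allowlist contents are hypothetical, not the actual refinery-source code.

```python
# Illustrative sketch only: the real check lives in PageviewDefinition
# (Java, analytics/refinery/source). The function name and allowlist
# contents here are hypothetical, shown just to convey the shape of it.

# Special: pages whose views count as pageviews; "allevents" is the new entry.
SPECIAL_PAGE_ALLOWLIST = {"search", "recentchanges", "allevents"}

def is_allowlisted_special_page(uri_path: str) -> bool:
    """Return True if the request path is a Special: page on the allowlist."""
    prefix = "/wiki/Special:"
    if not uri_path.startswith(prefix):
        return False
    page = uri_path[len(prefix):].split("/")[0].lower()
    return page in SPECIAL_PAGE_ALLOWLIST

# After the change, Special:AllEvents qualifies; unlisted pages still do not.
print(is_allowlisted_special_page("/wiki/Special:AllEvents"))  # True
print(is_allowlisted_special_page("/wiki/Special:Log"))        # False
```

The unit-test step then amounts to adding a webrequest line for /wiki/Special:AllEvents to the pageview test data and asserting it is now classified as a pageview.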

Event Timeline

Hello, @VirginiaPoundstone! I'm not sure if you are the right person to ping, but I wanted to check in to see if this can be looked into. This work is high priority for WE 1.1 in this fiscal year. Thank you!

Hi @ifried. What is the timeline for this work? Seems feasible, but I will need to check in with engineers to get an estimate on the work.

CC @wdoran

Hello, @VirginiaPoundstone! If possible, in the next few weeks, since we already have people using the Event List and we would like to begin measuring that usage for the work we will be doing in this fiscal year. Thank you!

@ifried we will create an implementation task and add it to our next sprint which begins July 24th.

CC @Milimetric and @WDoranWMF

Hello, @VirginiaPoundstone! One update since writing this request: we are planning to have 2 separate tabs on the Special:AllEvents page. One will be for 'Events' (which is what the user currently sees), and there will be a new tab for 'Communities,' a curated list of WikiProjects on the wiki. If we have the 2 separate tabs, would it be possible to get pageviews of those tabs separately, once Special:AllEvents is added to the allowlist? We are curious to know what options may be available. Thanks in advance!

Hi @ifried! @Iflorez and I discussed your question yesterday and it's going to come down to how the feature actually works. Either:

  • Scenario A: User can switch between the tabs seamlessly (powered by JS) without needing to reload the page.
    • In this scenario there is only one pageview of this special page no matter how many times the user goes back and forth between the tabs.
  • Scenario B: When the page loads, it checks for ?tab=Events (default) or ?tab=Communities, and switching between tabs causes the page to reload with a different URI query parameter.
    • In this scenario there is one pageview for each time the user switches tabs.

Under Scenario A you would need to instrument those views client-side with Metrics Platform. The instrument creation documentation is being actively improved right now, so that will be helpful for the engineers.

Under Scenario B we could count the tab views manually and get the tab name out of ?tab=<Tab name> in the server-side (webrequest/raw pageview) logs.
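For Scenario B, pulling the tab name out of the logged query string could look like this. This is a minimal Python sketch, assuming the `uri_query` field holds the raw query string as in the webrequest schema; the function name and the "Events" default are illustrative.

```python
from urllib.parse import parse_qs

def tab_from_uri_query(uri_query: str, default: str = "Events") -> str:
    """Extract the ?tab=<Tab name> value from a webrequest uri_query field.

    Falls back to the default tab when the parameter is absent,
    mirroring the default described for Scenario B.
    """
    params = parse_qs(uri_query.lstrip("?"))
    values = params.get("tab")
    return values[0] if values else default

print(tab_from_uri_query("?tab=Communities"))  # Communities
print(tab_from_uri_query(""))                  # Events (default tab)
```

A GROUP BY on this derived value over the server-side pageview logs would then give per-tab view counts.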

When @VirginiaPoundstone's team allowlists Special:AllEvents, it will be included in the pageview counting pipeline and become available on dashboards/tools like Pageviews Analysis. For example, here are the pageviews for Special:Search: https://pageviews.wmcloud.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=0&range=latest-20&pages=Special:Search

Anything more than that (e.g. counting views of specific contents of that page) would require bespoke calculations (and potentially client-side instrumentation).

@mpopov Thank you!

I spoke with @ifried just now about your comment. Ilana will think through the options. Scenario B may make the most sense for Campaigns-product.

Scenario B: When page loads it checks for ?tab=Events (default) or ?tab=Communities and switching between them causes the page to reload with different URI query parameter.
In this scenario there is one pageview for each time the user switches tabs.
Under Scenario B we could count the tab views manually and get the tab name out of ?tab=<Tab name> in the server-side (webrequest/raw pageview) logs.

After discussing, it appears that @ifried is most interested in gathering referral-count data from Special:AllEvents to event pages or wiki project pages. The goal is not the tabs per se, but navigation to an event page or wiki project page from whichever Special:AllEvents tab.
When @ifried is ready, and assuming the need remains, she'll submit a new request to Research and Data Science for referral counts from Special:AllEvents to event and wiki project pages.

Change #1062719 had a related patch set uploaded (by Mforns; author: Mforns):

[analytics/refinery/source@master] Add Special:AllEvents to the PageviewDefinition

https://gerrit.wikimedia.org/r/1062719

This comment was removed by mforns.

The code change above adds /wiki/Special:AllEvents to the PageviewDefinition, plus one unit test (I didn't see the need for more).

I built the jar and verified that the test failed before the code change and passed after it.

Created a webrequest table under my database with:

CREATE TABLE `mforns`.`webrequest` (
  `hostname` STRING COMMENT 'Source node hostname',
  `sequence` BIGINT COMMENT 'Per host sequence number',
  `dt` STRING COMMENT 'Timestamp at cache in ISO 8601',
  `time_firstbyte` DOUBLE COMMENT 'Time to first byte',
  `ip` STRING COMMENT 'IP of packet at cache',
  `cache_status` STRING COMMENT 'Cache status',
  `http_status` STRING COMMENT 'HTTP status of response',
  `response_size` BIGINT COMMENT 'Response size',
  `http_method` STRING COMMENT 'HTTP method of request',
  `uri_host` STRING COMMENT 'Host of request',
  `uri_path` STRING COMMENT 'Path of request',
  `uri_query` STRING COMMENT 'Query of request',
  `content_type` STRING COMMENT 'Content-Type header of response',
  `referer` STRING COMMENT 'Referer header of request',
  `x_forwarded_for` STRING COMMENT 'X-Forwarded-For header of request (deprecated)',
  `user_agent` STRING COMMENT 'User-Agent header of request',
  `accept_language` STRING COMMENT 'Accept-Language header of request',
  `x_analytics` STRING COMMENT 'X-Analytics header of response',
  `range` STRING COMMENT 'Range header of response',
  `is_pageview` BOOLEAN COMMENT 'Indicates if this record was marked as a pageview during refinement',
  `record_version` STRING COMMENT 'Keeps track of changes in the table content definition - https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest',
  `client_ip` STRING COMMENT 'Client IP - DEPRECATED - Same as IP.',
  `geocoded_data` MAP<STRING, STRING> COMMENT 'Geocoded map with continent, country_code, country, city, subdivision, postal_code, latitude, longitude, timezone keys and associated values.',
  `x_cache` STRING COMMENT 'X-Cache header of response',
  `user_agent_map` MAP<STRING, STRING> COMMENT 'User-agent map with browser_family, browser_major, device_family, os_family, os_major, os_minor and wmf_app_version keys and associated values',
  `x_analytics_map` MAP<STRING, STRING> COMMENT 'X_analytics map view of the x_analytics field',
  `ts` TIMESTAMP COMMENT 'Unix timestamp in milliseconds extracted from dt',
  `access_method` STRING COMMENT 'Method used to access the site (mobile app|mobile web|desktop)',
  `agent_type` STRING COMMENT 'Categorise the agent making the webrequest as either user or spider (automatas to be added).',
  `is_zero` BOOLEAN COMMENT 'NULL as zero program is over',
  `referer_class` STRING COMMENT 'Indicates if a referer is internal, external or unknown.',
  `normalized_host` STRUCT<`project_class`: STRING, `project`: STRING, `qualifiers`: ARRAY<STRING>, `tld`: STRING, `project_family`: STRING> COMMENT 'struct containing project_family (such as wikipedia or wikidata for instance), project (such as en or commons), qualifiers (a list of in-between values, such as m) and tld (org most often)',
  `pageview_info` MAP<STRING, STRING> COMMENT 'map containing project, language_variant and page_title values only when is_pageview = TRUE.',
  `page_id` BIGINT COMMENT 'MediaWiki page_id for this page title. For redirects this could be the page_id of the redirect or the page_id of the target. This may not always be set, even if the page is actually a pageview.',
  `namespace_id` INT COMMENT 'MediaWiki namespace_id for this page title. This may not always be set, even if the page is actually a pageview.',
  `tags` ARRAY<STRING> COMMENT 'List containing tags qualifying the request, ex: [portal, wikidata]. Will be used to split webrequest into smaller subsets.',
  `isp_data` MAP<STRING, STRING> COMMENT 'Internet Service Provider data in a map with keys isp, organization, autonomous_system_organization and autonomous_system_number',
  `accept` STRING COMMENT 'Accept header of request',
  `tls` STRING COMMENT 'TLS information of request',
  `tls_map` MAP<STRING, STRING> COMMENT 'Map view of TLS information (keys are vers, keyx, auth and ciph)',
  `ch_ua` STRING COMMENT 'Value of the Sec-CH-UA request header',
  `ch_ua_mobile` STRING COMMENT 'Value of the Sec-CH-UA-Mobile request header',
  `ch_ua_platform` STRING COMMENT 'Value of the Sec-CH-UA-Platform request header',
  `ch_ua_arch` STRING COMMENT 'Value of the Sec-CH-UA-Arch request header',
  `ch_ua_bitness` STRING COMMENT 'Value of the Sec-CH-UA-Bitness request header',
  `ch_ua_full_version_list` STRING COMMENT 'Value of the Sec-CH-UA-Full-Version-List request header',
  `ch_ua_model` STRING COMMENT 'Value of the Sec-CH-UA-Model request header',
  `ch_ua_platform_version` STRING COMMENT 'Value of the Sec-CH-UA-Platform-Version request header',
  `referer_data` STRUCT<`referer_class`: STRING, `referer_name`: STRING> COMMENT 'Struct containing referer_class (indicates if a referer is internal, external, external(media sites), external(search engine) or unknown.) and referer name (name of referer when referer class is external(search engine) or external(media sites))',
  `webrequest_source` STRING COMMENT 'Source cluster',
  `year` INT COMMENT 'Unpadded year of request',
  `month` INT COMMENT 'Unpadded month of request',
  `day` INT COMMENT 'Unpadded day of request',
  `hour` INT COMMENT 'Unpadded hour of request')
USING parquet
PARTITIONED BY (webrequest_source, year, month, day, hour)
LOCATION 'hdfs://analytics-hadoop/user/mforns/data/wmf/webrequest'

Ran the exact same Spark SQL command/query that generates wmf.webrequest, over one hour, using the newly compiled jar:

spark3-submit \
    --name refine_webrequest_hourly_text_special_allevents_test \
    --master yarn \
    --deploy-mode client \
    --queue production \
    --conf spark.executorEnv.SPARK_HOME=/usr/lib/spark3 \
    --conf spark.executorEnv.SPARK_CONF_DIR=/etc/spark3/conf \
    --conf spark.dynamicAllocation.maxExecutors=128 \
    --conf spark.yarn.appMasterEnv.SPARK_CONF_DIR=/etc/spark3/conf \
    --conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark3 \
    --driver-cores 1 \
    --driver-memory 4G \
    --executor-cores 2 \
    --executor-memory 12G \
    --class org.apache.spark.sql.hive.thriftserver.WMFSparkSQLCLIDriver \
    hdfs:///wmf/cache/artifacts/airflow/analytics/wmf-sparksqlclidriver-1.0.0.jar \
    -f hdfs://analytics-hadoop/wmf/refinery/current/hql/webrequest/refine_webrequest_hourly.hql \
    -d refinery_jar=hdfs:///user/mforns/artifacts/refinery-job-0.2.46-SNAPSHOT-shaded.jar \
    -d source_table=wmf_raw.webrequest \
    -d webrequest_source=text \
    -d destination_table=mforns.webrequest \
    -d year=2024 \
    -d month=8 \
    -d day=13 \
    -d hour=18 \
    -d record_version=0.0.27 \
    -d coalesce_partitions=256 \
    -d spark_sql_shuffle_partitions=256 \
    -d excluded_row_ids=

Checked that overall webrequest count was not altered:

> select count(*) from wmf.webrequest where year=2024 and month=8 and day=13 and hour=18 and webrequest_source='text';
count(1)           
346708973
Time taken: 19.559 seconds, Fetched 1 row(s)

> select count(*) from mforns.webrequest where year=2024 and month=8 and day=13 and hour=18 and webrequest_source='text';
count(1)
346708973
Time taken: 39.385 seconds, Fetched 1 row(s)

Saw that the hour had only 1 extra webrequest tagged as pageview, in comparison to the production one:

> select count(*) from wmf.webrequest where year=2024 and month=8 and day=13 and hour=18 and webrequest_source='text' and is_pageview;
count(1)
34767312
Time taken: 41.352 seconds, Fetched 1 row(s)

> select count(*) from mforns.webrequest where year=2024 and month=8 and day=13 and hour=18 and webrequest_source='text' and is_pageview;
count(1)
34767313
Time taken: 40.784 seconds, Fetched 1 row(s)

This matched expectations: of the 3 webrequests to Special:AllEvents in that hour, 2 were 404s, so only one counted as a real pageview.
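The side-by-side check above boils down to comparing total row counts and per-request `is_pageview` flags between the production table and the test table. A toy Python sketch of that comparison, with made-up rows standing in for the two refinements of the same hour:

```python
# Toy stand-ins for the production and test refinements of the same hour:
# each entry is (uri_path, is_pageview). Only the Special:AllEvents request
# flips from False to True under the new PageviewDefinition.
prod = [("/wiki/Main_Page", True), ("/wiki/Special:AllEvents", False)]
test = [("/wiki/Main_Page", True), ("/wiki/Special:AllEvents", True)]

def pageview_count(rows):
    return sum(1 for _, is_pv in rows if is_pv)

# Total webrequest count unchanged; exactly one new pageview.
print(len(prod) == len(test))                            # True
print(pageview_count(test) - pageview_count(prod))       # 1

# To see *which* requests changed classification, diff the flags row by row:
changed = [path for (path, a), (_, b) in zip(prod, test) if a != b]
print(changed)  # ['/wiki/Special:AllEvents']
```

In practice the same comparison is done with two COUNT(*) queries (as above) plus a join on the row identity columns to locate the discrepant requests.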

It seems a bit strange that there is only 1 pageview to Special:AllEvents in a whole hour, but the data looks good. I will move forward and request CR from Data Engineering :-)

@mforns, we are expecting very low pageviews at this point since we have not yet created ways for people to organically discover the page within their workflows. Thank you for doing this work!

Change #1062719 merged by jenkins-bot:

[analytics/refinery/source@master] Add Special:AllEvents to the PageviewDefinition

https://gerrit.wikimedia.org/r/1062719

Mentioned in SAL (#wikimedia-analytics) [2024-08-19T20:45:03Z] <mforns> deployed airflow-dags to analytics_test instance for T368303

I deployed the changes to the test cluster.
I will wait until tomorrow to make sure there are no unexpected issues,
and then deploy to the production cluster.

I just deployed to the production cluster.
I will be following the executions of webrequest loading in case there are issues.
Otherwise, this is done :-)

@ifried this should be done and collecting data for your team now.