Record and aggregate page previews
Open · Normal · Public · 8 Story Points

Description

This is the backend task corresponding to the client-side instrumentation work described in T184793: [EPIC] Instrument page interactions.
As decided in the recent Analytics-l thread ("How best to accurately record page interactions in Page Previews"), we want to record and aggregate the events generated by this instrumentation in a form consistent with our existing content consumption measurement, concretely: the fields available in the https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly table. (Opinions still differ on whether we should extend the existing table - as e.g. outlined here - or create a separate one.)

Here is what this means in detail: we will record the fields corresponding to the following from pageview_hourly:

project                 string                  Project name from requests hostname
language_variant        string                  Language variant from requests path (not set if present in project name)
page_title              string                  Page Title from requests path and query

(for page previews, this will be extracted from the Eventlogging event instead)

access_method           string                  Method used to access the pages, can be desktop, mobile web, or mobile app

This will always be desktop for now, although we may want to keep the field in case we want to extend this to e.g. measure link previews on the apps.

zero_carrier            string                  Zero carrier if pageviews are accessed through one, null otherwise

This one is probably not essential, given that most zero pageviews happen on mobile web, and other considerations.

agent_type              string                  Agent accessing the pages, can be spider or user
continent               string                  Continent of the accessing agents (computed using maxmind GeoIP database)
country_code            string                  Country iso code of the accessing agents (computed using maxmind GeoIP database)
country                 string                  Country (text) of the accessing agents (computed using maxmind GeoIP database)
subdivision             string                  Subdivision of the accessing agents (computed using maxmind GeoIP database)
city                    string                  City iso code of the accessing agents (computed using maxmind GeoIP database)
user_agent_map      	map<string,string>  	User-agent map with device_family, browser_family, browser_major, os_family, os_major, os_minor and wmf_app_version keys and associated values
record_version          string                  Keeps track of changes in the table content definition - https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly
view_count              bigint                  number of pageviews

(i.e. here number of page previews, mutatis mutandis)

page_id             	int                 	MediaWiki page_id for this page title. For redirects this could be the page_id of the redirect or the page_id of the target. This may not always be set, even if the page is actually a pageview.
namespace_id            int                     MediaWiki namespace_id for this page title. [...]
year                    int                     Unpadded year of pageviews
month                   int                     Unpadded month of pageviews
day                     int                     Unpadded day of pageviews
hour                    int                     Unpadded hour of pageviews

The basic principle is that if a reader visits a page and then uses the page preview feature on that page to read preview cards, all the above metadata fields should have identical values for both the preview and the pageview (save for obvious exotic cases, like the reader travelling to another country between opening the page and opening the preview ;).

To the above, we will want to add the information about the page from which the preview is being viewed (basically the referrer), in the same format as for the page being consumed in preview form; say as source_page_title, source_page_id, and source_namespace_id.

Related Objects


Inferring the title from the URL can be done too, but given that a preview is happening, the title or page_id should be available, right? As that content was retrieved from the content service.

The page_id we are interested in is the page_id of the /target/ page. The only information we have available is that in the response: https://en.wikipedia.org/api/rest_v1/page/summary/San_Francisco - so we'd be able to get page_id and title but not namespace (but can derive namespace from title). My question is do we need to? I'm not sure what pageviews does, but would it be more consistent to calculate those the same way as they're done in pageviews?

The only information we have available is that in the response: https://en.wikipedia.org/api/rest_v1/page/summary/San_Francisco - so we'd be able to get page_id and title but not namespace (but can derive namespace from title). My question is do we need to?

Send them from the client? Yes please. page_id is sent by MediaWiki (in many but not all instances); the page title is parsed from the URL when pageviews are refined, but if you can send it, it will save CPU cycles on the backend because the parsing would not be needed.

Inferring the title from the URL can be done too, but given that a preview is happening, the title or page_id should be available, right? As that content was retrieved from the content service.

The page_id we are interested in is the page_id of the /target/ page. The only information we have available is that in the response: https://en.wikipedia.org/api/rest_v1/page/summary/San_Francisco - so we'd be able to get page_id and title but not namespace (but can derive namespace from title). My question is do we need to? I'm not sure what pageviews does, but would it be more consistent to calculate those the same way as they're done in pageviews?

I'll open a ticket with Readers Infra about adding namespace ID to the response. Working around a lack of data in the response by sending partial data from the client and reconstructing it on the server feels like knowingly introducing technical debt by splitting the implementation of the virtual pageview data pipeline across an API boundary (the EventLogging pipeline).

BTW, for webrequest, iirc we only get page_id and namespace_id sent from the the client in the X-Analytics header, which is set by MediaWiki. So, sending them from the client is how pageviews gets them too.

phuedx added a comment (edited). Feb 15 2018, 5:38 PM

I'll open a ticket with Readers Infra about adding namespace ID to the response. Working around a lack of data in the response by sending partial data from the client and reconstructing it on the server feels like knowingly introducing technical debt by splitting the implementation of the virtual pageview data pipeline across an API boundary (the EventLogging pipeline).

There's no need for a new ticket!

By the time that this instrumentation is deployed, the new Page Summary API will have been deployed as well. Per the example response in T177431: Develop a Summary JSON API, the API returns the namespace ID in the namespace.id property, e.g. http://appservice.wmflabs.org/en.wikipedia.org/v1/page/summary/Dog. Moreover, the response from the MediaWiki API (the "fallback API") also returns the namespace ID in the ns property, e.g. https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=info%7Cextracts%7Cpageimages%7Crevisions&titles=Dog&formatversion=2&rvprop=timestamp.

I'll open a ticket with Readers Infra about adding namespace ID to the response.

New endpoint has this (I believe!)

I had a lovely chat with @Ottomata and I think I can pin down this schema today.

In terms of what I need to log:

  • page_title will be provided by Popups and retrieved from response
  • page_id will be provided by Popups and retrieved from response
    • namespace_id will be provided by Popups and will always send 0 (received from the summary endpoint in future)

  • mw.config will be used for source_page_title, source_page_id, and source_namespace_id.
  • target URI and referrer URI - access_method, project and language variant will be provided by default, as they are part of the EventLogging capsule, but we'll add the URIs in case we need to determine additional information/context. In future we may change things to make other fields redundant and derived.

Other fields will be provided in processing.
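To make the list above concrete, here is a minimal Python sketch of how such an event payload could be assembled from the two data sources named (the REST summary response and mw.config). The function name and the exact key names used for the summary dict are illustrative assumptions, not the actual Popups code:

```python
def build_virtual_pageview_event(summary, config):
    """Sketch: assemble a VirtualPageView-style event.

    summary: parsed REST summary response for the previewed (target) page
             (assumed keys: 'title', 'pageid').
    config:  mw.config values from the source page.
    """
    return {
        # Target page: provided by Popups, retrieved from the summary response.
        "page_title": summary["title"],
        "page_id": summary["pageid"],
        # Hardcoded 0 until the summary endpoint exposes the namespace ID.
        "page_namespace": 0,
        # Source page: taken from mw.config on the page hosting the preview.
        "source_title": config["wgTitle"],
        "source_page_id": config["wgArticleId"],
        "source_namespace": config["wgNamespaceNumber"],
    }
```

The remaining fields (geolocation, user agent, etc.) would be filled in during processing, as the comment above notes.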

Change 411079 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/Popups@master] Upgrade schema and log the required fields

https://gerrit.wikimedia.org/r/411079

Change 411079 merged by Pmiazga:
[mediawiki/extensions/Popups@master] Upgrade schema and log the required fields

https://gerrit.wikimedia.org/r/411079

Jdlrobson added a comment (edited). Feb 26 2018, 7:25 PM

All done our end @Ottomata and @Nuria
The funnel is disabled by default via $wgPopupsVirtualPageViews. When enabled https://meta.wikimedia.org/wiki/Schema:VirtualPageView will be hit for all mobile page views.
How shall we go about turning this on?
Do we want to do beta cluster first?

Nuria added a comment. Feb 26 2018, 8:14 PM

The funnel is disabled by default via $wgPopupsVirtualPageViews. When enabled https://meta.wikimedia.org/wiki/Schema:VirtualPageView will be hit for all mobile page views.

You mean "desktop" right?

How shall we go about turning this on? Do we want to do beta cluster first?

Yes, please; like any EventLogging schema, let's test on beta first.

Once that is done we will get hits for all wikis in which the feature is enabled, correct?

Yup sorry! Desktop.

Once that is done we will get hits for all wikis in which the feature is enabled, correct?

We can configure it on a per-wiki basis, e.g. Catalan wiki, then Hebrew, then all. Just let us know what that roll-out plan would look like.

Nuria added a comment. Feb 26 2018, 8:24 PM

Sounds good, let's test correctness on the beta cluster, and afterwards we can enable it in prod for one wiki and see how it goes.

Change 414769 had a related patch set uploaded (by Jdlrobson; owner: Pmiazga):
[operations/mediawiki-config@master] Enable VirtualPagePreviews events on beta cluster

https://gerrit.wikimedia.org/r/414769

Change 414769 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[operations/mediawiki-config@master] beta: enable VirtualPagePreviews events on beta cluster

https://gerrit.wikimedia.org/r/414769

Change 414769 merged by jenkins-bot:
[operations/mediawiki-config@master] beta: enable VirtualPagePreviews events on beta cluster

https://gerrit.wikimedia.org/r/414769

Mentioned in SAL (#wikimedia-operations) [2018-02-28T14:17:13Z] <zfilipin@tin> Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:414769|beta: enable VirtualPagePreviews events on beta cluster (T184793 T186728)]] (duration: 00m 57s)

Ah, I just realized that agent_type can be obtained from the EventLogging userAgent['is_bot'] field. The logic that sets this is the same as the one that sets agent_type in webrequest. So: if userAgent['is_bot'], agent_type = 'spider'; else agent_type = 'user'.
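A minimal sketch of that mapping (hypothetical Python helper; the real logic lives in the refinement pipeline, this just restates the rule above):

```python
def agent_type(user_agent):
    """Mirror the rule described above: EventLogging's userAgent map
    carries an is_bot flag set by the same classifier as webrequest's
    agent_type, so the mapping is a simple two-way branch."""
    return "spider" if user_agent.get("is_bot") else "user"
```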

FYI, I just deployed the change that does geocoding (and deduplication) for EventLogging events in Hive. Check it out! From now on, all tables in the Hive event database should have a geocoded_data map field.

💪 Excellent work, @Ottomata!

Change 419268 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Blacklist VirtualPageView schema from EL MySQL

https://gerrit.wikimedia.org/r/419268

Change 419268 merged by Ottomata:
[operations/puppet@production] Blacklist VirtualPageView schema from EL MySQL

https://gerrit.wikimedia.org/r/419268

For those following along - T188310 tracks the aggregation job and my team should work to carefully roll out the schema to all wikis (T189906). Once those steps are done and DNT is added (T187277) we should be good to go.

Thanks again for researching this! BTW what is our process for keeping these two regexes (webrequest/EL) in sync?

More notes from our meeting on Friday, for reference:

  • @Jdlrobson verified (by tracking down the sources for each) that the titles of the previewed page and the source page are stored in the same format (in particular, no blanks vs. underscores discrepancies or the like)
  • @Ottomata confirmed that we will be able to generate the project field in the required form during aggregation
  • Also, we were wondering about the current rate of events from huwiki, which seemed way too low on this dashboard - but I now understand that's because of the recent backend switch; it looks more plausible on the new "Jumbo" dashboard.
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board. Mar 28 2018, 12:12 PM

Hey all, I started looking into aggregating event.virtualpageviews to wmf.virtualpageviews_hourly.

What about page previews of redirect pages? How does the page preview client handle those?
Depending on how it works, it might count previews differently from pageviews.
IIRC, pageview_hourly counts soft redirects as pageviews.

@mforns in theory we should never log events from redirects, as we only ever show the preview of the canonical URI. If we do, it would suggest a bug in RESTBase.

I'm not sure the same applies to source_page_title, since that's the referrer, but it might (let me check today).

@mforns We use wgPageTitle for the source title, thus all titles will be canonical...
... with one edge case: when the redirect is disabled.

A page preview of Barack Obama on https://en.wikipedia.org/wiki/Barack?redirect=no will send an event with source_page_title "Barack" and page_title of the page preview as "Barack Obama".

mforns added a comment (edited). Mar 28 2018, 6:40 PM

@Jdlrobson
That's why I mentioned that. I understand you want to equate pagePreviews with pageViews as much as possible.
If I'm not confused, an actual pageview resulting from a click on a redirect counts as a pageview of the redirect, even if the content shown is the target.
The overall counts of pageviews/previews should match, though.

I think we are good then. Here the pageview of Barack Obama will also be tracked. Only the source URL might be a non-canonical title here, but that's not what we are counting.

OK, thanks!

Another question:
Pageview_hourly extracts its project field using the GetPageviewInfo UDF on the uriHost, uriPath and uriQuery of a webrequest.
Page previews does not have those exact fields, and also does not provide the target URL (obviously because it hasn't been clicked).
Now, as an alternative, we can call GetPageviewInfo UDF with EL's webHost field as uriHost (same), plus empty strings for uriPath and uriQuery.
This would return the project just as in pageview_hourly! However, the language variant would not be returned because it is extracted from the uriPath.
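As a rough illustration of the limitation: deriving the project from a hostname alone is straightforward, but the language variant lives in the URI path, which EL's webHost does not carry. A hypothetical Python sketch (not the actual GetPageviewInfo UDF, which is a Hive/Java UDF; the hostname-normalization rules here are simplified assumptions):

```python
def project_from_webhost(web_host):
    """Sketch: derive a pageview_hourly-style 'project' value from an
    EventLogging webHost, e.g. 'en.wikipedia.org' -> 'en.wikipedia'.

    With only the hostname (no uriPath or uriQuery), the language
    variant cannot be derived - exactly the gap discussed above.
    """
    parts = web_host.lower().split(".")
    # Drop the top-level domain (org/com/net).
    if parts and parts[-1] in ("org", "com", "net"):
        parts = parts[:-1]
    # Drop mobile/zero/www subdomain markers, e.g. en.m.wikipedia -> en.wikipedia.
    parts = [p for p in parts if p not in ("m", "zero", "www")]
    return ".".join(parts)
```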

Options:

  1. We can leave the language_variant field blank, or not have such a field.
  2. We can add the target URL to the event (not even sure this is possible)
  3. Can we extract the language variant from EL's webHost? If so, would it match the value from pageview_hourly?

@mforns, @Jdlrobson and I discussed this. You should be able to use the source_url field on the VirtualPageView event for this. In all cases, the project and language variant etc. will be the same for the source page as the target page.

We can add the target URL to the event (not even sure this is possible)

Can we not use source_url field in the schema? https://meta.wikimedia.org/wiki/Schema:VirtualPageView
The project/language will always be the same (for now as we don't support interwiki links)

Oh... of course. Sorryyy

mforns added a comment. Apr 4 2018, 7:24 PM

Another potential issue I've seen is that VirtualPageView.event.page_title is formatted as text (with spaces), whereas pageview_hourly.page_title is formatted as a URL (with underscores). Would that be a problem? We might want to write a UDF to transform from text to URL and vice versa?

Tbayer added a comment. Apr 4 2018, 7:28 PM

Another potential issue I've seen is that VirtualPageView.event.page_title is formatted as text (with spaces), whereas pageview_hourly.page_title is formatted as a URL (with underscores). Would that be a problem? We might want to write a UDF to transform from text to URL and vice versa?

Good point - these should be consistent (seems we somehow overlooked that when we discussed the format earlier T186728#4069562 ).

mforns added a comment. Apr 4 2018, 7:33 PM

@Tbayer
Would it be possible to send the page_title (and source_title) in URL format already from the client?

Jdlrobson added a comment (edited). Apr 4 2018, 7:49 PM

@mforns for page_title we are at the mercy of the data in the REST response, which includes spaces as " ", not "_".
e.g. for https://en.wikipedia.org/api/rest_v1/page/summary/San_Francisco
For consistency we purposely used wgTitle so these values are calculated and formatted via the same method.

For example - note format of page_title and source_title without '_'

"/beacon/event?{"event":{"source_page_id":49728,"source_namespace":0,"source_title":"San Francisco","source_url":"https://en.wikipedia.org/wiki/San_Francisco","page_title":"New York City","page_id":645042,"page_namespace":0},"revision":17742133,"schema":"VirtualPageView","webHost":"en.wikipedia.org","wiki":"enwiki"};"

If we were to add the underscores it's easy for source_title (we can use wgRelevantPageName), but we'd need to do a regex on the title of the response from REST, which may lead to inconsistencies (I'm not sure).

Likewise if we were to add a "page_url" field for the target page e.g. New York we'd need to calculate this on the client.
e.g.

> page_url: window.location.protocol + '//' + window.location.host + mw.util.getUrl( 'New York City' )
> "https://en.wikipedia.org/wiki/New_York_City"

Which specifically are you asking for? A page_url field representing the URL of the "target" or page_title with underscores?

We can do either/both, but it's likely to delay things on our end, so if you can do it on your end that would speed things up.

mforns added a comment. Apr 4 2018, 7:58 PM

@Jdlrobson

In https://en.wikipedia.org/api/rest_v1/page/summary/San_Francisco I can see the URL-formatted title (underscores) under titles.canonical - is that accessible to you? Also, there is the target URL under content_urls.desktop.page, no? At this point, I don't think we need the full target URL; the page title would suffice. However, it would depend on the amount of work on your side. On our side, I don't think writing a UDF is a lot of work. I'm just concerned that replacing spaces with underscores will not be enough to cover all cases, and we'll get inconsistencies.
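A sketch of that preference in Python (hypothetical helper; the summary-response shape follows the fields named above, and the space-to-underscore fallback carries exactly the inconsistency risk mentioned):

```python
def canonical_title(summary):
    """Sketch: pick a URL-formatted (underscored) page title.

    Prefer titles.canonical from the REST summary response when present;
    otherwise fall back to replacing spaces with underscores, which may
    not cover every MediaWiki title-normalization case.
    """
    titles = summary.get("titles", {})
    if "canonical" in titles:
        return titles["canonical"]
    return summary.get("title", "").replace(" ", "_")
```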

Oh, you're right! Looks like we could use that. We can fix this, but we'd be at the mercy of the deployment schedule, which may delay things. I've set up T191471. Can you work around this issue in the meantime, since we are collecting data and this may take up to a week?

mforns added a comment. Apr 4 2018, 9:02 PM

Can you workaround this issue in the mean time since we are collecting data and this may take up to a week?

Sure, on our side the code won't change. I still have to test the query and then write the oozie job. It will take me a couple days.

Change 425281 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add job and query for page previews aggregation

https://gerrit.wikimedia.org/r/425281

@mforns the underscores are riding the train. Should be live Thursday.

@Jdlrobson thanks a lot!
Aggregation of VirtualPageViews into wmf.virtualpageview_hourly is tested and in CR; it should be working in production soon.

Change 425281 merged by Milimetric:
[analytics/refinery@master] Add job and query for page previews aggregation

https://gerrit.wikimedia.org/r/425281

Change 427722 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Correct hardcoded ADD JAR in virtualpageview_hourly query

https://gerrit.wikimedia.org/r/427722

Change 427722 merged by Milimetric:
[analytics/refinery@master] Correct hardcoded ADD JAR in virtualpageview_hourly query

https://gerrit.wikimedia.org/r/427722

mforns moved this task from In Code Review to Done on the Analytics-Kanban board. Fri, Apr 20, 4:01 PM
Nuria set the point value for this task to 8. Fri, Apr 20, 4:04 PM
Tbayer added a comment. Tue, May 1, 1:45 AM

@mforns How far are we on this? I see we have a table already that appears to be updated continuously (great!),[1] but it still seems to be lacking the information about the source page[2].

[1]
SELECT year, month, day, CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')) AS date,
SUM(view_count) AS previews_seen
FROM wmf.virtualpageview_hourly 
WHERE year = 2018
GROUP BY year, month, day
ORDER BY year, month, day LIMIT 10000;

year	month	day	date	previews_seen
2018	3	14	2018-03-14	203696
2018	3	15	2018-03-15	234867
2018	3	16	2018-03-16	245158
2018	3	17	2018-03-17	252319
2018	3	18	2018-03-18	298695
2018	3	19	2018-03-19	272950
2018	3	20	2018-03-20	264871
2018	3	21	2018-03-21	253392
2018	3	22	2018-03-22	2537790
2018	3	23	2018-03-23	12515531
2018	3	24	2018-03-24	11563951
2018	3	25	2018-03-25	12685168
2018	3	26	2018-03-26	13616315
2018	3	27	2018-03-27	13367097
2018	3	28	2018-03-28	17450024
2018	3	29	2018-03-29	16641374
2018	3	30	2018-03-30	15617317
2018	3	31	2018-03-31	14241821
2018	4	1	2018-04-01	15409279
2018	4	2	2018-04-02	17035765
2018	4	3	2018-04-03	17774771
2018	4	4	2018-04-04	28022449
2018	4	5	2018-04-05	27538249
2018	4	6	2018-04-06	28812237
2018	4	7	2018-04-07	25186370
2018	4	8	2018-04-08	27987938
2018	4	9	2018-04-09	32897373
2018	4	10	2018-04-10	33029987
2018	4	11	2018-04-11	32601631
2018	4	12	2018-04-12	35816500
2018	4	13	2018-04-13	33176392
2018	4	14	2018-04-14	27855916
2018	4	15	2018-04-15	32119020
2018	4	16	2018-04-16	38358023
2018	4	17	2018-04-17	47323312
2018	4	18	2018-04-18	64905208
2018	4	19	2018-04-19	63390029
2018	4	20	2018-04-20	58486445
2018	4	21	2018-04-21	47725016
2018	4	22	2018-04-22	54571985
2018	4	23	2018-04-23	69848241
2018	4	24	2018-04-24	69405197
2018	4	25	2018-04-25	67058807
2018	4	26	2018-04-26	65754591
2018	4	27	2018-04-27	59322710
2018	4	28	2018-04-28	47761966
2018	4	29	2018-04-29	52289569
2018	4	30	2018-04-30	65646576
48 rows selected (216.386 seconds)
[2]
DESCRIBE wmf.virtualpageview_hourly;

col_name	data_type	comment
project	string	Project name from hostname
language_variant	string	Language variant from path (not set if present in project name)
page_title	string	Page Title from EventLogging event (canonical)
access_method	string	Always desktop (virtualpageviews are a desktop only feature for now)
agent_type	string	Agent accessing the pages, can be spider or user
referer_class	string	Always internal (virtualpageviews are always shown in wiki pages)
continent	string	Continent of the accessing agents (maxmind GeoIP database)
country_code	string	Country iso code of the accessing agents (maxmind GeoIP database)
country	string	Country (text) of the accessing agents (maxmind GeoIP database)
subdivision	string	Subdivision of the accessing agents (maxmind GeoIP database)
city	string	City iso code of the accessing agents (maxmind GeoIP database)
user_agent_map	map<string,string>	User-agent map with device_family, browser_family, browser_major, os_family, os_major, os_minor and wmf_app_version keys and associated values
record_version	string	Keeps track of changes in the table content definition - https://wikitech.wikimedia.org/wiki/Analytics/Data/virtualpageview_hourly
view_count	bigint	Number of virtualpageviews of the corresponding bucket
page_id	bigint	MediaWiki page_id
namespace_id	int	MediaWiki namespace_id
year	int	Unpadded year
month	int	Unpadded month
day	int	Unpadded day
hour	int	Unpadded hour
	NULL	NULL
# Partition Information	NULL	NULL
# col_name            	data_type           	comment             
	NULL	NULL
year	int	Unpadded year
month	int	Unpadded month
day	int	Unpadded day
hour	int	Unpadded hour
28 rows selected (0.17 seconds)
Nuria added a comment. Tue, May 1, 2:07 AM

Source_page is available on the event data:

hive (event)> select event.source_url, event.source_title from VirtualPageView where year=2018 and month=04 and day=29 and hour=1 limit 10;

The aggregated data mimics pageview_hourly and thus it does not have a source page.

Source_page is available on the event data:

hive (event)> select event.source_url, event.source_title from VirtualPageView where year=2018 and month=04 and day=29 and hour=1 limit 10;

The aggregated data mimics pageview_hourly and thus it does not have a source page.

Instead of creatively reinterpreting this task, can we please just read it as written? The three missing fields (source_page_title, source_page_id, and source_namespace_id) have been listed there ever since it was filed almost three months ago, and were repeatedly referred to during the implementation discussion (e.g. T186728#3976447 etc).

Nuria added a comment. Tue, May 1, 3:00 PM

Ah, sorry, I missed that you asked for this earlier. We shall talk internally about whether we can retain those fields for as long as we retain pageview_hourly data, and thus whether they can be in the same table; we do not retain referrer info anywhere else at this level of detail, and it is likely to make the issue of session reconstruction even more pronounced: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly/Identity_reconstruction_analysis

Nuria added a comment. Tue, May 1, 3:42 PM

We will be doing the work of adding these fields and adding a work item to quantify the impact into identity reconstruction.

Tbayer added a comment. Tue, May 1, 6:46 PM

Thanks! And sure, the future pageview sanitizing project should apply here too (and e.g. I don't think we need to preserve city-level geolocation permanently). On the other hand, as a reminder, we already make some link usage data with (source, target) pairs entirely public, in the form of the clickstream datasets.

Nuria added a comment. Tue, May 1, 7:13 PM

On the other hand, as a reminder, we already make some link usage data with (source, target)

In aggregate yes, not distinguishably per user. The privacy risk is quite different. The data we are talking about here will be very granular (city, UA, source page, target page); the level of aggregation is a lot smaller, and it is easy to reconstruct a session, which is a concern given that our privacy retention guidelines specify we do not retain sessions beyond 90 days.

It is not easy to reconstruct a session from the clickstream dataset, not even in conjunction with other public datasets available.

mforns added a comment. Thu, May 3, 3:14 PM

@Tbayer

Instead of creatively reinterpreting this task, can we please just read it as written?

It was my fault that those fields didn't make it to wmf.virtualpageview_hourly.
I failed to see those in the task description, I apologize for that, and will prioritize this fix.

On the other hand, your sentence struck me as passive aggressive.
I truly admire you for your work, and have never disrespected you.
I ask you to respect me as well, even when I fail.

Change 430889 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add source page fields to wmf.virtualpageview_hourly

https://gerrit.wikimedia.org/r/430889

@Tbayer
I will regenerate the whole table from the beginning of the dataset, so that the new fields are present throughout.
Is it OK to delete the current data set? The recomputation will take 3-4 days.

Also, I killed the previous oozie coordinator, so current data set will not be refined any more.

mforns added a comment. Fri, May 4, 5:30 PM

Hey, no need to delete the current data set in advance. The data set with the new fields is being recomputed separately, and then we'll replace the old data with the new one.

Nuria added a comment. Mon, May 7, 4:43 PM

I can see how the addition of source page, for wikis with low traffic, will make this aggregation effectively no aggregation at all, and how we will be retaining basically raw data, which we cannot do according to our data retention guidelines. Let's quantify that: https://phabricator.wikimedia.org/T194058

mforns moved this task from Done to In Progress on the Analytics-Kanban board. Tue, May 8, 6:16 PM
mforns added a comment. Tue, May 8, 6:25 PM

The data set is regenerating into mforns.virtualpageview_hourly.
From my manual checks, it looks good.
It will take around one more day to finish.
Then I will move it to wmf.virtualpageview_hourly.

Change 430889 merged by Mforns:
[analytics/refinery@master] Add source page fields to wmf.virtualpageview_hourly

https://gerrit.wikimedia.org/r/430889

The data set with the source fields is ready, accessible in hive: wmf.virtualpageview_hourly.
It is updated with new events every hour. There will be a lag of a couple of hours, due to EventLogging import, refinement and aggregation into the final table.
I vetted the data and it looks good, but please review!
Cheers

mforns moved this task from Ready to Deploy to Done on the Analytics-Kanban board. Fri, May 11, 8:05 PM

Thanks @mforns, also for keeping the existing data up earlier while the fix was implemented (I was able to use it for our quarterly check-in deck this week). Will take a look at the new version soon.

I don't know if this is the best venue to discuss meta/communications issues, but since you brought them up here:

@Tbayer

Instead of creatively reinterpreting this task, can we please just read it as written?

It was my fault that those fields didn't make it to wmf.virtualpageview_hourly.
I failed to see those in the task description, I apologize for that, and will prioritize this fix.

On the other hand, your sentence struck me as passive aggressive.
I truly admire you for your work, and have never disrespected you.
I ask you to respect me as well, even when I fail.

No disrespect was intended, such oversights happen to all of us from time to time.

It seems that there was a serious misunderstanding about the topic and addressee of this comment, which is quoted out of context here. From T186728#4170881 it should have been evident that it was not directed at you and not about that oversight.
Instead, it was responding to @Nuria's blunt dismissal of the issue that I had raised: "The aggregated data mimics pageview_hourly and thus it does not have a source page". This is what I had called a creative reinterpretation of the task, which had included source pages from the very beginning.

And since we are talking about each other's feelings, I'd also like to note that I found this reaction by @Nuria extremely disrespectful myself. I felt it dismissed the problem I had raised, apparently without even having afforded the courtesy of reviewing what my statements were referring to, and that it was lecturing me about the supposedly best way to obtain this data ("Source_page is available on the event data") even though we had had detailed discussions not long ago about why we consider EL not the best way to store and access this data, and even though she could have safely assumed that I was aware of the existence of that field in that EL schema, having worked on it extensively myself. I felt I was being treated condescendingly, saw our work and data needs not being taken seriously, and also (considering that this was apparently a very quickly written reply that forced me to spend more time reiterating and explaining the issue in the subsequent comment you quoted) got the impression that my time was considered less valuable than others'.
I do want to give @Nuria credit for subsequently acknowledging this momentary misrepresentation of the task and apologizing ("Ah, sorry I missed that you asked for this earlier"), and to all of you for correcting the missing fields swiftly.

And if you would like to instead review my initial comment T186728#4170212 that was indeed directed at you, @mforns, I hope you will find traces of my efforts to word it in a nonconfrontational, positive, open-ended way (despite my initial frustration about this oversight - which, again, I'm not holding against you; I have surely made similar mistakes myself in the past).

@Tbayer

I understand your position now, thanks.

I believe it is natural that there's a tension between your team (speaking more about the data analysts here) and ours, because your team is on several occasions dependent on ours, and our team does not have the bandwidth that you would like, considering that there are demands from other teams and from the natural growth of the data and tools we're maintaining. On the other hand, we try to be stewards of the privacy policy and data retention guidelines, which in practice ends up limiting the flexibility of your data and thus adding to your work when performing data analysis. So yeah, I understand your frustration.

Please understand, though, that we in Analytics are trying our best on our side as well. And in view of this situation, where conflicts already exist between some of us, aggressive comments do not usually help reach results, but rather make the conflicts bigger.

> @Tbayer
>
> I understand your position now, thanks.
>
> I believe it is natural that there's a tension between your team (speaking more about the data analysts here) and ours, because your team is on several occasions dependent on ours, and our team does not have the bandwidth that you would like, considering that there are demands from other teams and from the natural growth of the data and tools we're maintaining. On the other hand, we try to be stewards of the privacy policy and data retention guidelines, which in practice ends up limiting the flexibility of your data and thus adding to your work when performing data analysis. So yeah, I understand your frustration.

I am not sure what privacy and bandwidth limitations had to do with this regrettable communication issue. Sure, those are necessary and sometimes difficult topics to discuss. But any questions related to those two had already been resolved at that point: we had agreed on what data would be stored in the aggregate table and who would implement the aggregation (again, thanks for your work on this!). Rather, the frustration on my side was about things like people making strong but erroneous statements about what our team's data needs supposedly were and what this task supposedly consisted in, instead of simply acknowledging and fixing the clear oversight that had been pointed out. And on your side, I understand that much of the frustration was about seeing yourself accused of deliberately diverging from the task as written when you implemented the aggregation. Again, that was not my intention, and I had thought that this was clear in T186728#4170881, but perhaps there is something I missed, and I'm interested in what I could have done to avoid that misunderstanding.

> Please understand, though, that we in Analytics are trying our best on our side as well. And in view of this situation, where conflicts already exist between some of us, aggressive comments do not usually help reach results, but rather make the conflicts bigger.

Agreed. While we seem to have different a priori views of what counts as aggressive (I would have regarded the wording at the center of your concerns, "creatively reinterpreting this task", as perhaps unnecessarily flippant, but not as an attack), the perception of the recipient and how it makes them feel is important, and I will try to be more careful about this in our future discussions. In turn, I would still love to see improvements, or at least acknowledgment, regarding the IMHO problematic communication patterns from your team that I tried to describe in the example above. But maybe we should take this conversation offline now.