Create schema for landing page views
Add EventLogging call to FundraiserLandingPage
Quick sanity check:
The approach we've been looking at so far is to send an EventLogging event from the client for every pageview of a donation form on Donate Wiki. If I understand correctly, the advantage of doing this is that it will facilitate the ingress of data about the pageviews via a separate Kafka topic, so we don't have to filter the entire firehose of all web requests.
This would also essentially send back to our servers information that is already on the URL and fully available server-side. So it's an extra round trip from the client back to our infrastructure, just to get the data in the right place in our own infrastructure.
Somehow, it doesn't quite feel right... So, before moving forward, I thought we might circle back and make sure it's really our best option... Apologies for the bother! Thanks much!!!!
Maybe you can add descriptions to the schema fields? That way it is clear what things like "form-template" stand for.
Otherwise the schema looks fine; it is really up to the FR folks to decide which dimensions to track.
Thanks!!! Pretty complex filtering in there--hard to tell just from the code what's current and what's cruft!
As I understand it, we're no longer interested in data from wikimediafoundation.org. So, of the top-level regex used by LoadLPImpressions.py to parse lines in the log files, we'd only be interested in the last two.
Of these, only one parses out "title" (that is, the page used to create the forms on Donatewiki). The other matches only pageviews of Special:LandingPage. (The article for any previous wiki page that was viewed is not in the data.) Either the Donatewiki article or the params used by Special:LandingPage to create the form eventually end up in the landingpage field in the landingpageimpression_raw table.
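To make the two cases concrete, here is a rough sketch of how a single value for the landingpage field might be derived from a pageview URL. This is not the logic in LoadLPImpressions.py; the function name `landingpage_from_url`, the `landing_page` URL param, and the example URLs are assumptions made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs, unquote


def landingpage_from_url(url):
    """Return a value for the landingpage field (hypothetical helper)."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)

    # The title can appear in the path (/wiki/<Title>) or as ?title=<Title>.
    if parts.path.startswith("/wiki/"):
        title = unquote(parts.path[len("/wiki/"):])
    else:
        title = params.get("title", [""])[0]

    if title == "Special:LandingPage":
        # Assumption: the form is selected by a "landing_page" URL param.
        return params.get("landing_page", [""])[0]

    # Otherwise the Donatewiki article itself is the landing page.
    return title
```

With this sketch, a plain article view yields the article title, while a Special:LandingPage view yields whatever the (assumed) param carries.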
As per discussions in Hangouts, we'll log all pageviews on Donatewiki via EventLogging. This should ensure the greatest possible equivalence with existing data. (At a later date, we could change logging/filtering to remove Donatewiki pages other than Special:LandingPage as needed--apparently they're unused, even though random pageviews of them still make their way into the logs.)
Also, looks like we can call country and language required.
OK! Will do... :)
I think it will be of use to look at this schema: https://meta.wikimedia.org/wiki/Schema:VirtualPageView. You can probably reuse most fields and even some of the instrumenting code. Seems like you would want to add the campaign id to your schema, but other than that the fields should be pretty similar.
Also, it seems like you are using the schema for two types of information:
- basic pageview fields + campaign
- form field info
Maybe it is worth decoupling those two pieces? If you do that, and the first one follows a schema that just sends data about pageviews, the data could be aggregated more easily.
Thanks so much for the suggestions!!! Interesting stuff... In this case we're trying to copy as closely as possible the fields currently used by the Python script that ingresses data from these specific pageviews into a Fundraising database. So, for now, it does make sense to keep all this together. (At this stage, it's important to keep the workflow of the users of this data intact. Later on, I do hope we can refactor to get some even better tools up and running...)
¡¡OK!! Thx :)
Yep, intentional ;P For anyone interested (/me hides under table) here they are in the legacy Python script. (May be taking this a bit too far, but the idea has been to be very incremental, so, jiggle things around as little as possible in that script and in the data format for now, and refactor on a longer timeline once the new pipeline is fully proven...) Thanks!!!
After staring at the code longer than is healthy, I decided to make those fields not required in the schema, because the script actually ensures those variables have values a little earlier on (so the conditional on line 360 is partly dead code).
Regarding language, let's condense language and uselang URL params into a single schema property, since the Python script doesn't care which URL param it comes from, and having just one such param in the schema will keep it a bit simpler.
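A minimal sketch of what condensing the two URL params could look like, in Python for easy comparison with the legacy script. The helper name `language_from_url` and the precedence (language first, then uselang) are assumptions; the thread does not say which param should win when both are present.

```python
from urllib.parse import urlsplit, parse_qs


def language_from_url(url, default=None):
    """Collapse the language and uselang URL params into one value."""
    # parse_qs drops params with blank values by default.
    params = parse_qs(urlsplit(url).query)

    # Assumption: "language" takes precedence over "uselang".
    for name in ("language", "uselang"):
        values = params.get(name)
        if values:
            return values[0]
    return default
```

For example, `language_from_url("https://donate.wikimedia.org/?uselang=fr&country=FR")` returns `"fr"`, and the event would carry that value in the single language property.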
Description fields on the schema coming soooooon...
I'd really recommend not using a hyphen in a schema field name. Those fields are going to be directly mapped to SQL table fields (Hive, MySQL, etc.), and you'll either run into errors, or need to remember to always backtick quote them when querying them. Choosing names in schemas is really important! You aren't allowed to make backwards incompatible changes later, so you'll be stuck with what you choose forever!
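One way to catch this before a schema is published is a quick check of the proposed field names. The sketch below is only an illustration: the pattern (lowercase letters, digits and underscores, starting with a letter) is an assumed conservative rule for names that need no quoting in SQL, not a rule taken from the guidelines, and both helper names are hypothetical.

```python
import re

# Assumed rule: names that can be used unquoted in typical SQL dialects.
SAFE_NAME = re.compile(r"[a-z][a-z0-9_]*")


def unsafe_field_names(names):
    """Return the field names that would need backtick quoting."""
    return [name for name in names if not SAFE_NAME.fullmatch(name)]


def suggest_name(name):
    """Suggest a replacement by swapping hyphens for underscores."""
    return name.replace("-", "_")
```

For instance, `unsafe_field_names(["form-template", "country", "language"])` flags only `"form-template"`, and `suggest_name("form-template")` gives `"form_template"`.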
BTW, just in case you haven't seen it: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines