Page MenuHomePhabricator

Instrument and collect data via CitationUsage schema
Closed, ResolvedPublic

Description

This is the follow-up task on T188461 where we finalize the schema for data collection.

  • [Baha] Instrument the schema
  • [Miriam or CLs] Notify the community T197113
  • [Baha] Start the data collection
  • [Miriam] Give a heads up to Tiziano and Michele to make sure they're ready to do early testing and sanity checking of the data.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 440557 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable logging for Schema:CitationUsage

https://gerrit.wikimedia.org/r/440557

Mentioned in SAL (#wikimedia-operations) [2018-06-25T13:12:59Z] <hashar@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Enable logging for Schema:CitationUsage at 1% | T191086 (duration: 00m 57s)

Config change to collect at 1% sampling rate has been deployed. The instrumentation code will go live on Thursday, at which point we should be able to see some data coming in.

Nuria added a comment.Jun 25 2018, 3:35 PM

Did you tested this in beta labs? Please do so if you have not, here are some docs: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/TestingOnBetaCluster

Change 441901 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/mediawiki-config@master] Enable logging for Schema:CitationUsage

https://gerrit.wikimedia.org/r/441901

Change 441901 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable logging for Schema:CitationUsage

https://gerrit.wikimedia.org/r/441901

Mentioned in SAL (#wikimedia-operations) [2018-06-25T18:13:31Z] <thcipriani@deploy1001> Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:441901|Enable logging for Schema:CitationUsage]] T191086 (noop beta-only sync) (duration: 00m 57s)

Did you tested this in beta labs? Please do so if you have not, here are some docs: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/TestingOnBetaCluster

I see some events are being sent (others are not because of the difference between the beta cluster and the production environment), but I'm yet to find those events in the backend. I've checked all-events.log and client-side-events.log files. I've also checked the mysql database, but nothing's coming through. Any ideas why?

At 1% sampling rate we got 1254 events in one hour.

SELECT count(*) FROM event.citationusage WHERE year=2018 AND month=6 AND day=28 AND hour=22;

We'll be gathering data un-sampled in an hour.

Change 440867 abandoned by Bmansurov:
Increase Schema:CitationUsage sampling rate to 15%

Reason:
Data looks good at 1%, no need for this patch.

https://gerrit.wikimedia.org/r/440867

Change 441567 merged by jenkins-bot:
[operations/mediawiki-config@master] Increase Schema:CitationUsage sampling rate to 100%

https://gerrit.wikimedia.org/r/441567

Mentioned in SAL (#wikimedia-operations) [2018-06-28T23:18:20Z] <catrope@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Increase Schema:CitationUsage sampling rate to 100% (T191086) (duration: 00m 51s)

bmansurov added a comment.EditedJun 28 2018, 11:22 PM

@leila we started collecting data at 100% sampling rate. I've prepared a patch to stop data collection on 7/5.

@leila, I did some cursory validation of data and I think it looks good. I'm attaching some sample data in case you or @tizianopiccardi wants to take a look too.

Thanks @bmansurov, I'll take a look at it later!
@leila: would it be possible to have a chunk of the 100% sampling rate already accessible on the cluster from next Monday? @tizianopiccardi is spending the whole week at Stanford with me, so we can start to work on it together.

@Pirroh data is available in hive, it's in the event.citationusage table.

leila added a comment.Jun 29 2018, 7:59 PM

@bmansurov thanks.

@Pirroh We'll reach out to you re the data.

Change 441568 merged by jenkins-bot:
[operations/mediawiki-config@master] Stop collecting data for Schema:CitationUsage

https://gerrit.wikimedia.org/r/441568

leila renamed this task from Instrument, collect data, and perform the first round of analysis on click-through data on citations/footnotes to Instrument and collect data via CitationUsage schema.Jul 12 2018, 4:15 PM
leila reassigned this task from leila to bmansurov.
leila updated the task description. (Show Details)
leila moved this task from In Progress to Done (current quarter) on the Research board.
DarTar closed this task as Resolved.Jul 28 2018, 2:11 AM
DarTar edited projects, added Research-Archive; removed Research.
DarTar moved this task from Default to Q4-FY18 on the Research-Archive board.Jul 28 2018, 2:14 AM

Change 454854 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/mediawiki-config@master] Enable logging for Schema:CitationUsage at 100%

https://gerrit.wikimedia.org/r/454854

@Nuria just a heads up we're deploying the CitationUsage schema again at 100%. This time a pageLoad event will also be logged. This will potentially increase the data we're collecting many times over.

Change 454854 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable logging for Schema:CitationUsage at 100%

https://gerrit.wikimedia.org/r/454854

Mentioned in SAL (#wikimedia-operations) [2018-09-05T16:23:41Z] <thcipriani@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:454854|Enable logging for Schema:CitationUsage at 100%]] T191086 (duration: 01m 23s)

Change 458209 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/mediawiki-config@master] Disable logging for Schema:CitationUsage

https://gerrit.wikimedia.org/r/458209

elukey added a subscriber: elukey.Sep 5 2018, 4:48 PM

Hello! We just received some alarms for Eventlogging's throughput, the new Citation Usage schema seems generating 2.2k events/second:

Is it intended? It seems a bit high at the moment, and afaics we went from 1% to 100% in one go..

Change 458211 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Blacklist eventlogging CitationUsage from MySQL import

https://gerrit.wikimedia.org/r/458211

Hi @elukey, yes that was planned. See T191086#4527351.

Change 458211 merged by Ottomata:
[operations/puppet@production] Blacklist eventlogging CitationUsage from MySQL import

https://gerrit.wikimedia.org/r/458211

Krinkle added a subscriber: Krinkle.Sep 5 2018, 4:54 PM

There's also about 20 citationUsage events per minute dropped client-side before sending (about 1% of potential events), due to urlSize constraints.

https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?from=1536163200000&to=1536166800000&orgId=1&var-schema=CitationUsage

elukey added a comment.Sep 5 2018, 4:54 PM

Hi @elukey, yes that was planned. See T191086#4527351.

Hi :) Sure but as far as I can see there is no figure in there, you are currently generating 2200k events/second, that is way above the most frequent event before CitationUsage (VirtualPageView, ~800 events/second). Can we dial down the percentage to something like 30% now?

Mentioned in SAL (#wikimedia-analytics) [2018-09-05T16:56:21Z] <ottomata> restarting eventlogging processors to blacklist CitationUsage - T191086

@elukey I don't think we can reduce the sampling rate as that would introduce a research bias. @leila what do you think?

Change 458209 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable logging for Schema:CitationUsage

https://gerrit.wikimedia.org/r/458209

Mentioned in SAL (#wikimedia-operations) [2018-09-05T17:14:07Z] <krinkle@deploy1001> Synchronized wmf-config/InitialiseSettings.php: I73d2ce30 - T191086 (duration: 00m 57s)

There's also about 20 citationUsage events per minute dropped client-side before sending (about 1% of potential events), due to urlSize constraints.
https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?from=1536163200000&to=1536166800000&orgId=1&var-schema=CitationUsage

@Krinkle what do you suggest we do to not lose any data in this case?

@bmansurov It depends on the specifics of how the data for this schema is created. I'm not familiar with this code, but Reading devs will be. But the shortest version is: Send less data.

Given it affected 1% (and not all), it's probably to do with one of the variable length properties in the schema. Technically all strings are variable length, but the problematic ones are the ones holding values that are wholly dependant on user input. From a quick glance at Schema:CitationUsage, I see the following ones stand out: page_title, link_text, link_url, referrer. It would also be good to confirm that the performance timestamps are indeed truncated to integers without fractions, given those fractions can be quite long.

Then, try to reproduce it locally and see how easy or hard it is to produce circumstances with title/link text-url/referer that together make the event data larger, and see if some of them are not needed and/or can be truncated or omitted individually. Currently the result is that the event will be discarded in its entirety and never sent to EventLogging. For example page_title can be upto 255 chars and seems potentially redundant with page_id.

Shorter versions of link_url and referer could be hostnames, but I don't know what is vital to the research and what not. Link text is also a problem (practically unbounded), and might also be good to trim in some way, or to replace with null if longer than a certain size.

The end result should be an event object that is consistently within the size constraint.

Krinkle reopened this task as Open.Sep 6 2018, 2:17 AM

@Krinkle thanks! Looking at the event logging code I noticed that we maybe unnecessarily making URLs longer ourselves. When using the beacon API we could let the browser POST the data, but we're instead appending data to the URL in all cases, regardless of whether the browser supports the API or not. Given the schema also depends on performance.timing I suspect the these browsers also support the beacon API. I think we maybe able to avoid data loss if we let browsers that support the beacon API send data in the body of the request as opposed to the URL. What do you think?

Nuria added a comment.Sep 6 2018, 2:29 PM

@bmansurov we get data via varnish that does not see post body, only url.

Change 458561 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/extensions/WikimediaEvents@master] Introduce Schema:CitationUsagePageLoad

https://gerrit.wikimedia.org/r/458561

Change 458561 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Introduce Schema:CitationUsagePageLoad

https://gerrit.wikimedia.org/r/458561

Change 458610 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/mediawiki-config@master] Enable logging for CitationUsage and CitationUsagePageLoad

https://gerrit.wikimedia.org/r/458610

Change 458835 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/extensions/WikimediaEvents@master] CitationUsage: limit some parameter lengths

https://gerrit.wikimedia.org/r/458835

@bmansurov @Krinkle

There's also about 20 citationUsage events per minute dropped client-side before sending (about 1% of potential events), due to urlSize constraints.

I might have misunderstood this, but I think 20 events per minute should be way less than 1%, no? We are receiving 2200 events per second. This makes 132K per minute. 20/132K=0.015%

@Krinkle wrote:

There's also about 20 citationUsage events per minute dropped client-side before sending (about 1% of potential events), due to urlSize constraints.

I might have misunderstood this, but I think 20 events per minute should be way less than 1%, no? We are receiving 2200 events per second.

Indeed. On the dashboard, I assumed both panels used the same unit, where ~20 of ~2000 makes 1%, but indeed, one is per-second and the other per-minute. Thanks :)

bmansurov added a comment.EditedSep 17 2018, 4:10 PM

Analytics heads up that we're deploying CitationUsage at 100%, and CitationUsagePageLoad at 10% (per our conversation with @Nuria on IRC) in about two hours. This should yield in about 150 req/sec and 250 req/sec respectively. Tomorrow if these numbers are correct, we'd like to increase the 10% to 33.3%, which will increase 250 req/sec to around 800 req/sec.

Change 458610 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable logging for CitationUsage and CitationUsagePageLoad

https://gerrit.wikimedia.org/r/458610

Apparently, there was no train last week so our changes didn't make it to production. I'm delaying data collection until Thursday.

Change 460945 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/extensions/WikimediaEvents@master] Revert "CitationUsage: limit some parameter lengths"

https://gerrit.wikimedia.org/r/460945

Change 460945 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Revert "CitationUsage: limit some parameter lengths"

https://gerrit.wikimedia.org/r/460945

Change 461117 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/mediawiki-config@master] Enable logging for CitationUsage and CitationUsagePageLoad

https://gerrit.wikimedia.org/r/461117

Change 461117 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable logging for CitationUsage and CitationUsagePageLoad

https://gerrit.wikimedia.org/r/461117

Mentioned in SAL (#wikimedia-operations) [2018-09-20T18:21:52Z] <thcipriani@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:461117|Enable logging for CitationUsage and CitationUsagePageLoad]] T191086 (duration: 00m 51s)

Change 462507 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/mediawiki-config@master] Increase Schema:CitationUsagePageLoad population size

https://gerrit.wikimedia.org/r/462507

We're increasing the sampling rate for CitationUsagePageLoad from 10% to 33.3% in a few hours.

Change 462507 merged by jenkins-bot:
[operations/mediawiki-config@master] Increase Schema:CitationUsagePageLoad sampling rate

https://gerrit.wikimedia.org/r/462507

Mentioned in SAL (#wikimedia-operations) [2018-09-24T18:20:59Z] <gjg@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT [config] {{gerrit|462507}} Increase Schema:CitationUsagePageLoad population size ({{phab|T191086}}) (duration: 00m 50s)

Change 465418 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/mediawiki-config@master] Stop collecting data CitaitonUsage and CitationUsagePageLoad

https://gerrit.wikimedia.org/r/465418

debt added a subscriber: debt.

Moving this off the sprint board for the Search Platform into 'watching' :) Let us know if you need any more help from us!

Change 465418 merged by jenkins-bot:
[operations/mediawiki-config@master] Stop collecting data CitaitonUsage and CitationUsagePageLoad

https://gerrit.wikimedia.org/r/465418

Mentioned in SAL (#wikimedia-operations) [2018-10-25T11:08:58Z] <zfilipin@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:465418|Stop collecting data CitaitonUsage and CitationUsagePageLoad (T191086 T203253)]] (duration: 00m 57s)

bmansurov closed this task as Resolved.Dec 14 2018, 3:27 PM
bmansurov edited projects, added Research; removed Patch-For-Review.

Change 492344 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/mediawiki-config@master] Enable logging for CitationUsage and CitationUsagePageLoad

https://gerrit.wikimedia.org/r/492344

Ignore the above patch. Wrong bug number.