Tbayer (Tilman Bayer)
User

User Details

User Since
Oct 20 2014, 11:21 PM (216 w, 3 d)
Availability
Available
IRC Nick
HaeB
LDAP User
Unknown
MediaWiki User
Tbayer (WMF)

Recent Activity

Today

Tbayer added a comment to T211827: Request: Top articles of 2018 on all Wikipedias.

I replied at T183903#4822644 but moving discussion here.

I ran this query, similar to the suggested one:

Fri, Dec 14, 2:06 AM · Product-Analytics, Reading-analysis
Tbayer added a comment to T211195: [Spike 16hrs] Investigate opt-in audience and instrumentation.

##Summary

...

##Tracking anonymous users
To track user retention we need some way to identify which events come from the same users. We cannot use the client pageToken or sessionToken, as those stay in the browser only for the current pageview/session. When a user enables AMC, we could store some unique identifier in local storage, and then send that identifier with every opt-in/opt-out request (it will have to be passed to the server on the Special:MobileOptions page). But that value might be identifying. We don't want to assign any identifiers to users.

Just to avoid confusion, this sentence refers to *anonymous* editors (we do of course assign identifiers to logged-in users, namely their public user name and ID).

Instead, we can store the last AMC opt-in/opt-out date in local storage. When a user opts in for the first time, we send an event with lastActionDate=null and store the current date in local storage. Then on every opt-in/opt-out we send lastActionDate=localStorage.get('amc.lastactiondate') with the event, and then overwrite the local storage value amc.lastactiondate with the current date.
Each event will carry the current date and the date of the last action, which should allow us to track the chain of events (when a given browser opted in/opted out). Checking the retention rate for anon users is going to be difficult for the analyst (creating a query that takes dates into consideration), but it's possible.

It's not terribly difficult per se, assuming that every opt-out event comes with the date of the preceding opt-in. But the resulting data is going to be more brittle than for logged-in users, for example because we have no way to distinguish between retained anonymous users and those who lost their cookie/amc.lastactiondate value and opted in again with lastActionDate=null.
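
The lastActionDate scheme discussed above can be sketched as follows. This is a minimal Python simulation, not the actual client code: a plain dict stands in for the browser's local storage, and the function name `record_amc_action` and the event fields `action` and `date` are assumptions made for illustration. Only the key `amc.lastactiondate` and the field `lastActionDate` come from the discussion.

```python
from datetime import date
from typing import Optional

STORAGE_KEY = "amc.lastactiondate"  # key name taken from the proposal


def record_amc_action(storage: dict, action: str, today: date) -> dict:
    """Build the event for an opt-in/opt-out and update the stored date.

    `storage` is a dict standing in for the browser's local storage
    (an assumption for this sketch).
    """
    last: Optional[str] = storage.get(STORAGE_KEY)  # None on first opt-in
    event = {
        "action": action,            # hypothetical field name
        "date": today.isoformat(),   # hypothetical field name
        "lastActionDate": last,
    }
    # Overwrite the stored value so the next event can point back to this one.
    storage[STORAGE_KEY] = today.isoformat()
    return event


storage = {}
first = record_amc_action(storage, "opt-in", date(2018, 12, 1))
second = record_amc_action(storage, "opt-out", date(2018, 12, 10))

# Simulate a cleared local storage: the next opt-in carries
# lastActionDate=None again and cannot be told apart from a new user.
storage.clear()
third = record_amc_action(storage, "opt-in", date(2018, 12, 12))
```

The last three lines illustrate the brittleness noted above: once the stored value is lost, a returning browser looks identical to a first-time opt-in.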

Fri, Dec 14, 1:42 AM · Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), Spike, Advanced Mobile Contributions
Tbayer added a comment to T211195: [Spike 16hrs] Investigate opt-in audience and instrumentation.

...

Can events logged on the server AND on the client be tied together? For example, if I logged that a user visits the mobile options page on the server and makes a change on the client, can we recognize in the logs that both events are from the same user?

Yes, events can be logged on both sides, but I'm not sure whether it is possible to identify that both events (server and JS) come from the same user, as we try to make events non-identifying.

To clarify, the PrefUpdate schema does log the user ID (see documentation). (@Niedzielski , by "makes a change on the client", did you refer to making an edit to a page, or were you talking about a hypothetical new schema logging preference changes on the client side?)

Fri, Dec 14, 1:29 AM · Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), Spike, Advanced Mobile Contributions
Tbayer updated the task description for T211195: [Spike 16hrs] Investigate opt-in audience and instrumentation.
Fri, Dec 14, 1:09 AM · Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), Spike, Advanced Mobile Contributions
Tbayer updated the task description for T210660: [EPIC] AMC Metrics .
Fri, Dec 14, 1:07 AM · Epic, Advanced Mobile Contributions, Readers-Web-Backlog
Tbayer added a comment to T210012: Define how we vet code & data for ongoing, automated ingestion in Druid.

Related (but still quite expansible) documentation: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines

Fri, Dec 14, 12:45 AM · Product-Analytics

Yesterday

Tbayer added a comment to T211833: [BUG] User agent parsing error for MobileWikiAppSearch table.

Seems this is also affecting non-iOS EventLogging schemas, e.g. ReadingDepth and NavigationTiming.

Thu, Dec 13, 9:57 PM · Analytics-Kanban, Product-Analytics, Analytics
Tbayer moved T211843: Update Audiences page and Key Product Metrics deck with February 2019 Readers data from Triage to Blocked on the Product-Analytics board.
Thu, Dec 13, 1:12 AM · Product-Analytics
Tbayer created T211843: Update Audiences page and Key Product Metrics deck with February 2019 Readers data.
Thu, Dec 13, 1:12 AM · Product-Analytics
Tbayer moved T211842: Update Audiences page and Key Product Metrics deck with January 2019 Readers data from Triage to Blocked on the Product-Analytics board.
Thu, Dec 13, 1:11 AM · Product-Analytics
Tbayer added a project to T211842: Update Audiences page and Key Product Metrics deck with January 2019 Readers data: Product-Analytics.
Thu, Dec 13, 1:10 AM · Product-Analytics
Tbayer moved T211841: Update Audiences page and Key Product Metrics deck with December 2018 Readers data from Triage to Blocked on the Product-Analytics board.
Thu, Dec 13, 1:10 AM · Product-Analytics
Tbayer created T211842: Update Audiences page and Key Product Metrics deck with January 2019 Readers data.
Thu, Dec 13, 1:09 AM · Product-Analytics
Tbayer created T211841: Update Audiences page and Key Product Metrics deck with December 2018 Readers data.
Thu, Dec 13, 1:09 AM · Product-Analytics
Tbayer moved T211840: Update Audiences page and Key Product Metrics with November 2018 Readers data from Triage to Doing on the Product-Analytics board.
Thu, Dec 13, 1:08 AM · Product-Analytics
Tbayer created T211840: Update Audiences page and Key Product Metrics with November 2018 Readers data.
Thu, Dec 13, 1:08 AM · Product-Analytics

Wed, Dec 12

Tbayer awarded T45: Phabricator should suggest possible duplicates when creating a new task a 100 token.
Wed, Dec 12, 7:48 AM · Developer-Wishlist (2017), Upstream, Phabricator (Upstream), Wikimedia Phabricator RfC
Tbayer added a comment to T211195: [Spike 16hrs] Investigate opt-in audience and instrumentation.

@pmiazga and I discussed various aspects of this today, he is going to write up some things here, and I will follow up with other details. But to note one thing already as a direct followup on today's meeting:

Wed, Dec 12, 1:41 AM · Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), Spike, Advanced Mobile Contributions

Tue, Dec 11

Tbayer awarded T211606: As a user of Superset I would like it to be up-to-date so I'm not blocked by bugs that have already been fixed a Cup of Joe token.
Tue, Dec 11, 4:15 PM · Analytics, Product-Analytics

Mon, Dec 10

Tbayer added a comment to T211195: [Spike 16hrs] Investigate opt-in audience and instrumentation.

Is the opt-in/out status going to be stored in the user preferences (for logged-in users)? In that case we could first look at what the existing PrefUpdate schema can give us.

Mon, Dec 10, 8:28 PM · Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), Spike, Advanced Mobile Contributions
Tbayer updated the task description for T211195: [Spike 16hrs] Investigate opt-in audience and instrumentation.
Mon, Dec 10, 8:26 PM · Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), Spike, Advanced Mobile Contributions

Fri, Dec 7

Tbayer updated subscribers of T210687: Bug: can't make a YoY time series chart in Superset.

For some reason @JKatzWMF was more recently able to get Superset to make a YoY chart here: https://goo.gl/Vg88uv

Fri, Dec 7, 8:06 PM · Analytics-Kanban, Product-Analytics, Analytics
Tbayer closed T211142: Do readers use categories, or just editors? as Resolved.

Cool! I'll close this for now; might reopen it in case I get to look at 3. above later.

Fri, Dec 7, 6:31 AM · Product-Analytics
Tbayer added a comment to T211077: Investigate referrer class change on Chrome Mobile from September 13, 2018.

See also the observations in T195880 (note: "none" != "unknown")

Fri, Dec 7, 5:15 AM · Analytics, Product-Analytics
Tbayer added a comment to T195880: % of "none" referers seems too high.

See also T211077 (TLDR: it looks like a lot of formerly "unknown" referrers on Chrome Mobile are now, since around September 13, classified as "external (search engine)")

Fri, Dec 7, 5:15 AM · Readers-Web-Backlog, Analytics
Tbayer updated the task description for T195880: % of "none" referers seems too high.
Fri, Dec 7, 4:57 AM · Readers-Web-Backlog, Analytics
Tbayer added a comment to T154702: Fix broken referer categorization for visits from Safari browsers.

I added an entry about this to the log at https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly#Changes_and_known_problems_since_2015-06-16

Fri, Dec 7, 12:02 AM · Browser-Support-Apple-Safari, Upstream, Traffic, Operations

Thu, Dec 6

Tbayer awarded T144780: Translation Notification Bot sending the same message multiple times to every translator a Heartbreak token.
Thu, Dec 6, 10:25 PM · User-Nikerabbit, MediaWiki-extensions-TranslationNotifications
Tbayer moved T211142: Do readers use categories, or just editors? from Triage to Doing on the Product-Analytics board.
Thu, Dec 6, 9:21 PM · Product-Analytics

Wed, Dec 5

Tbayer updated the task description for T210660: [EPIC] AMC Metrics .
Wed, Dec 5, 6:22 PM · Epic, Advanced Mobile Contributions, Readers-Web-Backlog
Tbayer added a comment to T211142: Do readers use categories, or just editors?.

Here is a quick, partial answer for enwiki:

Wed, Dec 5, 9:43 AM · Product-Analytics
Tbayer added a comment to T205681: Metrics request on portal namespace usage.

@Tbayer Just curious if the analytics team had time to pull any data for this yet.

Thanks for the ping! I spent some time working on this a couple of weeks ago, but encountered an unexpected issue with the referer data, which gave rise to some questions about its validity in general (basically, an implausibly large number of referers are HTTP instead of HTTPS URLs), and I ran out of the allotted time while investigating this. I think I'll be able to get back to that and wrap this task up (possibly with somewhat less accurate results) by early next week.

Wed, Dec 5, 9:32 AM · Analytics, Product-Analytics
Tbayer added a comment to T209051: ReadingDepth schema is whitelisting both session ids and page ids.

For the record: decided with @ovasileva to remove the session IDs and keep the page IDs. I'll see to submit the patch soon.

Wed, Dec 5, 9:24 AM · Analytics

Tue, Dec 4

Tbayer added a comment to T209891: Analyze results of sameAs A/B test.

Thanks @Niedzielski and @GoranSMilovanovic! I ran a query based on that approach (the wikibase_item page property) for a few wikis, more out of curiosity (I guess @mpopov might incorporate a more thorough look at this in his analysis). It confirmed the assumption that the vast majority of Wikipedia articles have a Wikidata item.

Tue, Dec 4, 8:27 PM · Product-Analytics, SEO
Tbayer added a comment to T208457: AMC: Collect data on users who are switching from mobile to desktop.

What is the plan for measuring the impact of AMC on this metric?

Tue, Dec 4, 11:00 AM · Advanced Mobile Contributions, Patch-For-Review, MinervaNeue, Readers-Web-Backlog
Tbayer created T211077: Investigate referrer class change on Chrome Mobile from September 13, 2018.
Tue, Dec 4, 2:17 AM · Analytics, Product-Analytics

Mon, Dec 3

Tbayer added a comment to T169550: Final Vetting of Family Wide unique devices data .

@Tbayer: do you have some more comments related to vetting of this metric or is this the only one?

Mon, Dec 3, 4:06 PM · Analytics, Product-Analytics, Reading-analysis, Analytics-Kanban
Tbayer updated subscribers of T205458: Remove sessionId, pageId pairs from whitelist .

@HaeB quick ping on this so that it doesn't get buried :)

Mon, Dec 3, 3:57 PM · Analytics-Kanban, Analytics

Sat, Dec 1

Tbayer added a comment to T202594: wmfdata package can be installed but not imported.

Testing again after Neil's update:

It now detects the outdated matplotlib and appears to try resolve it using "kiwisolver", but unsuccessfully, resulting in the same error message for the import:

Actually, kiwisolver is one of matplotlib's dependencies, so pip is just resolving that first.

I'm not sure what's going on; pip says it has successfully installed matplotlib-3.0.2, which should definitely have PercentFormatter. @Tbayer, can you run pip show matplotlib to verify which version is installed?

Sat, Dec 1, 1:19 AM · Contributors-Analysis, Product-Analytics

Thu, Nov 29

Tbayer created P7869 Browsers with largest year-over-year pageview changes Q4&Q1 2018.
Thu, Nov 29, 6:30 PM
Tbayer awarded T210687: Bug: can't make a YoY time series chart in Superset a Hungry Hippo token.
Thu, Nov 29, 12:59 AM · Analytics-Kanban, Product-Analytics, Analytics
Tbayer added a comment to T210687: Bug: can't make a YoY time series chart in Superset.

For illustration: It might look like this chart (that I'm currently generating by hand in Google Sheets).

Thu, Nov 29, 12:59 AM · Analytics-Kanban, Product-Analytics, Analytics

Wed, Nov 28

Tbayer created P7860 Countries with largest year-over-year pageview changes Q4&Q1 2018.
Wed, Nov 28, 11:57 PM

Tue, Nov 27

Tbayer added a comment to T202594: wmfdata package can be installed but not imported.

Testing again after Neil's update:

Tue, Nov 27, 3:31 AM · Contributors-Analysis, Product-Analytics
Tbayer reopened T202594: wmfdata package can be installed but not imported as "Open".

Import still fails for me, but with a new error message:

Tue, Nov 27, 3:01 AM · Contributors-Analysis, Product-Analytics

Mon, Nov 26

Tbayer updated subscribers of T209891: Analyze results of sameAs A/B test.

Thanks @GoranSMilovanovic! It is indeed about mainspace pages only, but about those that have an associated Wikidata item (i.e. appear in the sitelinks of said item), rather than making use of its properties.
I started drafting a query myself using wb_items_per_site, but the result for enwiki looks implausibly low: https://quarry.wmflabs.org/query/31482 Do you happen to see what might be wrong with the query?

Mon, Nov 26, 11:39 AM · Product-Analytics, SEO

Thu, Nov 22

Tbayer updated subscribers of T209891: Analyze results of sameAs A/B test.

And to record something here from our earlier offline discussions:

Thu, Nov 22, 7:47 AM · Product-Analytics, SEO
Tbayer awarded T209891: Analyze results of sameAs A/B test a Mountain of Wealth token.
Thu, Nov 22, 1:47 AM · Product-Analytics, SEO
Tbayer added a comment to T209891: Analyze results of sameAs A/B test.

Besides determining whether there was a change, I think we should also try to assess its size (and sign ;)

Thu, Nov 22, 1:44 AM · Product-Analytics, SEO

Wed, Nov 21

Tbayer awarded T209422: Investigate the Sep 7 drop and the Oct 25 spike in iOS app's 7-day retention a Cup of Joe token.
Wed, Nov 21, 6:47 AM · Product-Analytics

Tue, Nov 20

Tbayer renamed T209598: Aggregate ReadingDepth data in a form suitable for interactive visualization from Aggregate ReadingDepth data for ingestion into Druid to Aggregate ReadingDepth data in a form suitable for interactive visualization.
Tue, Nov 20, 10:41 PM · Product-Analytics
Tbayer updated subscribers of T209598: Aggregate ReadingDepth data in a form suitable for interactive visualization.

@Nuria I thought the requirements from the user perspective were evident from the task, but to clarify it a bit more:

Tue, Nov 20, 10:40 PM · Product-Analytics
Tbayer added a comment to T209999: [EPIC] Headline property in Article schema should provide useful content.

What are the length recommendations for this? Is https://developers.google.com/search/docs/data-types/article about this? It says "Headlines should not exceed 110 characters." Aren't page previews extracts normally much longer?

Tue, Nov 20, 7:19 PM · Epic, Readers-Web-Backlog, SEO
Tbayer added a comment to T209051: ReadingDepth schema is whitelisting both session ids and page ids.

A handful of thoughts:

The current schema has page_title, but not page_id. We were able to recover page_id from this using the page_title and the timestamps. Isn't this also a violation of the policy?

Page title and ID contain largely the same information, so if we whitelist one of them, the other should be fine too (and vice versa - if one of them needs to be purged, the other should too).

I'm not sure that I'm clear on what makes sessionToken PII and not IP address.

IP addresses are PII (actually they are more sensitive than session tokens), and indeed the corresponding field is not contained in the whitelist for this schema.

Would it be OK to replace sessionToken with an ID of the previous page token? We could then perform any analysis that doesn't involve joining on sessionToken.

If you mean the page token of the immediately preceding pageview in the session, that probably wouldn't make a big difference privacy-wise, because the session could still be reconstructed.

Could a reasonable option be to generate the statistics we need from the pages, aggregate or add noise to make them non-identifying and then remove the page_id column?

I think we will want to remove the session IDs instead, as (IIRC) less of our data questions depend on them. But there too we could think about calculating and storing some of the session-dependent data in aggregated form.

Tue, Nov 20, 1:47 AM · Analytics
Tbayer added a comment to T117945: Add alarms for high volume of views to pages with replacement characters.

See also https://meta.wikimedia.org/wiki/Talk:Pageviews_Analysis#Topviews_bug_report:_%EF%BF%BD_character_displayed_instead_of_umlauts_in_Topviews_analysis

Tue, Nov 20, 1:23 AM · Analytics, Analytics-Data-Quality, Datasets-Webstatscollector, Language-Team
Tbayer added a comment to T182235: Consolidate, simplify and cleanup data collection relating to Special:MobileOptions.

The same numbers broken down by project:

wiki | all beta views /day | % beta | logged in beta views /day | logged in % beta
en.wikipedia596700.0465269176.9017
ar.wikipedia116700.2229356318.6706
ja.wikipedia73430.034445707.5832
es.wikipedia72030.033228798.0482
de.wikipedia48700.0317306210.2721
zh.wikipedia46960.07933709.8615
ru.wikipedia45050.03515926.6541
fr.wikipedia35040.025611143.755
fa.wikipedia29630.079111867.2568
it.wikipedia25150.020114414.03
id.wikipedia19860.04693997.0751
en.wiktionary18430.14778559.7201
bn.wikipedia14270.32823609.019
pt.wikipedia13750.01856856.3104
pl.wikipedia12620.02714034.5361
hi.wikipedia10450.069823312.4287
th.wikipedia7830.051433.6785
vi.wikipedia7000.05252918.0016
he.wikipedia6240.05625115.2011
ur.wikipedia6071.180839445.5642
Commons5750.05015746.7866
ko.wikipedia5650.04033888.4317
hu.wikipedia4850.053238.5521
m.mediawiki4791.41139416.31
nl.wikipedia4540.01752615.0125
uk.wikipedia4220.036628211.2484
sr.wikipedia3640.067830219.8593
az.wikipedia2900.060122314.3093
cs.wikipedia2470.02371094.609
sv.wikipedia2410.0129721.9715
fi.wikipedia2340.02031263.8207
el.wikipedia2330.03915011.4261
tr.wiktionary2110.435820560.9911
ar.wikisource2100.15794727.6897
simple.wikipedia2090.0558579.1175
ro.wikipedia2000.0261564.6218
ms.wikipedia1720.068423.0014
tr.wikipedia1720.0395857.0348
no.wikipedia1650.0262634.4547
tl.wikipedia1620.10128358.4247
my.wikipedia1541.80784113.4328
m.wikidata1320.0919677.2273
eu.wikipedia1300.771411734.7163
ca.wikipedia1200.0538434.7957
pt.wiktionary1130.270510380.0661
Meta-wiki1080.05871073.8568
mr.wikipedia1070.06754614.6885
Tue, Nov 20, 1:05 AM · MW-1.31-release-notes (WMF-deploy-2018-02-06 (1.31.0-wmf.20)), Readers-Web-Kanbanana-Board-Old, Patch-For-Review, Mobile-Web-Settings, Readers-Web-Backlog, MobileFrontend

Fri, Nov 16

mpopov awarded T208909: [Bug] Update old nonuniformly distributed page_random values a Manufacturing Defect? token.
Fri, Nov 16, 5:36 PM · MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), Patch-For-Review, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), DBA, MediaWiki-General-or-Unknown
Tbayer added a comment to T182235: Consolidate, simplify and cleanup data collection relating to Special:MobileOptions.

And the same for logged-in views:

Fri, Nov 16, 6:49 AM · MW-1.31-release-notes (WMF-deploy-2018-02-06 (1.31.0-wmf.20)), Readers-Web-Kanbanana-Board-Old, Patch-For-Review, Mobile-Web-Settings, Readers-Web-Backlog, MobileFrontend
Tbayer updated subscribers of T182235: Consolidate, simplify and cleanup data collection relating to Special:MobileOptions.

@alexhollender asked about the percentage of mobile beta pageviews, so I re-ran the calculation from above (T182235#3833702 ) , correcting the queries a bit (in particular restricting it to webrequests that are pageviews):

Fri, Nov 16, 5:53 AM · MW-1.31-release-notes (WMF-deploy-2018-02-06 (1.31.0-wmf.20)), Readers-Web-Kanbanana-Board-Old, Patch-For-Review, Mobile-Web-Settings, Readers-Web-Backlog, MobileFrontend
Tbayer added a comment to T209536: Hive query fails with local join.

Is this related to T206279 ?

Fri, Nov 16, 2:50 AM · Patch-For-Review, Product-Analytics, Analytics-Kanban, Analytics

Thu, Nov 15

Tbayer added a comment to T209598: Aggregate ReadingDepth data in a form suitable for interactive visualization.

@ovasileva and @Groceryheist , feel free to weigh in if there is anything missing or off.

Thu, Nov 15, 4:00 PM · Product-Analytics
Tbayer added a subtask for T205562: Ingest data aggregate ReadingDepth data into Druid : T209598: Aggregate ReadingDepth data in a form suitable for interactive visualization.
Thu, Nov 15, 3:59 PM · Readers-Web-Backlog (Tracking), Patch-For-Review, Analytics-Kanban, Analytics
Tbayer added a parent task for T209598: Aggregate ReadingDepth data in a form suitable for interactive visualization: T205562: Ingest data aggregate ReadingDepth data into Druid .
Thu, Nov 15, 3:59 PM · Product-Analytics
Tbayer renamed T205562: Ingest data aggregate ReadingDepth data into Druid from Ingest data into druid for readingDepth schema to Ingest data aggregate ReadingDepth data into Druid .
Thu, Nov 15, 3:58 PM · Readers-Web-Backlog (Tracking), Patch-For-Review, Analytics-Kanban, Analytics
Tbayer reopened T205562: Ingest data aggregate ReadingDepth data into Druid as "Open".

@Tbayer @Nuria

I loaded 1 month (Sept 2018) of ReadingDepth to Druid as a test. You can see it in Turnilo:
https://turnilo.wikimedia.org/#event_ReadingDepth/3/N4IgbglgzgrghgGwgLzgFwgewHYgFwhLYCmAtAMYAWcATmiADQgYC2xyOx+IAomuQHoAqgBUAwoxAAzCAjTEaUfAG1QaAJ4AHLgVZcmNYlO4B9E3sl6ASnGwBzYkryqQUNLXoEATAAYAjAAcpD4AnMF+Ij4+eFExPgB0UT4AWpLE2AAm3L6BpH4+4ZHRsVGJUakAvgC61UxQmkhoTi4a2twWTBkQbNhQWLgEZh0gdjS2MAi0EBrcAAoifgASklCYdPighsaD5t36IF2G5Bg43HBQ5Old9iAVTEgs0/jYEwi1IGznMIZOoNAAshMMPgpIgoMQ6hB7AgdCByJgYNh6EwWECICo4QikWkwOk0GIsfQakxNFCSBkACJ7Xr9ZpVElk4gZADKa08mMRyMIxAcmWeryYlAgdkoSBFnheCDeQA==

To load this test data set, I modified the EventLoggingToDruid.scala job to transform time measure fields into bucketized dimensions as suggested earlier in this task, so that we can discuss its use with some tangible examples. If you split by a time measure dimension, you'll see various branches that indicate orders of magnitude, like O(10^3). An order of magnitude of O(10^3), in this case, is equivalent to the interval [1000 -> 9999], and O(10^5) = [100000 -> 999999]. In the specific case of ReadingDepth, as measurements are done in milliseconds, O(10^5) would mean 100 to 999 seconds. The value of each branch at any data point corresponds to the number of events belonging to that range. Note that raw millisecond values will still be available in Hive for deep analysis.
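
The order-of-magnitude bucketing described above could look roughly like this. It is a Python sketch for illustration only, not the code in EventLoggingToDruid.scala; the function names are hypothetical, and how the real job treats zero or negative values is not stated here, so this sketch simply rejects them.

```python
def magnitude_bucket(value_ms: int) -> str:
    """Return a label such as 'O(10^3)' for a positive integer measurement."""
    if value_ms < 1:
        raise ValueError("expected a positive integer")
    exponent = len(str(value_ms)) - 1  # number of digits minus one
    return f"O(10^{exponent})"


def bucket_bounds(exponent: int) -> tuple:
    """Inclusive integer bounds of a bucket, e.g. 3 -> (1000, 9999)."""
    return 10 ** exponent, 10 ** (exponent + 1) - 1
```

For example, a measurement of 250000 ms (250 seconds) falls into the O(10^5) bucket, whose bounds are 100000 and 999999.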

Please @Tbayer, have a look and let's discuss whether this is of any value.
Thanks!

Thu, Nov 15, 3:53 PM · Readers-Web-Backlog (Tracking), Patch-For-Review, Analytics-Kanban, Analytics
Tbayer created T209598: Aggregate ReadingDepth data in a form suitable for interactive visualization.
Thu, Nov 15, 3:42 PM · Product-Analytics

Wed, Nov 14

Tbayer added a comment to T209540: Audit special pages to identify hashing requirements.

Looks like there could be some synergy with the web team's work, see e.g. T198218 .

Wed, Nov 14, 11:57 PM · Product-Analytics, Growth-Team
Tbayer updated subscribers of T209503: [EventLogging Sanitization] Enable older-than-90-day purging of unsanitized EL database (event) in Hive.

Just to double-check: The information in the documentation that "Sanitization happens right after events are generated (with a couple hours lag)" is still current, right? In that case I don't think this will be a concern (although we will need to update some queries - CCing @Groceryheist regarding ReadingDepth).

Wed, Nov 14, 8:37 PM · Analytics-EventLogging, Analytics-Kanban

Nov 14 2018

Tbayer closed T208789: Identify pages to be bucketed in page schema linked data A/B test as Resolved.
Nov 14 2018, 8:17 AM · MW-1.33-notes (1.33.0-wmf.4; 2018-11-13), Patch-For-Review, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), SEO
Tbayer closed T208789: Identify pages to be bucketed in page schema linked data A/B test, a subtask of T208755: Launch A/B test for sameAs property, as Resolved.
Nov 14 2018, 8:17 AM · Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), SEO
Tbayer added a comment to T208789: Identify pages to be bucketed in page schema linked data A/B test.

@Tbayer, if you have time, please review this prior to launch. If not, I think we should be ok.

Nov 14 2018, 8:15 AM · MW-1.33-notes (1.33.0-wmf.4; 2018-11-13), Patch-For-Review, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), SEO
Tbayer updated the task description for T208755: Launch A/B test for sameAs property.
Nov 14 2018, 7:30 AM · Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), SEO
Tbayer closed T208909: [Bug] Update old nonuniformly distributed page_random values as Resolved.

Checked that the following sets of pages look quite uniformly distributed now:
enwiki: https://quarry.wmflabs.org/query/31221 (a version of [2] from the task description that actually completes on Quarry)
enwiki: https://quarry.wmflabs.org/query/31220
Commons: https://quarry.wmflabs.org/query/31218
dewiki: https://quarry.wmflabs.org/query/31219
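
As a rough illustration of what such a uniformity check involves (this is not one of the Quarry queries linked above), the values can be grouped into ten equal-width buckets and the bucket counts compared. The Python sketch below assumes page_random values lie in [0, 1) and uses synthetic data from Python's own random generator rather than real page rows; `decile_counts` is a hypothetical helper name.

```python
import random
from collections import Counter


def decile_counts(values):
    """Count how many values fall into each tenth of the interval [0, 1)."""
    counts = Counter(min(int(v * 10), 9) for v in values)
    return [counts.get(i, 0) for i in range(10)]


# Synthetic stand-in for page_random values (assumption: uniform on [0, 1)).
rng = random.Random(0)
sample = [rng.random() for _ in range(100_000)]
counts = decile_counts(sample)
```

With a uniform distribution each bucket should hold close to a tenth of the rows; a bucket that is far above or below that share points to the kind of skew the task set out to fix.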

Nov 14 2018, 7:27 AM · MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), Patch-For-Review, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), DBA, MediaWiki-General-or-Unknown
Tbayer closed T208909: [Bug] Update old nonuniformly distributed page_random values, a subtask of T208789: Identify pages to be bucketed in page schema linked data A/B test, as Resolved.
Nov 14 2018, 7:27 AM · MW-1.33-notes (1.33.0-wmf.4; 2018-11-13), Patch-For-Review, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), SEO
Tbayer added a comment to T209422: Investigate the Sep 7 drop and the Oct 25 spike in iOS app's 7-day retention.

For reference, the implementation task for the underlying instrumentation: T126693

Nov 14 2018, 2:52 AM · Product-Analytics

Nov 13 2018

Tbayer added a comment to T209315: Enable Google Developer Access for SEO deployers.

This should be sorted out now. (Seems we still need to streamline and formalize the access granting process more.)

Nov 13 2018, 5:04 PM · Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), SEO
Tbayer added a comment to T204275: Understand traffic to Hindi Wikipedia in Madhya Pradesh during awareness campaign.

Thanks @Nuria !

Also, do we know the page to which the video was taking viewers on Facebook? It would be worth looking at pageviews for just that one page and plotting what we see.

According to T185584, the links we used for the campaign are:

Nov 13 2018, 8:57 AM · Hindi-Sites, Product-Analytics, New-Readers
Tbayer claimed T209051: ReadingDepth schema is whitelisting both session ids and page ids.

Still need to look into this with @ovasileva and possibly @Groceryheist .

Nov 13 2018, 5:50 AM · Analytics
Tbayer updated subscribers of T209051: ReadingDepth schema is whitelisting both session ids and page ids.
Nov 13 2018, 5:49 AM · Analytics
Tbayer added a comment to T209050: Print schema is whitelisting both session ids and page ids.

Discussed with @ovasileva today - we are going to remove the page IDs and keep the session IDs. I will submit a patch soon.

Nov 13 2018, 5:48 AM · Readers-Web-Backlog, Analytics
Tbayer updated subscribers of T209049: MobileWebSectionUsage schema is whitelisting both session ids and page ids.

Discussed with @ovasileva today - we are going to remove the session IDs and keep the page names. I will submit a patch soon.

Nov 13 2018, 5:47 AM · Product-Analytics, Analytics

Nov 12 2018

Tbayer added a comment to T208909: [Bug] Update old nonuniformly distributed page_random values.

Repeating query [4] from the task description, the distribution on ptwiki pages created on or before Dec 8, 2005 looks plausible now: https://quarry.wmflabs.org/query/31152
Will do the other checks by tomorrow evening PST (note that the Hive query [2] can't be re-run directly right now as it depends on the monthly Data Lake snapshot, but of course we can run it in MySQL/MariaDB elsewhere).

Nov 12 2018, 9:49 PM · MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), Patch-For-Review, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), DBA, MediaWiki-General-or-Unknown
Tbayer added a comment to T106650: Enrich articles with schema.org metadata.

See now also T198946: Add Schema property 'sameAs' pointing to Wikidata entries (which also adds a few other schema.org properties, see T198946#4672325 for details)

Nov 12 2018, 7:59 AM · Reading-Admin, MediaWiki-General-or-Unknown, SEO
Tbayer added a comment to T206898: Make plan for counting Global South edits and editors .

@Tbayer, I've created a CSV file with country names, ISO codes, Global North/South classification, and MaxMind continents, tracked in a new wikimedia-research/canonical-data repo. It contains all the countries which appear in projectview_hourly, and I've carefully checked it to make sure the Global North/South classifications match the ones at meta:List of countries by regional classification.

Cool, thanks for sorting this all out and vetting it!

Nov 12 2018, 1:29 AM · Contributors-Analysis, Product-Analytics

Nov 11 2018

Restricted Application added a project to T54510: Echo should provide notifications about your revision being approved or rejected on wikis with FlaggedRevs enabled: Growth-Team.

I nominated this for the 2019 community wishlist survey (as a volunteer), although it remains to be seen whether it fits the scope.

Nov 11 2018, 6:04 PM · Growth-Team, Patch-For-Review, Collaboration-Team-Triage, MediaWiki-extensions-FlaggedRevs, Notifications
Tbayer added a comment to T105974: Follow up with Aaron about Reading patterns.

Context: https://meta.wikimedia.org/wiki/Research:Directed_diabetes_info-seeking_behavior_in_Wikipedia

Nov 11 2018, 6:02 PM · Reading-Admin

Nov 10 2018

Tbayer added a project to T209049: MobileWebSectionUsage schema is whitelisting both session ids and page ids: Product-Analytics.
Nov 10 2018, 3:44 AM · Product-Analytics, Analytics
Tbayer claimed T209049: MobileWebSectionUsage schema is whitelisting both session ids and page ids.

Yes, I can take care of this.

Nov 10 2018, 3:43 AM · Product-Analytics, Analytics

Nov 9 2018

Tbayer added a comment to T209087: [EventLogging Sanitization] Update EL sanitization white-list for field renames in EL schemas.

PS: and (in the name of the team) thanks for catching this!

Nov 9 2018, 4:55 PM · Product-Analytics, Reading-analysis, Analytics
Tbayer placed T209087: [EventLogging Sanitization] Update EL sanitization white-list for field renames in EL schemas up for grabs.

@mforns: I assume "you" in the task description refers to me (since you assigned the task to me). I didn't have anything to do with the original creation of the schema or the field renames in question, and am not among the schema's maintainers.
We'll likely discuss this in our team meeting later today - it's probably best if the involved analysts determine the precise list of field names to be added, although I'll be happy to help submit the resulting whitelist patch, as I did earlier this week in the case of the Popups schema.

Nov 9 2018, 4:54 PM · Product-Analytics, Reading-analysis, Analytics
Tbayer added a comment to T206898: Make plan for counting Global South edits and editors.

@Tbayer,

The list of Global North countries I've been using is:

(
    "AD", "AL", "AT", "AX", "BA", "BE", "BG", "CH", "CY", "CZ",
    "DE", "DK", "EE", "ES", "FI", "FO", "FR", "FX", "GB", "GG",
    "GI", "GL", "GR", "HR", "HU", "IE", "IL", "IM", "IS", "IT",
    "JE", "LI", "LU", "LV", "MC", "MD", "ME", "MK", "MT", "NL",
    "NO", "PL", "PT", "RO", "RS", "RU", "SE", "SI", "SJ", "SK",
    "SM", "TR", "VA", "AU", "CA", "HK", "MO", "NZ", "JP", "SG",
    "KR", "TW", "US"
)
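As an illustration only, the quoted list can be turned into a simple membership check. This is a minimal Python sketch, not the HiveQL expression referred to in this thread; the names `GLOBAL_NORTH` and `is_global_north` are hypothetical, and the codes are copied from the list above without further vetting.

```python
# Country codes copied verbatim from the quoted list above.
GLOBAL_NORTH = frozenset({
    "AD", "AL", "AT", "AX", "BA", "BE", "BG", "CH", "CY", "CZ",
    "DE", "DK", "EE", "ES", "FI", "FO", "FR", "FX", "GB", "GG",
    "GI", "GL", "GR", "HR", "HU", "IE", "IL", "IM", "IS", "IT",
    "JE", "LI", "LU", "LV", "MC", "MD", "ME", "MK", "MT", "NL",
    "NO", "PL", "PT", "RO", "RS", "RU", "SE", "SI", "SJ", "SK",
    "SM", "TR", "VA", "AU", "CA", "HK", "MO", "NZ", "JP", "SG",
    "KR", "TW", "US",
})


def is_global_north(country_code):
    """Return True if the two-letter code is in the quoted list.

    Hypothetical helper: input is normalized to upper case, and
    anything not in the list (including unknown codes) returns False.
    """
    return country_code.strip().upper() in GLOBAL_NORTH


print(is_global_north("de"))  # code taken from the list above
print(is_global_north("BR"))  # code not present in the list above
```

Treating every unlisted code as not Global North is a design choice of this sketch; unknown or missing country codes may deserve a separate category in a real analysis.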

OK, here is the HiveQL expression determining GN I have been using for the past few years:

Nov 9 2018, 3:47 PM · Contributors-Analysis, Product-Analytics
Tbayer claimed T209050: Print schema is whitelisting both session ids and page ids.

I'll look into this next week with @ovasileva .

Nov 9 2018, 4:30 AM · Readers-Web-Backlog, Analytics
Tbayer added a comment to T208909: [Bug] Update old nonuniformly distributed page_random values.

Thanks for the feedback, @jcrespo. @pmiazga and @phuedx are working on a new PHP MySQL script per your advice. We’d greatly appreciate your and your colleagues’ review of it, and we have some additional questions:

  1. We don’t know how long this script will take to run. Our initial draft query demonstrates page_random repopulation using MySQL’s RAND(). Since we’re taking the trouble to write a maintenance script, we intend to use the preferred wfRandom() MediaWiki function instead, which matches the implementation for newly inserted pages today. We estimate that about 1.5 million rows will be changed across projects, ~0.6 million of those on enwiki. In your experience, how long does a change like this take?

Yes, this would be great to know - I have no idea myself either, but looking at the above-mentioned case in T166733#4709400, it seems that batched updating of 16.8 million rows (across 9 tables) took 11 hours there.
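
For a rough sense of scale, the two figures above can be combined into a back-of-the-envelope estimate. This is only a sketch: it assumes the duration scales linearly with row count and that the T166733 throughput carries over to the page table, neither of which is established here. The function name `estimate_hours` is hypothetical.

```python
# Figures quoted in the comments above.
REFERENCE_ROWS = 16_800_000  # rows updated in the T166733 case (across 9 tables)
REFERENCE_HOURS = 11         # reported duration of that batched update
TARGET_ROWS = 1_500_000      # estimated page_random rows to change


def estimate_hours(target_rows, reference_rows, reference_hours):
    """Scale the reference duration linearly by row count (an assumption)."""
    rows_per_hour = reference_rows / reference_hours
    return target_rows / rows_per_hour


hours = estimate_hours(TARGET_ROWS, REFERENCE_ROWS, REFERENCE_HOURS)
print(f"Estimated duration: {hours:.2f} hours")
```

Under these assumptions the update would take roughly one hour, but differences in table size, indexes, batching and replication wait times could change that considerably.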

Nov 9 2018, 2:36 AM · MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), Patch-For-Review, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), DBA, MediaWiki-General-or-Unknown
Tbayer updated subscribers of T208909: [Bug] Update old nonuniformly distributed page_random values.

Thanks for the feedback, @jcrespo. @pmiazga and @phuedx are working on a new PHP MySQL script per your advice. We’d greatly appreciate your and your colleagues’ review of it, and we have some additional questions:

[...]

  1. Can you help us identify parties we should coordinate with based on your experience making similar changes? In your comment you mentioned #wikimedia-operations if the update was quick, but per the previous bullet, we’re unsure. Does ~1.5 million rows seem like a long-running change? Would it be necessary or advisable to divide this change into enwiki and non-enwiki updates? Lastly, are Anomie’s T166733 and T188132 scripts running constantly or intermittently? They seem to be scheduled for weeks! Would they interfere with a simultaneous update to page_random?

FWIW, it seems that @Anomie's updates are not operating on the page table that we are concerned with here, but on different tables (namely image for T188132, and revision, archive, logging, ipblocks, image, oldimage, filearchive, protected_titles and recentchanges for T166733, according to T166733#4709400 - @Anomie, can you confirm?).

Nov 9 2018, 2:32 AM · MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), Patch-For-Review, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), DBA, MediaWiki-General-or-Unknown
Tbayer added a comment to T208909: [Bug] Update old nonuniformly distributed page_random values.

AND rev_timestamp < '20050601000000'

We didn't have CI back then; we deployed major versions after months of delay. I think the correct end date is the deployment of MediaWiki 1.5, i.e. 2005-07-07 per https://wikitech.wikimedia.org/wiki/Server_admin_log/Archive_4#July_6 . This agrees with the query results I posted previously.
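
The condition quoted above compares rev_timestamp against a 14-digit YYYYMMDDHHMMSS string, the format MediaWiki uses for timestamps. As a minimal sketch, a cutoff string for the later date could be produced as follows. The helper name `to_mw_timestamp` is hypothetical, and using midnight UTC on 2005-07-07 is an assumption, since the exact deployment time is not given here.

```python
from datetime import datetime, timezone


def to_mw_timestamp(dt):
    """Format a timezone-aware datetime as a 14-digit YYYYMMDDHHMMSS string in UTC."""
    return dt.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


# Assumed cutoff: start of 2005-07-07 in UTC (the time of day is a guess).
cutoff = to_mw_timestamp(datetime(2005, 7, 7, tzinfo=timezone.utc))
print(cutoff)

# Fixed-width digit strings sort in chronological order, so a plain
# string comparison is enough to order two timestamps.
print("20050601000000" < cutoff)
```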

Nov 9 2018, 2:04 AM · MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), Patch-For-Review, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), DBA, MediaWiki-General-or-Unknown

Nov 8 2018

Tbayer updated the task description for T208909: [Bug] Update old nonuniformly distributed page_random values.
Nov 8 2018, 9:03 AM · MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), Patch-For-Review, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), DBA, MediaWiki-General-or-Unknown

Nov 7 2018

Tbayer updated the task description for T208909: [Bug] Update old nonuniformly distributed page_random values.
Nov 7 2018, 8:37 PM · MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), Patch-For-Review, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), DBA, MediaWiki-General-or-Unknown
Tbayer moved T199157: [Spike ??hrs] Sticky header instrumentation from Doing to Stalled on the Product-Analytics board.
Nov 7 2018, 6:18 PM · Product-Analytics, Reading-analysis, Analytics, Readers-Web-Backlog, MinervaNeue, Design
Tbayer created T208977: Prepare readers section of Audiences quarterly metrics review.
Nov 7 2018, 6:17 PM · Product-Analytics
Tbayer moved T202791: Update Audiences page and Key Product Metrics with October 2018 Readers data from Blocked to Next Up on the Product-Analytics board.
Nov 7 2018, 6:15 PM · Product-Analytics