
mforns (Marcel Ruiz Forns)
Software Engineer @ Analytics

User Details

User Since
Nov 7 2014, 8:52 PM (336 w, 4 d)
Availability
Available
IRC Nick
mforns
LDAP User
Mforns
MediaWiki User
Unknown

Recent Activity

Mon, Apr 19

mforns added a comment to T280293: Delete UpperCased eventlogging legacy directories in /wmf/data/event 90 days from 2021-04-15 (after 2021-07-14).

I don't see that happening in the event_sanitized base directory.
Is refine_sanitize not going to do that as well?

Mon, Apr 19, 1:22 PM · Analytics

Thu, Apr 15

mforns added a comment to T273313: [SessionLength] SessionLength Documentation.

I added a paragraph in the caveats section of the documentation:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/SessionLength#Ad_blockers
Please, let me know if you think that is enough!

Thu, Apr 15, 3:58 PM · Better Use Of Data
mforns added a comment to T280255: [session length] Drop test data, tables and dashboards; and clean production data.

This is done!

Thu, Apr 15, 2:16 PM · Better Use Of Data
mforns created T280256: [session length] Change domain of event collection to avoid ad-blocker issue.
Thu, Apr 15, 2:14 PM · Better Use Of Data
mforns moved T280255: [session length] Drop test data, tables and dashboards; and clean production data from Inbox to Sign-off on the Better Use Of Data board.
Thu, Apr 15, 1:57 PM · Better Use Of Data
mforns created T280255: [session length] Drop test data, tables and dashboards; and clean production data.
Thu, Apr 15, 1:56 PM · Better Use Of Data
mforns created T280254: [session length] Investigate slight drop at sessions of 30 minutes or more.
Thu, Apr 15, 1:48 PM · Better Use Of Data

Wed, Apr 14

mforns moved T238138: VirtualPageView Event Platform Migration from Next Up to In Progress on the Analytics-Kanban board.
Wed, Apr 14, 3:24 PM · Analytics-Kanban, Analytics, Event-Platform, Epic, Better Use Of Data, Product-Infrastructure-Team-Backlog

Fri, Apr 2

mforns added a comment to T278815: Produce a list of wiki projects ranked by number of eligible voters in Board elections.

I have what I think is good news. :)

While it is exciting to get more accurate results and I have been the first one proposing to fine tune the query... What our team needs is in fact only a ranking of wiki projects. The approximation of your first query with the February data is good enough already. If you can run a full query (not capped to top 100 results) based on the March data, that will be great and more than enough.

OK! We'll do it as soon as the snapshot for March is ready.

Fri, Apr 2, 3:55 PM · Analytics

Wed, Mar 31

mforns moved T277512: Optimize intermediate session length data set and dashboard from Doing to Sign-off on the Better Use Of Data board.
Wed, Mar 31, 6:40 PM · Analytics-Kanban, Better Use Of Data, Analytics
mforns moved T276502: [SessionLength] Change sampling rate to 10% from QA/Review to Sign-off on the Better Use Of Data board.
Wed, Mar 31, 6:38 PM · Product-Analytics, Better Use Of Data

Tue, Mar 30

mforns added a comment to T278815: Produce a list of wiki projects ranked by number of eligible voters in Board elections.

Hi!
We tried this query to extract the rank of wikis per voter base:

WITH base_data AS (
    SELECT
        wiki_db,
        event_user_id,
        MAX(event_user_revision_count) AS rev_count,
        COUNT(1) AS revs_last_6_months
    FROM wmf.mediawiki_history
    WHERE
        snapshot = '2021-02' AND
        event_entity = 'revision' AND
        event_type = 'create' AND
        event_timestamp >= '2020-09-01' AND
        NOT ARRAY_CONTAINS(event_user_is_bot_by, 'group') AND
        event_user_id IS NOT NULL
    GROUP BY
        wiki_db,
        event_user_id
    HAVING
        revs_last_6_months >= 30 AND
        rev_count >= 300
)
SELECT
    wiki_db,
    COUNT(1) AS voters
FROM base_data
GROUP BY wiki_db
ORDER BY voters DESC
LIMIT 100
;

The results are the following:

wiki_db	voters
enwiki	20659
wikidatawiki	9098
commonswiki	8542
dewiki	4820
frwiki	3513
jawiki	2986
eswiki	2603
ruwiki	2558
zhwiki	2057
itwiki	1895
plwiki	1111
ptwiki	1066
nlwiki	921
metawiki	847
ukwiki	752
hewiki	751
fawiki	684
kowiki	602
enwiktionary	541
svwiki	491
cswiki	489
arwiki	481
viwiki	401
huwiki	392
trwiki	389
idwiki	386
fiwiki	378
cawiki	309
hywiki	287
nowiki	238
thwiki	227
mediawikiwiki	226
enwikisource	200
srwiki	191
incubatorwiki	187
bnwiki	171
elwiki	168
frwikisource	166
bgwiki	154
azwiki	147
rowiki	136
dawiki	135
frwiktionary	133
hrwiki	130
simplewiki	124
etwiki	106
enwikivoyage	96
mswiki	89
euwiki	89
skwiki	86
hiwiki	77
glwiki	73
eowiki	72
slwiki	71
specieswiki	71
lvwiki	67
enwikiquote	67
tawiki	66
ltwiki	65
kawiki	65
mlwiki	65
enwikibooks	65
bewiki	65
dewiktionary	59
mkwiki	54
zh_yuewiki	54
dewikisource	52
enwikiversity	51
ruwiktionary	49
tawikisource	48
itwikisource	47
plwiktionary	45
urwiki	42
plwikisource	41
kkwiki	39
tewiki	37
mywiki	35
ruwikivoyage	35
ruwikisource	35
ckbwiki	34
zhwikisource	33
ttwiki	33
lawiki	33
sqwiki	32
dewikibooks	32
mrwiki	31
itwikiquote	30
sourceswiki	30
hewikisource	30
ruwikinews	29
ruwikimedia	28
sawikisource	28
enwikinews	28
bawiki	28
afwiki	28
bnwikisource	28
cswiktionary	27
scowiki	27
aswiki	26
cywiki	26

We made some assumptions, please confirm or contest them:

  • Edits to all namespaces count
  • Editors belonging to the BOT group don't count
  • Cross-project edits don't count
  • Edits to deleted pages count
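
To illustrate the eligibility rule that the query encodes, here is a minimal Python sketch. It assumes hypothetical in-memory rows (the `Revision` tuple and `rank_wikis_by_voters` are illustrative names, not part of any Wikimedia code) and mirrors the query's filters: non-bot editors with a known user id, at least 30 revisions since the cutoff date, and a cumulative revision count of at least 300, counted per wiki.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Revision(NamedTuple):
    """Hypothetical stand-in for one 'revision create' row of the history table."""
    wiki_db: str
    user_id: Optional[int]
    user_revision_count: int  # cumulative revision count of the user at this edit
    timestamp: str            # ISO date, e.g. "2020-10-03"
    is_bot: bool


def rank_wikis_by_voters(revisions, since="2020-09-01", min_recent=30, min_total=300):
    """Return (wiki_db, voters) pairs, sorted by voters in descending order."""
    recent = Counter()  # (wiki, user) -> number of revisions since `since`
    total = {}          # (wiki, user) -> highest cumulative revision count seen
    for r in revisions:
        # Same filters as the WHERE clause: no bots, known user, recent edits only.
        if r.is_bot or r.user_id is None or r.timestamp < since:
            continue
        key = (r.wiki_db, r.user_id)
        recent[key] += 1
        total[key] = max(total.get(key, 0), r.user_revision_count)
    # Same thresholds as the HAVING clause, then one count per wiki.
    voters = Counter(
        wiki
        for (wiki, user), n in recent.items()
        if n >= min_recent and total[(wiki, user)] >= min_total
    )
    return voters.most_common()
```

Because the grouping key is the pair (wiki, user), edits made on other projects never add up, which is the "cross-project edits don't count" assumption listed above.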
Tue, Mar 30, 7:52 PM · Analytics

Fri, Mar 26

mforns added a comment to T273246: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage.

@EYener, I reviewed the changes. They look a lot better, thanks.
The only changes needed are very minor. Please, have a look and let me know.
I'd ask you to combine all commits for a single schema into one atomic commit;
I explain what I mean in the comments. Let me know if I can help :]
Cheers

Fri, Mar 26, 2:51 PM · Fundraising-Backlog, FR-Tech-Analytics, Analytics
mforns added a comment to T273246: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage.

Hi @EYener, sorry for the delay, I was off for a couple of days. Looking now.

Fri, Mar 26, 2:14 PM · Fundraising-Backlog, FR-Tech-Analytics, Analytics

Thu, Mar 25

mforns reassigned T277348: Hive Runtime Error - Query on event.MobileWikiAppDailyStats failing with errors from mforns to JAllemandou.
Thu, Mar 25, 6:56 PM · Product-Analytics, Analytics
mforns moved T277512: Optimize intermediate session length data set and dashboard from Next Up to Done on the Analytics-Kanban board.
Thu, Mar 25, 6:24 PM · Analytics-Kanban, Better Use Of Data, Analytics
mforns added a comment to T277512: Optimize intermediate session length data set and dashboard.

As the main problem that we had in the session length dashboard has been significantly mitigated by https://gerrit.wikimedia.org/r/672541, I consider this task done.
There are other suggested optimizations that are tackled in other tasks; those will continue to be worked on there.
For the suggestions that don't have their own task yet, please create one if you think they need to be implemented in the future.
Thanks!

Thu, Mar 25, 6:24 PM · Analytics-Kanban, Better Use Of Data, Analytics
mforns updated the task description for T277512: Optimize intermediate session length data set and dashboard.
Thu, Mar 25, 6:19 PM · Analytics-Kanban, Better Use Of Data, Analytics
mforns added a comment to T277785: Add "did edit" field to pageview_actor.

I think this would be super useful!
We could use this information to filter out rows in reading data sets that we intend to make public, like pageviews per article per country.
Not reporting on sessions that included editing would break the bridge between the data set and the public wiki databases, thus allowing us to publish data with more granularity!
And the amount of data that we'd be losing would be orders of magnitude smaller than the total.

Thu, Mar 25, 4:40 PM · Analytics

Mar 19 2021

mforns committed rARPQcb240ca542b6: Filter bot traffic out of metrics (authored by awight).
Filter bot traffic out of metrics
Mar 19 2021, 6:12 PM

Mar 15 2021

mforns added a comment to T277512: Optimize intermediate session length data set and dashboard.

@cchen once we merge and deploy the optimization above, the session length dashboard will need some adjustments.
Maybe we can set up a meeting and pair on them, I think it will be the fastest. It shouldn't be a big thing, maybe 30 minutes.

Mar 15 2021, 9:49 PM · Analytics-Kanban, Better Use Of Data, Analytics
mforns added a comment to T277512: Optimize intermediate session length data set and dashboard.

I think solving #2 will be enough for the dashboard to perform fine for several months, maybe a couple years.
I'm already working on that. It should be a small change to the way we store data in the intermediate table, that will need a couple adjustments to the dashboard queries.
It won't alter the contents of the data set, just the format. And thus, the results (data, charts, etc.) of the dashboard will not be altered.

Mar 15 2021, 9:46 PM · Analytics-Kanban, Better Use Of Data, Analytics
mforns created T277512: Optimize intermediate session length data set and dashboard.
Mar 15 2021, 9:42 PM · Analytics-Kanban, Better Use Of Data, Analytics

Mar 12 2021

mforns committed rARPQ692145722700: Fix typo: no "performer" field (authored by awight).
Fix typo: no "performer" field
Mar 12 2021, 6:11 PM

Mar 11 2021

mforns added a comment to T277193: wgEventStreams (EventStreamConfig) should support per wiki overrides.

Could we key the config by stream name, and have an extra key "regexp_streams" (better name to be found) that contains an integer-indexed list of all regex stream configs?
No idea if that fulfills the requirements for stream config discovery, but I imagined that whoever is trying to discover a stream config, could first try by key, and then try accessing the "regexp_streams" section and traverse that?
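
The lookup order suggested above (exact key first, then the regex list) could be sketched as follows. This is only an illustration in Python under stated assumptions: the `regexp_streams` key comes from the comment, while the `pattern` and `settings` fields of each entry and the function name `find_stream_config` are hypothetical, and this is not the EventStreamConfig API.

```python
import re

REGEXP_KEY = "regexp_streams"  # key name proposed in the comment above


def find_stream_config(config, stream_name):
    """Look up a stream config by exact name, then fall back to regex entries."""
    # 1. Try the stream name as a plain key (the reserved key is not a stream).
    if stream_name != REGEXP_KEY and stream_name in config:
        return config[stream_name]
    # 2. Traverse the integer-indexed list of regex stream configs, in order.
    for entry in config.get(REGEXP_KEY, []):
        if re.fullmatch(entry["pattern"], stream_name):  # hypothetical field
            return entry["settings"]                     # hypothetical field
    return None
```

With this layout, the first matching regex entry wins, so the order of the list would matter whenever two patterns overlap.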

Mar 11 2021, 4:55 PM · Better Use Of Data, Analytics, Event-Platform

Mar 10 2021

mforns moved T276502: [SessionLength] Change sampling rate to 10% from To Do to Doing on the Better Use Of Data board.
Mar 10 2021, 7:16 PM · Product-Analytics, Better Use Of Data

Mar 8 2021

mforns updated the task description for T271164: DesktopWebUIActionsTracking Event Platform Migration.
Mar 8 2021, 7:46 PM · MW-1.36-notes (1.36.0-wmf.34; 2021-03-09), Analytics-Kanban, Patch-For-Review, Analytics, Event-Platform
mforns updated the task description for T267347: MobileWebUIActionsTracking Event Platform Migration.
Mar 8 2021, 7:46 PM · Patch-For-Review, Analytics-Kanban, Event-Platform, Analytics
mforns updated the task description for T267351: SuggestedTagsAction Event Platform Migration.
Mar 8 2021, 4:29 PM · MW-1.36-notes (1.36.0-wmf.34; 2021-03-09), Patch-For-Review, Analytics-Kanban, Structured-Data-Backlog, Event-Platform, Analytics
mforns added a comment to T276636: [SessionLength] Add ability to adjust sampling rate per wiki.

Last Friday I learned that we can already configure sampling rate per wiki!
The changes needed to make this useful in the session length project would be:

  • storing the sampling rate field in the session_tick schema (or in all schemas)
  • using that rate in the session length metric computations
Mar 8 2021, 3:53 PM · Product-Analytics, Better Use Of Data
mforns added a comment to T276502: [SessionLength] Change sampling rate to 10%.

@kzimmerman
I talked with @Ottomata last Friday, about scaling up EventGate, and we decided to do it now.
Based on the discussion in our previous meeting, I think that is OK with you.
So, @Ottomata already doubled the number of EventGate instances, and we can move on to increasing session_tick sampling rate to 10%, if we want.
I went ahead and changed the title of this task to 10% (and modified the corresponding code change), but LMK if there's any problem with that!
If we start collecting events at 10%, we'll already be able to report with more accuracy on smaller wikis.

Mar 8 2021, 3:18 PM · Product-Analytics, Better Use Of Data
mforns renamed T276502: [SessionLength] Change sampling rate to 10% from [SessionLength] Change sampling rate to 5% to [SessionLength] Change sampling rate to 10%.
Mar 8 2021, 3:02 PM · Product-Analytics, Better Use Of Data

Mar 5 2021

mforns added a comment to T273246: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage.

I realize that this data is past 90 days old at this point and is no longer available in Hive. Is there any mechanism for reviving this data?

Unfortunately, unsanitized data older than 90 days is irrevocably deleted to abide by our data retention guidelines.
In this case, the oldest date for which we have data for WikipediaPortal schema is Dec 5th 2020.
Note that, every day that passes, the oldest day of data will be deleted.
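
The date arithmetic behind that cutoff can be checked with a short Python sketch (the function name `oldest_retained_day` is made up for illustration, and the actual purge job may define its boundaries differently):

```python
from datetime import date, timedelta

RETENTION_DAYS = 90  # retention window for unsanitized data, as stated above


def oldest_retained_day(today, retention_days=RETENTION_DAYS):
    """Return the oldest day that still falls inside the retention window."""
    return today - timedelta(days=retention_days)


# On Mar 5 2021 the window reaches back to Dec 5 2020.
print(oldest_retained_day(date(2021, 3, 5)))  # prints 2020-12-05
```

Each day that passes moves the result forward by one day, which is why the oldest available day keeps disappearing.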

Mar 5 2021, 5:25 PM · Fundraising-Backlog, FR-Tech-Analytics, Analytics

Mar 4 2021

mforns added a comment to T276502: [SessionLength] Change sampling rate to 10%.

Wow @Mholloway, you're ninja-fast!
Thanks for the patch :-)

Mar 4 2021, 9:02 PM · Product-Analytics, Better Use Of Data

Mar 3 2021

mforns updated the task description for T274322: Clean up issues with jobs after Hadoop Upgrade.
Mar 3 2021, 3:36 PM · Patch-For-Review, Analytics-Kanban, Analytics

Feb 26 2021

mforns updated subscribers of T273246: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage.

Hi @Jdrewniak and @mpopov
I'm pinging you here to discuss the WikipediaPortal schema.
I've seen you listed as schema owners on the schema's talk page, and I supposed you worked on the schema creation and instrumentation.

Feb 26 2021, 5:10 PM · Fundraising-Backlog, FR-Tech-Analytics, Analytics

Feb 25 2021

mforns moved T272052: Traffic anomalies: Factor out list of countries into a dedicated Hive table from Next Up to Done on the Analytics-Kanban board.
Feb 25 2021, 4:58 PM · Analytics-Kanban, SRE, Traffic, Analytics
mforns claimed T272052: Traffic anomalies: Factor out list of countries into a dedicated Hive table.
Feb 25 2021, 4:58 PM · Analytics-Kanban, SRE, Traffic, Analytics
mforns moved T273821: Growth: delete data older than 90 days from Next Up to Done on the Analytics-Kanban board.
Feb 25 2021, 4:56 PM · Analytics-Kanban, Growth-Scaling, Product-Analytics, Analytics, Growth-Team
mforns moved T274297: Growth: remove deletion timers for Growth's sanitized EL tables from Next Up to Done on the Analytics-Kanban board.
Feb 25 2021, 4:56 PM · Analytics-Kanban, Growth-Scaling, Product-Analytics, Analytics, Growth-Team
mforns added a comment to T273246: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage.

Hi all! I reviewed the include-list patches and left some comments there.
Please, don't feel overwhelmed by the review! Let's discuss and arrive at a solution :]
Thanks for making these changes.

Feb 25 2021, 4:03 PM · Fundraising-Backlog, FR-Tech-Analytics, Analytics
mforns renamed T263030: Make data quality stats alert only if anomalous metrics change from Separate RSVD anomaly detection into a systemd timer for better alarming with Icinga to Make data quality stats alert only if anomalous metrics change.
Feb 25 2021, 1:02 AM · Analytics

Feb 24 2021

mforns added a comment to T273789: Sanitize and ingest all event tables into the event_sanitized database.

Would it be scope creep to add a way for Refine to not traverse part of a directory tree?
This way we can have 2 sanitization jobs that go over the event database base directory, but not repeat all the RefineTarget extraction.

Feb 24 2021, 7:55 PM · Patch-For-Review, Analytics-Kanban, Event-Platform, Analytics

Feb 23 2021

mforns added a comment to T275171: Growth: shorten welcome survey retention to 90 days.

Same, sorry for the misunderstanding.

Feb 23 2021, 3:56 PM · Growth-Team (Current Sprint), Analytics-Radar, Growth-Scaling, Product-Analytics
mforns added a comment to T275172: Growth: update welcome survey aggregation schedule.

Oh, we groomed this task and assigned it to me by mistake, thanks for fixing :]

Feb 23 2021, 3:55 PM · Patch-For-Review, Growth-Team (Current Sprint), Product-Analytics (Kanban), Analytics-Radar, Growth-Scaling
mforns added a comment to T274297: Growth: remove deletion timers for Growth's sanitized EL tables.

I created 2 patches to solve this, they are +2'd, we're waiting for deployment.
By mistake I assigned them to the other phab task (T273821).
They are: https://gerrit.wikimedia.org/r/c/operations/puppet/+/665326 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/665328

Feb 23 2021, 3:45 PM · Analytics-Kanban, Growth-Scaling, Product-Analytics, Analytics, Growth-Team

Feb 19 2021

mforns added a comment to T273826: Growth: remove Homepage and Help Panel schemas from the schema whitelist.

I think this is done!

Feb 19 2021, 2:06 PM · Product-Analytics (Kanban), Analytics-Radar, Growth-Team (Current Sprint)
mforns claimed T273821: Growth: delete data older than 90 days .
Feb 19 2021, 2:06 PM · Analytics-Kanban, Growth-Scaling, Product-Analytics, Analytics, Growth-Team
mforns added a comment to T273821: Growth: delete data older than 90 days .

The data has been deleted!
Once those patches get merged (unused jobs), we can close this task.

Feb 19 2021, 2:03 PM · Analytics-Kanban, Growth-Scaling, Product-Analytics, Analytics, Growth-Team

Feb 18 2021

mforns moved T273116: Create Oozie job for session length from Doing to QA/Review on the Better Use Of Data board.
Feb 18 2021, 11:06 PM · Analytics-Kanban, Better Use Of Data, Analytics
mforns moved T273116: Create Oozie job for session length from In Progress to In Code Review on the Analytics-Kanban board.
Feb 18 2021, 11:06 PM · Analytics-Kanban, Better Use Of Data, Analytics
mforns updated subscribers of T273116: Create Oozie job for session length.

I finished and tested the Oozie job, and it seems to be working fine!
@Mayakp.wiki and I sync'ed up on data checks and, while Maya is making sure the data looks good, I will push the code for CR with my team.
One question to @jlinehan and @Mholloway: When should the Oozie job start computing? Do we want to wait until we have the events flowing from all wikis at 1/10, or should we start already? Thanks!

Feb 18 2021, 10:23 PM · Analytics-Kanban, Better Use Of Data, Analytics

Feb 16 2021

mforns added a comment to T274823: Big increase in traffic for projects except 'wikipedia' family since Feb 14th.

We could add a tag to pageviews generated by actors with high-traffic IPs.
It would not change the way we process, count or classify traffic today,
but we could use it to filter out this type of traffic when doing analyses like traffic anomalies.
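
A rough Python sketch of that idea, assuming hypothetical pageview rows as dictionaries with an `ip` field and an arbitrary request threshold (neither the field names nor the threshold come from the actual pipeline):

```python
from collections import Counter


def tag_high_traffic(pageviews, threshold):
    """Return copies of the rows with a boolean 'high_traffic_ip' tag added."""
    requests_per_ip = Counter(pv["ip"] for pv in pageviews)
    # The tag only annotates rows; nothing is dropped or reclassified here.
    return [
        {**pv, "high_traffic_ip": requests_per_ip[pv["ip"]] >= threshold}
        for pv in pageviews
    ]


def without_high_traffic(tagged_pageviews):
    """Filter used only by analyses that want to exclude this kind of traffic."""
    return [pv for pv in tagged_pageviews if not pv["high_traffic_ip"]]
```

Keeping the tag separate from the filter reflects the point above: counting and classification stay as they are, and only analyses such as traffic anomaly detection opt in to the exclusion.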

Feb 16 2021, 7:18 PM · Analytics-Radar, Product-Analytics (Kanban)
mforns committed rARPQ9cfd0dc6a6a6: Fix case of metric path (authored by awight).
Fix case of metric path
Feb 16 2021, 3:18 PM
mforns added a comment to T274823: Big increase in traffic for projects except 'wikipedia' family since Feb 14th.

There clearly seems to be a small number of IPs making most requests for the projects that have seen a change (en.wikipedia, commons.wikipedia for instance).

Thanks for looking into this! That makes sense. It's curious how the automated traffic detection didn't catch those, if they share IPs. Maybe we can improve the heuristics for this particular case.

Feb 16 2021, 1:22 PM · Analytics-Radar, Product-Analytics (Kanban)
mforns updated subscribers of T273454: Compensate for sampling.

@awight
Re. graphite: I haven't ever dealt with back-filling graphite metrics. I'm not sure they can be backfilled, or purged by a given time range. Maybe @elukey knows?
Re. Reportupdater queries: Do you mean the TSV reports generated by those queries? That's easier, we could just delete the reports' contents since Jan 1st. Reportupdater would pick up from there and rerun all dates automatically. If you confirm that's what you want, I'll do that.

Feb 16 2021, 1:19 PM · WMDE-TechWish-Sprint-2021-03-03, WMDE-TechWish-Sprint-2021-02-17, WMDE-TechWish (Sprint-2021-02-03), MW-1.36-notes (1.36.0-wmf.30; 2021-02-09), Analytics-Radar, WMDE-Templates-FocusArea
mforns added a comment to T274880: Deployment access request for some analytics repos.

I think in general that's the way we should go! Give the teams the capability to test, deploy and manage their jobs independently. We are accumulating more and more data sets, and we don't scale to handle them all in time.
However, there's a lot of things we should get in place before that, like a data governance solution (Atlas?), a cluster fine-grained auth system (Ranger?), a comprehensive and flexible scheduler (Airflow?), and probably a cluster resource manager.
In the meantime, though, I think giving one of you guys merge rights to some of our repos would be ok! Other opinions from the team?

Feb 16 2021, 12:49 PM · Analytics, WMDE-TechWish

Feb 15 2021

mforns created T274823: Big increase in traffic for projects except 'wikipedia' family since Feb 14th.
Feb 15 2021, 9:40 PM · Analytics-Radar, Product-Analytics (Kanban)
mforns added a comment to T273789: Sanitize and ingest all event tables into the event_sanitized database.

Yes, or we can have another instance of a sanitization job, that reads from a separate include-list specific for non EventLogging data sets?

Feb 15 2021, 8:00 PM · Patch-For-Review, Analytics-Kanban, Event-Platform, Analytics
mforns added a comment to T273313: [SessionLength] SessionLength Documentation.

@mpopov Awesome summary/intro, liked it very much.

Feb 15 2021, 7:58 PM · Better Use Of Data

Feb 12 2021

mforns added a comment to T273821: Growth: delete data older than 90 days .

I just merged the patch that removes the Growth schemas from the include-list (T273826).
When that is deployed, I will delete the 4 tables from the event_sanitized database.
And also remove some puppet code that was purging those tables after 270 days.
Will ping you when done.
Cheers!

Feb 12 2021, 7:11 PM · Analytics-Kanban, Growth-Scaling, Product-Analytics, Analytics, Growth-Team

Feb 10 2021

mforns moved T273116: Create Oozie job for session length from To Do to Doing on the Better Use Of Data board.
Feb 10 2021, 7:04 PM · Analytics-Kanban, Better Use Of Data, Analytics

Feb 9 2021

mforns added a comment to T273789: Sanitize and ingest all event tables into the event_sanitized database.

@Ottomata
Is a sanitization refine job needed?
I thought the current sanitization job was sanitizing everything that is inside the event database.
My idea was that teams should just create a patch for EL sanitization white-list, to add the new table/fields to be kept indefinitely.
Then we would review it and merge.
Am I missing sth?

Feb 9 2021, 1:17 PM · Patch-For-Review, Analytics-Kanban, Event-Platform, Analytics

Feb 8 2021

mforns added a comment to T273741: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons.

If it's an app, it would need to be very popular.
Maybe Aarogya Setu, the app for reducing Covid infections?
IIUC it's mandatory in India.

Feb 8 2021, 7:04 PM · Commons, Traffic, SRE
mforns added a comment to T273741: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons.

The crazy request volume starts in July 2020:
https://pageviews.toolforge.org/mediaviews/?project=commons.wikimedia.org&platform=&referer=all-referers&start=2020-01-01&end=2020-12-31&files=AsterNovi-belgii-flower-1mb.jpg

Feb 8 2021, 6:00 PM · Commons, Traffic, SRE
mforns added a comment to T273789: Sanitize and ingest all event tables into the event_sanitized database.

@Ottomata
For new streams, shouldn't the stream owners work on this?

Feb 8 2021, 4:54 PM · Patch-For-Review, Analytics-Kanban, Event-Platform, Analytics

Feb 4 2021

mforns added a comment to T267348: PrefUpdate Event Platform Migration.

Starting this migration now!

Feb 4 2021, 11:28 AM · MW-1.36-notes (1.36.0-wmf.34; 2021-03-09), Patch-For-Review, Analytics, Product-Data-Infrastructure, Event-Platform
mforns placed T253393: Revamp analytics.wikimedia.org data portal & landing page up for grabs.
Feb 4 2021, 11:17 AM · Epic, Product-Analytics, Analytics

Feb 3 2021

mforns added a comment to T272069: [Spike] What should our sampling strategy be for session_tick?.

I still think that we can take 1/10 sampling rate, but nevertheless it would be nice to start at 1/100, just for 1 or 2 days, to be safe... If my projections are *not* correct, this might have the potential to collapse parts of the data collection/processing system. But, as @mpopov said, the good news is that we can change the production sampling rate with just a MediaWiki-config change; we don't need to wait for a full MediaWiki deployment train. So, tl;dr, yes 1/10, but let's roll it out progressively.

Feb 3 2021, 8:54 PM · Product-Analytics (Kanban), Product-Data-Infrastructure, Better Use Of Data
mforns added a comment to T273789: Sanitize and ingest all event tables into the event_sanitized database.

mediawiki_client_session_tick, IIUC, is not supposed to be kept indefinitely. Instead, we want to keep its aggregated/sessionized intermediate table that will power analytical queries through Hive/Presto/Superset. See: T273116

Feb 3 2021, 7:53 PM · Patch-For-Review, Analytics-Kanban, Event-Platform, Analytics
mforns moved T273116: Create Oozie job for session length from Next Up to In Progress on the Analytics-Kanban board.
Feb 3 2021, 7:53 PM · Analytics-Kanban, Better Use Of Data, Analytics
mforns moved T271568: Follow up on Druid alarms not firing when Druid indexations were failing due to permission issues from Ready to Deploy to Done on the Analytics-Kanban board.
Feb 3 2021, 7:53 PM · Analytics-Kanban, Analytics
mforns added a comment to T273313: [SessionLength] SessionLength Documentation.

Heya! I refactored the session length doc in Wikitech, and updated it.
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/SessionLength
I tried to introduce some structure but still keep everything that was already there.
Also, I added everything that came to my mind, but it will surely be incomplete and biased.
So, please, feel free to correct/add/remove/suggest!
Also, let me know if you'd like to change the location of the page.

Feb 3 2021, 3:40 PM · Better Use Of Data

Feb 2 2021

mforns added a comment to T273246: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage.

Please, add the schemas (and fields) that you want to be kept indefinitely to the include-list in the refinery repository under static-data/eventlogging/whitelist.yaml. You can create a Gerrit patch with those changes, and add any of us Analytics as a reviewer (you can add me for this one). Maybe this documentation can help you guys decide which fields to keep and discard. We Analytics will also review, and let you know if we see any issues.

Feb 2 2021, 2:52 PM · Fundraising-Backlog, FR-Tech-Analytics, Analytics

Feb 1 2021

mforns added a comment to T272973: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance.

Awesome summary @elukey Thanks!

Feb 1 2021, 9:04 PM · Analytics-Kanban, Analytics

Jan 28 2021

mforns created T273216: Druid loading of navigationtiming gets stuck.
Jan 28 2021, 6:53 PM · Analytics
mforns created T273215: Filter out <lang>.wikidata requests from pageview definition.
Jan 28 2021, 6:51 PM · Analytics
mforns moved T271568: Follow up on Druid alarms not firing when Druid indexations were failing due to permission issues from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Jan 28 2021, 4:28 PM · Analytics-Kanban, Analytics
mforns moved T271568: Follow up on Druid alarms not firing when Druid indexations were failing due to permission issues from Next Up to In Code Review on the Analytics-Kanban board.
Jan 28 2021, 4:27 PM · Analytics-Kanban, Analytics
mforns added a project to T271568: Follow up on Druid alarms not firing when Druid indexations were failing due to permission issues: Analytics-Kanban.
Jan 28 2021, 4:27 PM · Analytics-Kanban, Analytics
mforns moved T272741: Superset presto error: Failed to list directory: hdfs://analytics-hadoop/wmf/data/event/... from In Code Review to Done on the Analytics-Kanban board.
Jan 28 2021, 4:26 PM · Analytics-Kanban, Product-Analytics, Analytics

Jan 27 2021

mforns created T273116: Create Oozie job for session length.
Jan 27 2021, 9:15 PM · Analytics-Kanban, Better Use Of Data, Analytics
mforns updated the task description for T271164: DesktopWebUIActionsTracking Event Platform Migration.
Jan 27 2021, 9:08 PM · MW-1.36-notes (1.36.0-wmf.34; 2021-03-09), Analytics-Kanban, Patch-For-Review, Analytics, Event-Platform
mforns updated the task description for T267347: MobileWebUIActionsTracking Event Platform Migration.
Jan 27 2021, 9:08 PM · Patch-For-Review, Analytics-Kanban, Event-Platform, Analytics
mforns updated the task description for T271208: NavigationTiming Extension schemas Event Platform Migration.
Jan 27 2021, 9:08 PM · MW-1.36-notes (1.36.0-wmf.29; 2021-02-02), Analytics-Kanban, Patch-For-Review, Analytics-EventLogging, Performance-Team, Event-Platform, Analytics
mforns updated the task description for T271208: NavigationTiming Extension schemas Event Platform Migration.
Jan 27 2021, 4:58 PM · MW-1.36-notes (1.36.0-wmf.29; 2021-02-02), Analytics-Kanban, Patch-For-Review, Analytics-EventLogging, Performance-Team, Event-Platform, Analytics
mforns added a comment to T271568: Follow up on Druid alarms not firing when Druid indexations were failing due to permission issues.

After some tests, I think the problem lies in the code:

if (spark.conf.get("spark.master") != "yarn") {
    sys.exit(if (success) 0 else 1)
}

When we execute with master=yarn (which is the prod setting) the job will not return any exit code, even if the driver is an-launcher1002 (deployMode=client).
Changed this snippet to:

if (spark.conf.get("spark.master") != "yarn" ||
    spark.conf.get("spark.submit.deployMode") == "client") {
    sys.exit(if (success) 0 else 1)
}

Tested it, and seems to do the trick.
What I'm not aware of is, why can't we always return the exit code?
Maybe we can discuss this on the CR.
Cheers!
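
For comparison, the two conditions can be written side by side. The sketch below is Python rather than the job's Scala, and the function names are made up; it only models the boolean logic of the two snippets above, not Spark itself.

```python
def exits_with_code_before(master):
    """Original condition: propagate the exit code only when not running in YARN."""
    return master != "yarn"


def exits_with_code_after(master, deploy_mode):
    """Changed condition: also propagate the exit code in YARN client mode."""
    return master != "yarn" or deploy_mode == "client"


# The production setting described above: master=yarn, deployMode=client.
print(exits_with_code_before("yarn"))           # prints False
print(exits_with_code_after("yarn", "client"))  # prints True
```

Only the yarn/client combination changes behaviour; yarn with cluster deploy mode still returns no exit code under both conditions.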

Jan 27 2021, 4:19 PM · Analytics-Kanban, Analytics

Jan 26 2021

mforns claimed T271568: Follow up on Druid alarms not firing when Druid indexations were failing due to permission issues.

Weird...
In puppet, eventlogging_to_druid_job.pp uses $deploy_mode = 'client'.
Plus, HiveToDruid has:

// Exit with proper code only if not running in YARN.
if (spark.conf.get("spark.master") != "yarn") {
    sys.exit(if (success) 0 else 1)
}

Plus DataFrameToDruid (called by HiveToDruid) seems to be propagating the error correctly.
It doesn't seem like something we forgot to set up, but rather a bug.
Will look more into it.

Jan 26 2021, 3:11 PM · Analytics-Kanban, Analytics
mforns moved T271208: NavigationTiming Extension schemas Event Platform Migration from Next Up to In Progress on the Analytics-Kanban board.
Jan 26 2021, 2:49 PM · MW-1.36-notes (1.36.0-wmf.29; 2021-02-02), Analytics-Kanban, Patch-For-Review, Analytics-EventLogging, Performance-Team, Event-Platform, Analytics
mforns added a project to T271208: NavigationTiming Extension schemas Event Platform Migration: Analytics-Kanban.
Jan 26 2021, 2:49 PM · MW-1.36-notes (1.36.0-wmf.29; 2021-02-02), Analytics-Kanban, Patch-For-Review, Analytics-EventLogging, Performance-Team, Event-Platform, Analytics
mforns added a comment to T272741: Superset presto error: Failed to list directory: hdfs://analytics-hadoop/wmf/data/event/....

@kzimmerman I've checked and I could not find your username (kzeta right?) in the analytics-privatedata-users group.
That's probably why you can not access the session length data.
We should add you there. Created a task: T272982

Jan 26 2021, 2:32 PM · Analytics-Kanban, Product-Analytics, Analytics
mforns created T272982: Add kzeta to analytics-privatedata-users.
Jan 26 2021, 2:32 PM · SRE, SRE-Access-Requests, Analytics

Jan 25 2021

mforns updated the task description for T271208: NavigationTiming Extension schemas Event Platform Migration.
Jan 25 2021, 10:14 PM · MW-1.36-notes (1.36.0-wmf.29; 2021-02-02), Analytics-Kanban, Patch-For-Review, Analytics-EventLogging, Performance-Team, Event-Platform, Analytics
mforns updated the task description for T267347: MobileWebUIActionsTracking Event Platform Migration.
Jan 25 2021, 8:56 PM · Patch-For-Review, Analytics-Kanban, Event-Platform, Analytics
mforns updated the task description for T271164: DesktopWebUIActionsTracking Event Platform Migration.
Jan 25 2021, 8:56 PM · MW-1.36-notes (1.36.0-wmf.34; 2021-03-09), Analytics-Kanban, Patch-For-Review, Analytics, Event-Platform
mforns updated the task description for T267351: SuggestedTagsAction Event Platform Migration.
Jan 25 2021, 8:23 PM · MW-1.36-notes (1.36.0-wmf.34; 2021-03-09), Patch-For-Review, Analytics-Kanban, Structured-Data-Backlog, Event-Platform, Analytics
mforns moved T272177: Some refined events folders contain no data while they should from Next Up to Done on the Analytics-Kanban board.
Jan 25 2021, 7:44 PM · Analytics-Kanban, Event-Platform, Analytics
mforns moved T272741: Superset presto error: Failed to list directory: hdfs://analytics-hadoop/wmf/data/event/... from Next Up to In Code Review on the Analytics-Kanban board.
Jan 25 2021, 7:43 PM · Analytics-Kanban, Product-Analytics, Analytics
mforns added a comment to T272741: Superset presto error: Failed to list directory: hdfs://analytics-hadoop/wmf/data/event/....

Please, could you pass these requirements on to data set creators?

Jan 25 2021, 7:34 PM · Analytics-Kanban, Product-Analytics, Analytics

Jan 21 2021

mforns moved T271455: Roll-up raw sessionTick data into distribution from Next Up to In Progress on the Analytics-Kanban board.
Jan 21 2021, 5:54 PM · Analytics-Radar, Product-Data-Infrastructure, Product-Analytics, Better Use Of Data