
mforns (Marcel Ruiz Forns)
Software Engineer @ Analytics

User Details

User Since
Nov 7 2014, 8:52 PM (262 w, 2 d)
Availability
Available
IRC Nick
mforns
LDAP User
Mforns
MediaWiki User
Unknown

Recent Activity

Fri, Nov 15

mforns added a comment to T238389: Rerun pingback reports to categorize software versions correctly..

Thanks for creating this task @CCicalese_WMF! I completely forgot about the reruns.
I already deleted the data that needs rerun. Reportupdater will start computing the missing data in the next hour.
As there are lots of data points to rerun and the queries are non-trivial, this computation might take some days, possibly 1 week.
Let's close this task only when all data is visible and OK in the Dashboard :]
Cheers!

Fri, Nov 15, 3:35 PM · Analytics

Thu, Nov 14

mforns added a comment to T234484: Add data quality metric: traffic variations per country.

@ssingh
I'm trying to match a first draft of the traffic_per_country metric with the outage data that you put together.
Also, it would be great to have examples of false positives, to do the same and see how the metric behaves in such cases.
Do you have any false positive examples that I can use?
Thanks a lot!

Thu, Nov 14, 3:27 PM · Patch-For-Review, Research, Analytics-Kanban, Analytics

Tue, Nov 12

mforns moved T234484: Add data quality metric: traffic variations per country from Next Up to In Code Review on the Analytics-Kanban board.
Tue, Nov 12, 4:05 PM · Patch-For-Review, Research, Analytics-Kanban, Analytics
mforns added a comment to T234484: Add data quality metric: traffic variations per country.

I added a first draft of the metric to the data quality pipeline (see patch above), and added a chart to the data quality dashboard in Superset.
https://superset.wikimedia.org/superset/dashboard/73/

Tue, Nov 12, 3:52 PM · Patch-For-Review, Research, Analytics-Kanban, Analytics

Fri, Nov 8

mforns added a comment to T234484: Add data quality metric: traffic variations per country.

Hi all! One idea that maybe can reduce false positives when there are traffic peaks for any given reason.

Fri, Nov 8, 9:20 PM · Patch-For-Review, Research, Analytics-Kanban, Analytics
mforns added a comment to T229674: Set up automatic deletion for netflow datasource in Druid.

Yes, @Nuria, the data starts on the 17th of August, so we can merge at the end of next week. Or on Monday the 18th? There's a better chance of having an ops person to merge the change then.

Fri, Nov 8, 8:44 PM · Patch-For-Review, Analytics-Kanban, Analytics

Thu, Nov 7

mforns added a comment to T235486: Hive data quality alarms pipeline.

It's still backfilling, will take a couple days.

Thu, Nov 7, 8:31 PM · Patch-For-Review, Analytics, Analytics-Kanban
mforns moved T235486: Hive data quality alarms pipeline from In Progress to In Code Review on the Analytics-Kanban board.
Thu, Nov 7, 7:26 PM · Patch-For-Review, Analytics, Analytics-Kanban
mforns added a comment to T235486: Hive data quality alarms pipeline.

Here's the data quality dashboard in Superset:
https://superset.wikimedia.org/superset/dashboard/73/

Thu, Nov 7, 7:26 PM · Patch-For-Review, Analytics, Analytics-Kanban
mforns updated the task description for T235486: Hive data quality alarms pipeline.
Thu, Nov 7, 7:25 PM · Patch-For-Review, Analytics, Analytics-Kanban

Tue, Nov 5

mforns added a comment to T233891: Drop Navigationtiming data entirely from mysql storage? .

EventLogging data was first enabled in Hive on 2017-11-20T19:00:00Z.
I believe that is why we have only partial data for 2017; the proportion seems to match.
We could theoretically drop data with timestamp >= '20171120190000'.
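
For reference, a minimal sketch of how the cutoff string above relates to the ISO date. The helper name is hypothetical; it only assumes the 14-digit YYYYMMDDHHMMSS format used in the line above, and it is not taken from any existing script.

```python
from datetime import datetime, timezone


def to_mediawiki_timestamp(iso_utc: str) -> str:
    """Convert an ISO 8601 UTC string (e.g. 2017-11-20T19:00:00Z) to YYYYMMDDHHMMSS."""
    parsed = datetime.strptime(iso_utc, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y%m%d%H%M%S")


# Hypothetical helper: the cutoff for rows that would also exist in Hive.
cutoff = to_mediawiki_timestamp("2017-11-20T19:00:00Z")
print(cutoff)
```

Because the format is fixed-width, plain string comparison against this cutoff orders rows the same way as comparing the dates would.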

Tue, Nov 5, 4:19 PM · Performance-Team (Radar), Analytics-Kanban, Analytics, Analytics-EventLogging

Mon, Nov 4

mforns added a comment to T236818: Rerun sanitization before archiving eventlogging mysql data .

@elukey, LGTM!

Mon, Nov 4, 5:42 PM · Analytics-Kanban, Analytics, Analytics-EventLogging
mforns added a comment to T237047: Update data-purge for processed mediawiki_wikitext_history (6 snapshot kept, 3 would be sufficient).

@JAllemandou

Change proposal: Remove the lists from https://github.com/wikimedia/analytics-refinery/blob/master/bin/refinery-drop-mediawiki-snapshots and pass them as parameters.
Having this would allow us to have different jobs for different retention times.
@mforns Thoughts?

Yes, definitely :]

Mon, Nov 4, 5:18 PM · Analytics
mforns added a comment to T237072: Correct namespace zero editor counts on geoeditors_monthly table on hive and druid.

This retention period feels small!! Is the deletion scheme deleting more than expected?

Mon, Nov 4, 4:13 PM · Product-Analytics, Analytics, Patch-For-Review, Analytics-Kanban

Fri, Nov 1

mforns added a comment to T235494: Create a reports directory under analytics.wikimedia.org.

@Nuria
The reports in stat1006 are the ones created by querying MySQL.
The new, updated ones are in stat1007, with all the other Hive reports, and already have systemd timers running them in puppet.
I left the old report files in stat1006 for safety. But if you think it's confusing or @CCicalese_WMF prefers to drop them, I can delete them.

Fri, Nov 1, 11:24 PM · Patch-For-Review, Analytics-Kanban, Analytics, Product-Analytics

Thu, Oct 31

mforns added a comment to T235494: Create a reports directory under analytics.wikimedia.org.

@Ottomata I think Dashiki dashboards did not like the rename:
https://pingback.wmflabs.org

Thu, Oct 31, 4:34 PM · Patch-For-Review, Analytics-Kanban, Analytics, Product-Analytics

Wed, Oct 30

mforns added a comment to T236687: Check Avro as potential better file format for wikitext-history.

I think the rlike comparison does not require the matched string to start with the pattern, unless you use ^.
The like comparison, however, does.
So the extra counts in the rlike results may be records whose revision text has something before #redirect.
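
A rough illustration of the difference, emulated in Python rather than run in Hive. The sample texts and helper names are made up, and the actual patterns used in the queries may differ; the point is only anchored vs. unanchored matching.

```python
import re


def like_prefix(text: str, prefix: str) -> bool:
    # Emulates SQL `text LIKE 'prefix%'`: the match must start at the first character.
    return text.startswith(prefix)


def rlike(text: str, pattern: str) -> bool:
    # Emulates an unanchored regex match: the pattern may match anywhere in the text.
    return re.search(pattern, text) is not None


# Made-up revision texts, for illustration only.
texts = [
    "#redirect [[Foo]]",
    "  #redirect [[Bar]]",
    "See also #redirect notes",
    "plain text",
]

like_count = sum(like_prefix(t, "#redirect") for t in texts)
rlike_count = sum(rlike(t, "#redirect") for t in texts)
anchored_count = sum(rlike(t, "^#redirect") for t in texts)
print(like_count, rlike_count, anchored_count)
```

In this toy data the unanchored regex counts more rows than the prefix match, and adding ^ brings the two back in line.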

Wed, Oct 30, 4:58 PM · Analytics-Kanban, Analytics

Tue, Oct 29

mforns added a comment to T231858: Archive data on eventlogging MySQL to analytics replica before decomisioning .

@Ottomata @Nuria
Another solution is archiving only the sanitized data.
I don't think having the last 90 days of data in the backup is critical, no?

Tue, Oct 29, 7:01 PM · Analytics-Kanban, Analytics, Analytics-EventLogging
mforns added a comment to T231858: Archive data on eventlogging MySQL to analytics replica before decomisioning .

In T231858#5612967, @Ottomata wrote:
Ahhh ok right. Right.

sanitize data in the log databases

This shouldn't be necessary, right? IIUC, data should be sanitized already?

I need to check with @mforns but my understanding is that the last 90d of events are not sanitized yet..

Tue, Oct 29, 3:54 PM · Analytics-Kanban, Analytics, Analytics-EventLogging

Mon, Oct 28

mforns updated the task description for T235486: Hive data quality alarms pipeline.
Mon, Oct 28, 7:10 PM · Patch-For-Review, Analytics, Analytics-Kanban
mforns updated the task description for T235486: Hive data quality alarms pipeline.
Mon, Oct 28, 7:09 PM · Patch-For-Review, Analytics, Analytics-Kanban
mforns added a comment to T220410: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list.

@mforns it looks like the subtasks we had for this have all been resolved; can this be closed?

Yes!
And thank you *a lot* @kzimmerman, @nettrom_WMF and all others for taking care of this annoying task.

Mon, Oct 28, 7:00 PM · Product-Analytics, Analytics

Thu, Oct 24

mforns added a comment to T226861: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Community Tech.

The white-list that we use as the single source of truth for which EL data gets sanitized and kept indefinitely is:
https://github.com/wikimedia/analytics-refinery/blob/master/static_data/eventlogging/whitelist.yaml
There you can maybe find your schemas by name?

Thu, Oct 24, 9:11 PM · Product-Analytics (Kanban), Community-Tech, Analytics
mforns added a comment to T223414: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop.

BTW @CCicalese_WMF
I also added the MediaWiki version 1.32 to the PHP drilldown Dashboard, which was missing:
https://meta.wikimedia.org/w/index.php?title=Config%3ADashiki%3APingback&type=revision&diff=19488635&oldid=17828094

Thu, Oct 24, 2:03 PM · Patch-For-Review, Analytics-Kanban, Analytics, Analytics-EventLogging
mforns added a comment to T223414: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop.

All data from MySQL:log.MediaWikiPingback_* is now present in Hive:event_sanitized.mediawikipingback.
Also, all pingback reportupdater-queries have been completely migrated to Hive and tested.

Thu, Oct 24, 1:58 PM · Patch-For-Review, Analytics-Kanban, Analytics, Analytics-EventLogging
mforns moved T223414: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop from In Progress to Done on the Analytics-Kanban board.
Thu, Oct 24, 1:49 PM · Patch-For-Review, Analytics-Kanban, Analytics, Analytics-EventLogging
mforns committed rARPQacae319244cd: Fix bug in pingback/php_drilldown (authored by mforns).
Fix bug in pingback/php_drilldown
Thu, Oct 24, 7:51 AM

Wed, Oct 23

mforns committed rARPQ074c87e72b14: Correct and optimize pingback reports (authored by mforns).
Correct and optimize pingback reports
Wed, Oct 23, 4:20 PM

Oct 17 2019

mforns added a comment to T235718: Puppetize reportupdater to run wmcs reports.

@srishakatux This is the puppet patch.
When it gets merged by one of our ops people, RU will start to execute your job periodically.
You'll find the resulting report in stat1007.eqiad.wmnet:/srv/reportupdater/output/metrics/wmcs.
Remember the first data-point corresponds to 2019-10-01, so it will be calculated on 2019-11-02 (1 day lag).
Cheers!

Oct 17 2019, 5:37 AM · Analytics, Developer-Advocacy (Oct-Dec 2019), Cloud-Services

Oct 16 2019

mforns added a comment to T232671: Use Reportupdater for WMCS edits queries.

Sorry @srishakatux I didn't notice in the code review that the report name in the config does not match the file name of the script.
They should match for RU to work. Could you please fix that in a new patch?
Thanks a lot.

Oct 16 2019, 8:45 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
mforns moved T235268: HivePartition (refinery::Hive.py) does not allow partition values to have dots (.) from Ready to Deploy to Done on the Analytics-Kanban board.
Oct 16 2019, 8:05 PM · Analytics-Kanban, Analytics
mforns moved T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics from In Progress to Done on the Analytics-Kanban board.
Oct 16 2019, 8:05 PM · Patch-For-Review, Analytics-Kanban, Analytics
mforns added a comment to T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics.

The docs are here: https://wikitech.wikimedia.org/wiki/Analytics/Data_quality/Entropy_alarms
Moving task to Done.

Oct 16 2019, 8:04 PM · Patch-For-Review, Analytics-Kanban, Analytics
mforns added a comment to T223414: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop.

Cool!
I will then copy the data over to Hive and rerun the queries to correct inconsistencies in the reports and dashboards.
I will also create a task regarding the other thoughts we discussed.
Thanks a lot for the inputs!

Oct 16 2019, 6:53 PM · Patch-For-Review, Analytics-Kanban, Analytics, Analytics-EventLogging
mforns added a comment to T223414: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop.

Totally understand your point.

Oct 16 2019, 6:28 PM · Patch-For-Review, Analytics-Kanban, Analytics, Analytics-EventLogging
mforns added a comment to T223414: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop.

If you'd be fine with the solution in my prior comment, I could apply it to the mysql database for the 2017 data, then copy the reports over to stat1007 and complete them by querying Hive.
This way you'd have complete data since 2017-04 and we'd solve the query performance caveat. Plus, it would be easier for us in Analytics to migrate it that way, rather than importing mysql data to Hive.
What do you think?

Oct 16 2019, 1:20 PM · Patch-For-Review, Analytics-Kanban, Analytics, Analytics-EventLogging

Oct 15 2019

mforns moved T234588: Add data quality metric: distribution of eventlogging user agents from In Progress to Done on the Analytics-Kanban board.
Oct 15 2019, 7:32 PM · Analytics-Kanban, Analytics
mforns added a comment to T234588: Add data quality metric: distribution of eventlogging user agents.

Ok, makes sense. I guess, this scenario where we want to calculate secondary data quality metrics that help troubleshoot, or we want to first plot some new data quality metrics before we decide whether to alarm on them or not, is going to happen again. So, the data quality pipeline should probably have an option to activate/deactivate alarming on given metrics. Will take this into account when implementing that.
In the meantime, as the calculation of the useragent entropy metrics is finished, I will move this task to done.

Oct 15 2019, 7:31 PM · Analytics-Kanban, Analytics
mforns added a comment to T223414: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop.

The trends over time are one of the important aspects of the pingback data. It would be good to import the missing data from 2017-04 to 2017-11 into hive so we have coverage from the beginning of the period in which we began to collect that data.

I understand.

Oct 15 2019, 7:25 PM · Patch-For-Review, Analytics-Kanban, Analytics, Analytics-EventLogging
mforns added a comment to T234588: Add data quality metric: distribution of eventlogging user agents.

@Nuria Should I remove the rest of the useragent metrics from the data quality Oozie pipeline then?

Oct 15 2019, 3:35 PM · Analytics-Kanban, Analytics
mforns moved T234870: Help panel: delete sanitized data from before Oct 1 from Next Up to Done on the Analytics-Kanban board.
Oct 15 2019, 3:34 PM · Analytics-Kanban, Analytics, Product-Analytics, Growth-Team (Current Sprint)
mforns claimed T234870: Help panel: delete sanitized data from before Oct 1.
Oct 15 2019, 3:34 PM · Analytics-Kanban, Analytics, Product-Analytics, Growth-Team (Current Sprint)
mforns added a comment to T234870: Help panel: delete sanitized data from before Oct 1.

I have deleted all data directories and Hive partitions for event_sanitized.helppanel up to Oct 1st 2019 (not included).
I checked that the table looks good, but please ping us if you find any inconsistency.
The deleted data will stay in Hadoop's trash folder for a couple weeks, in case you want to recover something, then will be automatically deleted.
Cheers!

Oct 15 2019, 3:32 PM · Analytics-Kanban, Analytics, Product-Analytics, Growth-Team (Current Sprint)
mforns moved T235486: Hive data quality alarms pipeline from Next Up to In Progress on the Analytics-Kanban board.
Oct 15 2019, 10:30 AM · Patch-For-Review, Analytics, Analytics-Kanban
mforns added a comment to T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics.

I still need to post the results of the proof of concept to Wikitech.

Oct 15 2019, 10:30 AM · Patch-For-Review, Analytics-Kanban, Analytics
mforns added a comment to T234588: Add data quality metric: distribution of eventlogging user agents.

What's missing in this task is to decide which of the 4 metrics (one or more) considered for the proof of concept of entropy-based quality metrics we are going to choose to productionize as a data quality alarm.

Oct 15 2019, 10:29 AM · Analytics-Kanban, Analytics
mforns moved T234588: Add data quality metric: distribution of eventlogging user agents from Next Up to In Progress on the Analytics-Kanban board.
Oct 15 2019, 10:27 AM · Analytics-Kanban, Analytics
mforns removed a subtask for T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics: T235483: Add data quality metric: distribution of page view article titles.
Oct 15 2019, 10:26 AM · Patch-For-Review, Analytics-Kanban, Analytics
mforns added a subtask for T198986: Data Quality Alarms : T235483: Add data quality metric: distribution of page view article titles.
Oct 15 2019, 10:25 AM · Analytics-Kanban, Analytics
mforns edited parent tasks for T235483: Add data quality metric: distribution of page view article titles, added: T198986: Data Quality Alarms ; removed: T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics.
Oct 15 2019, 10:25 AM · Analytics
mforns removed a subtask for T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics: T234588: Add data quality metric: distribution of eventlogging user agents.
Oct 15 2019, 10:25 AM · Patch-For-Review, Analytics-Kanban, Analytics
mforns added a subtask for T198986: Data Quality Alarms : T234588: Add data quality metric: distribution of eventlogging user agents.
Oct 15 2019, 10:25 AM · Analytics-Kanban, Analytics
mforns edited parent tasks for T234588: Add data quality metric: distribution of eventlogging user agents, added: T198986: Data Quality Alarms ; removed: T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics.
Oct 15 2019, 10:25 AM · Analytics-Kanban, Analytics
mforns removed a subtask for T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics: T234484: Add data quality metric: traffic variations per country.
Oct 15 2019, 10:25 AM · Patch-For-Review, Analytics-Kanban, Analytics
mforns added a subtask for T198986: Data Quality Alarms : T234484: Add data quality metric: traffic variations per country.
Oct 15 2019, 10:25 AM · Analytics-Kanban, Analytics
mforns edited parent tasks for T234484: Add data quality metric: traffic variations per country, added: T198986: Data Quality Alarms ; removed: T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics.
Oct 15 2019, 10:25 AM · Patch-For-Review, Research, Analytics-Kanban, Analytics
mforns removed a subtask for T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics: T227357: Alarming scripts for entropy alarms. Anomaly detection and reporting..
Oct 15 2019, 10:24 AM · Patch-For-Review, Analytics-Kanban, Analytics
mforns removed a parent task for T227357: Alarming scripts for entropy alarms. Anomaly detection and reporting.: T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics.
Oct 15 2019, 10:24 AM · Analytics-Kanban, Analytics
mforns added a subtask for T198986: Data Quality Alarms : T235486: Hive data quality alarms pipeline.
Oct 15 2019, 10:23 AM · Analytics-Kanban, Analytics
mforns added a parent task for T235486: Hive data quality alarms pipeline: T198986: Data Quality Alarms .
Oct 15 2019, 10:23 AM · Patch-For-Review, Analytics, Analytics-Kanban
mforns merged T227357: Alarming scripts for entropy alarms. Anomaly detection and reporting. into T235486: Hive data quality alarms pipeline.
Oct 15 2019, 10:22 AM · Patch-For-Review, Analytics, Analytics-Kanban
mforns merged task T227357: Alarming scripts for entropy alarms. Anomaly detection and reporting. into T235486: Hive data quality alarms pipeline.
Oct 15 2019, 10:22 AM · Analytics-Kanban, Analytics
mforns renamed T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics from Coarse alarm on data quality for refined data based on entrophy calculations to Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics.
Oct 15 2019, 10:19 AM · Patch-For-Review, Analytics-Kanban, Analytics
mforns created T235486: Hive data quality alarms pipeline.
Oct 15 2019, 10:16 AM · Patch-For-Review, Analytics, Analytics-Kanban
mforns added a subtask for T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics: T235483: Add data quality metric: distribution of page view article titles.
Oct 15 2019, 10:05 AM · Patch-For-Review, Analytics-Kanban, Analytics
mforns added a parent task for T235483: Add data quality metric: distribution of page view article titles: T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics.
Oct 15 2019, 10:05 AM · Analytics
mforns renamed T234484: Add data quality metric: traffic variations per country from Add anomaly detection alarm to detect traffic variations on countries overall to Add data quality metric: traffic variations per country.
Oct 15 2019, 10:04 AM · Patch-For-Review, Research, Analytics-Kanban, Analytics
mforns created T235483: Add data quality metric: distribution of page view article titles.
Oct 15 2019, 10:03 AM · Analytics
mforns renamed T234588: Add data quality metric: distribution of eventlogging user agents from alarming based on anomaly detection: add alarm on distribution of user agents and/or pageview titles to Add data quality metric: distribution of eventlogging user agents.
Oct 15 2019, 10:00 AM · Analytics-Kanban, Analytics
mforns moved T198986: Data Quality Alarms from Next Up to Parent Tasks on the Analytics-Kanban board.
Oct 15 2019, 9:57 AM · Analytics-Kanban, Analytics
mforns added a project to T198986: Data Quality Alarms : Analytics-Kanban.
Oct 15 2019, 9:57 AM · Analytics-Kanban, Analytics

Oct 11 2019

mforns added a comment to T223414: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop.

@CCicalese_WMF Hi :]

Oct 11 2019, 7:13 PM · Patch-For-Review, Analytics-Kanban, Analytics, Analytics-EventLogging
mforns created T235283: Add partition pruning for wmf.browser_general and interlanguage.
Oct 11 2019, 4:28 PM · Analytics-Kanban, Analytics
mforns added a comment to T235278: browser dashboards not updated since 09/29.

Note that the pingback reports are still bad after the migration to Hive.
Not because of scheduling, though; rather, the Hive queries still do not do what the MySQL ones did before.
Working on that right now.

Oct 11 2019, 3:48 PM · Analytics
mforns added a comment to T235278: browser dashboards not updated since 09/29.

This is normal behavior I think.
In reportupdater, weekly reports span weeks starting on Sunday 00:00:00 and ending the following Saturday 23:59:59.
Also, each data-point is labeled with the start date of its week. That is, 2019-09-29 labels the week that starts on that date and ends at 2019-10-05T23:59:59.
The subsequent week (2019-10-06T00:00:00 -> 2019-10-12T23:59:59) has not been calculated yet, because it has not fully elapsed.
Also, usually reports have a couple hours or one day lag, to give time for the collected data to reach the database.
It is likely that the reports update early on Monday 2019-10-14.
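
To make the week arithmetic above concrete, here is a small sketch. It is not reportupdater's code; the function names are hypothetical, and it ignores the extra lag of a few hours or a day mentioned above.

```python
from datetime import date, datetime, timedelta


def week_start(day: date) -> date:
    # date.weekday() gives Monday=0 ... Sunday=6; shift so that Sunday opens the week.
    days_since_sunday = (day.weekday() + 1) % 7
    return day - timedelta(days=days_since_sunday)


def last_complete_week_label(now: datetime) -> date:
    # The week containing `now` is still open, so the latest complete week is the one before it.
    return week_start(now.date()) - timedelta(days=7)


now = datetime(2019, 10, 11, 15, 39)
print(week_start(now.date()))         # start of the week still in progress
print(last_complete_week_label(now))  # label of the newest data-point available
```

With the date of this comment as input, the newest label comes out as 2019-09-29, which matches what the dashboards show.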

Oct 11 2019, 3:39 PM · Analytics
mforns created T235269: MediaWiki history dumps have some events in 2025.
Oct 11 2019, 2:31 PM · Chinese-Sites, Analytics-Kanban, Analytics
mforns moved T223414: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop from Ready to Deploy to In Progress on the Analytics-Kanban board.
Oct 11 2019, 2:22 PM · Patch-For-Review, Analytics-Kanban, Analytics, Analytics-EventLogging
mforns moved T235268: HivePartition (refinery::Hive.py) does not allow partition values to have dots (.) from Next Up to In Code Review on the Analytics-Kanban board.
Oct 11 2019, 2:21 PM · Analytics-Kanban, Analytics
mforns created T235268: HivePartition (refinery::Hive.py) does not allow partition values to have dots (.).
Oct 11 2019, 2:19 PM · Analytics-Kanban, Analytics

Oct 10 2019

mforns moved T208612: Release edit data lake data as a public json dump /mysql dump, other? from Ready to Deploy to Done on the Analytics-Kanban board.
Oct 10 2019, 6:23 PM · Patch-For-Review, Analytics-Kanban, Research-Backlog, Analytics
mforns added a comment to T232671: Use Reportupdater for WMCS edits queries.

In T232671#5548529, @srishakatux wrote:
@JAllemandou Some update and questions on the reportupdater:

  • The docs says: “The first column must be equal to start_date parameter (consider naming it date). This is an unnecessary limitation and might be removed in the future...”. Do I need to follow this?

I don't know if the limitation is still valid or not - I have looked at some example queries (see repo in my next line) and the first column is always date ... @mforns can you tell us more?

Yes, this limitation is still there. The first column of the results should be the date in YYYY-MM-DD format.

Oct 10 2019, 2:47 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services

Oct 9 2019

mforns created T235112: dumps.wikimedia.org/other/mediawiki_history is missing some files.
Oct 9 2019, 4:37 PM · Analytics-Kanban, Analytics

Oct 8 2019

mforns moved T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics from Paused to In Progress on the Analytics-Kanban board.
Oct 8 2019, 3:24 PM · Patch-For-Review, Analytics-Kanban, Analytics
mforns moved T223414: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Oct 8 2019, 3:23 PM · Patch-For-Review, Analytics-Kanban, Analytics, Analytics-EventLogging

Oct 7 2019

mforns moved T223414: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop from In Progress to In Code Review on the Analytics-Kanban board.
Oct 7 2019, 5:50 PM · Patch-For-Review, Analytics-Kanban, Analytics, Analytics-EventLogging
mforns committed rARPQ2d20fdd98562: Migrate reports from MySQL EventLogging to Hive (authored by mforns).
Migrate reports from MySQL EventLogging to Hive
Oct 7 2019, 3:51 PM

Oct 3 2019

mforns updated subscribers of T223414: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop.

Hi @CCicalese_WMF and @santhosh

Oct 3 2019, 8:04 PM · Patch-For-Review, Analytics-Kanban, Analytics, Analytics-EventLogging
mforns added a comment to T234484: Add data quality metric: traffic variations per country.

I think the query that defines this quality metric should just forward the absolute value of pageviews per country.
I think normal+stddev is not needed there, because the anomaly detection algorithm should take care of that, no?

Oct 3 2019, 5:36 PM · Patch-For-Review, Research, Analytics-Kanban, Analytics

Sep 27 2019

mforns moved T131280: Make aggregate data on editors per country per wiki publicly available from Parent Tasks to In Progress on the Analytics-Kanban board.
Sep 27 2019, 2:29 PM · Product-Analytics, Analytics-Kanban
mforns moved T131280: Make aggregate data on editors per country per wiki publicly available from In Progress to Parent Tasks on the Analytics-Kanban board.
Sep 27 2019, 2:29 PM · Product-Analytics, Analytics-Kanban

Sep 24 2019

mforns added a comment to T229682: Add more dimensions to netflow's druid ingestion specs.

Great!

Sep 24 2019, 3:16 PM · Analytics-Kanban, Analytics
mforns added a comment to T208612: Release edit data lake data as a public json dump /mysql dump, other?.

Cool!
Thanks @Bstorm

Sep 24 2019, 2:16 PM · Patch-For-Review, Analytics-Kanban, Research-Backlog, Analytics

Sep 23 2019

mforns added a comment to T208612: Release edit data lake data as a public json dump /mysql dump, other?.

@ArielGlenn thanks for chiming in!

Sep 23 2019, 6:23 PM · Patch-For-Review, Analytics-Kanban, Research-Backlog, Analytics
mforns added a comment to T226698: Allow all Analytics tools to work with Kerberos auth.

@elukey

Eventually the data will be accessible via datasets.w.o or dumps.w.o right?

Yes, from dumps.w.o

Sep 23 2019, 2:56 PM · Patch-For-Review, Analytics-Kanban, User-Elukey, Analytics

Sep 20 2019

mforns added a comment to T214093: Modern Event Platform: Schema Guidelines and Conventions.

@Ottomata Hm indeed...
Specifying sanitization for the whole map can be dangerous, because a non-sensitive map field marked as 'keep' could gain a sensitive sub-field in the future, and that sub-field would be 'kept' by default.
How will it be with geocoded_data and user_agent fields? Are they structs or maps in MEP? Some EL schemas do specify to keep parts of them.
We could maybe do:

geocoded_data:
  type: object
  additionalProperties:
    type: string
  annotations:
    sanitize:
        country: keep
        state: hash
        blah:
            foo: keep
            bar: hash
    olap_properties: [country:dimension,state:dimension]
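
A rough sketch of how per-key rules like the ones above could be applied, so that only listed keys survive. This is not the actual sanitization code: the function name, the sample event, and the use of unsalted SHA-256 for 'hash' are all assumptions for illustration.

```python
import hashlib


def sanitize(data: dict, rules: dict) -> dict:
    """Keep or hash only the keys listed in `rules`; drop everything else."""
    result = {}
    for key, rule in rules.items():
        if key not in data:
            continue
        value = data[key]
        if isinstance(rule, dict):
            # Nested rules apply to nested maps; a non-map value is dropped.
            if isinstance(value, dict):
                result[key] = sanitize(value, rule)
        elif rule == "keep":
            result[key] = value
        elif rule == "hash":
            # Assumption: plain SHA-256 without salt, for illustration only.
            result[key] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return result


rules = {"country": "keep", "state": "hash", "blah": {"foo": "keep", "bar": "hash"}}
event = {
    "country": "ES",
    "state": "Catalonia",
    "city": "Barcelona",
    "blah": {"foo": 1, "bar": "x", "new_sensitive": "secret"},
}
clean = sanitize(event, rules)
print(sorted(clean), sorted(clean["blah"]))
```

Since the rules are listed per key, a sub-field added later (new_sensitive here) is dropped unless someone explicitly adds it to the rules.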
Sep 20 2019, 7:40 PM · Analytics-Kanban, CPT Initiatives (Modern Event Platform (TEC2)), Analytics, Better Use Of Data, Patch-For-Review, Product-Analytics, Goal, Services (watching), Analytics-EventLogging, Event-Platform
mforns added a comment to T204735: Move the Analytics Refinery to Python 3.

@elukey
Reviewed https://gerrit.wikimedia.org/r/538235 and the fix makes sense to me!
There is a call to os.system returning a value, but I checked the python docs and it seems that both python2 and python3 return exactly the same value for os.system.

Sep 20 2019, 5:44 PM · Patch-For-Review, Analytics-Kanban, Analytics
mforns added a comment to T226698: Allow all Analytics tools to work with Kerberos auth.

@elukey
I'm not sure about the underlying magic, but what I've seen in other similar cases of data sets that need rsync and publication in analytics.wikimedia.org, is here:
https://github.com/wikimedia/puppet/blob/production/modules/dumps/manifests/web/fetches/stats.pp
I have added an analogous code block to that file, pointing to the MediaWiki history dumps.
I believe the source directory is mounted in stat1007, see: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/dumps/distribution/datasets/fetcher.pp#L18
I know miscdumpsdir is /srv/dumps/xmldatadumps/public/other but I'm not sure on which host. Shouldn't it be Thorium, from where analytics.wikimedia.org is served?

Sep 20 2019, 2:18 PM · Patch-For-Review, Analytics-Kanban, User-Elukey, Analytics
mforns added a comment to T226698: Allow all Analytics tools to work with Kerberos auth.

@elukey
Yes! MediaWiki history dumps need to be rsync'd from HDFS mount.

Sep 20 2019, 2:01 PM · Patch-For-Review, Analytics-Kanban, User-Elukey, Analytics