mforns (Marcel Ruiz Forns)
Software Engineer @ Analytics

User Details

User Since
Nov 7 2014, 8:52 PM (272 w, 3 d)
Availability
Available
IRC Nick
mforns
LDAP User
Mforns
MediaWiki User
Unknown

Recent Activity

Thu, Jan 23

mforns updated the task description for T235486: Hive data quality alarms pipeline.
Thu, Jan 23, 7:46 PM · Analytics, Analytics-Kanban
mforns moved T235486: Hive data quality alarms pipeline from In Code Review to Done on the Analytics-Kanban board.
Thu, Jan 23, 7:45 PM · Analytics, Analytics-Kanban
mforns moved T241375: The guava error still persists in data quality bundles from Next Up to In Code Review on the Analytics-Kanban board.
Thu, Jan 23, 7:45 PM · Patch-For-Review, Analytics-Kanban, Analytics

Wed, Jan 22

mforns added a comment to T241375: The guava error still persists in data quality bundles.

Thanks for the explanation Joseph.
Working on this right now.

Wed, Jan 22, 7:14 PM · Patch-For-Review, Analytics-Kanban, Analytics

Wed, Jan 15

mforns added a comment to T242870: Upgrade to Superset 0.35.2.
  • I see a new feature in 0.35.2: filter labels in the chart views of a dashboard. They tell you which params you can alter and when you click on them, they point you to the corresponding control. And they added some coloring. Seems cool!
  • They have changed the top menu order and the icon of the "Manage" option in the menu has disappeared.
  • Apart from this, I couldn't find anything broken or different.
Wed, Jan 15, 4:48 PM · User-Elukey, Better Use Of Data, Analytics-Kanban, Product-Analytics
mforns added a comment to T242870: Upgrade to Superset 0.35.2.

Will do!

Wed, Jan 15, 4:12 PM · User-Elukey, Better Use Of Data, Analytics-Kanban, Product-Analytics

Tue, Jan 14

mforns added a comment to T241375: The guava error still persists in data quality bundles.

Maven tree shows 4 different versions of guava:

  • 11.0.2 Used by json-schema-core.jackson-coreutils and CDH5.hadoop-common
  • 12.0 Used by reflection
  • 16.0.1 Used by hadoop-common.hadoop-auth.apache-curator and uri-template
  • 18.0 Specified in refinery-core's pom.xml

Of all those versions, the only one that does not implement the method com.google.common.base.Stopwatch.<init>() is 18.0.
If I understand correctly, the version compiled and included in the jar is 18.0 ([INFO] +- com.google.guava:guava:jar:18.0:compile).
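
As a rough illustration of the check above, the guava versions can be pulled out of `mvn dependency:tree` output with a few lines of Python. This is only a sketch: the helper name `guava_versions` and the sample text are hypothetical, and only the first sample line comes from the comment above.

```python
import re
from collections import defaultdict

# Matches coordinates such as "com.google.guava:guava:jar:18.0:compile".
GUAVA_COORD = re.compile(r"com\.google\.guava:guava:jar:([\w.\-]+):(\w+)")


def guava_versions(tree_output):
    """Map each guava version found in the tree text to the scopes it appears under."""
    found = defaultdict(set)
    for line in tree_output.splitlines():
        match = GUAVA_COORD.search(line)
        if match:
            version, scope = match.groups()
            found[version].add(scope)
    return dict(found)


# Hypothetical sample: the second line is made up for illustration.
sample = """\
[INFO] +- com.google.guava:guava:jar:18.0:compile
[INFO] |  +- org.example:some-lib:jar:1.0:compile
"""

print(guava_versions(sample))
```

If more than one version shows up in the result, the classpath probably needs a closer look.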

Tue, Jan 14, 10:39 PM · Patch-For-Review, Analytics-Kanban, Analytics
mforns added a project to T241375: The guava error still persists in data quality bundles: Analytics-Kanban.
Tue, Jan 14, 10:10 PM · Patch-For-Review, Analytics-Kanban, Analytics

Mon, Jan 13

mforns closed T242451: Wikistats Bug - Can't find stats for number of "Very Active" editors as Invalid.

The metric is there, but maybe not directly visible:
You have to select the "editors" metric, and then enable the split by "activity level".
https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/editors/normal|line|2-year|activity_level~100..-edits|monthly

Mon, Jan 13, 3:50 PM · Analytics, Analytics-Wikistats
mforns created T242621: [Wikistats2] Normalize pageviews per country by population.
Mon, Jan 13, 3:25 PM · Analytics

Fri, Jan 10

mforns moved T235486: Hive data quality alarms pipeline from In Progress to In Code Review on the Analytics-Kanban board.
Fri, Jan 10, 4:07 PM · Analytics, Analytics-Kanban

Thu, Jan 2

mforns added a comment to T237124: Growth: implement wider data purge window.

@nettrom_WMF
We enabled the deletion of the data for the 3 specified schemas: HelpPanel, HomepageVisit, HomepageModule.
No data has been deleted yet because all events are still less than 270 days old.
So, provided you have everything you want to keep in the sanitization white-list, I guess this task can be marked as done!
Cheers

Thu, Jan 2, 3:19 PM · Growth-Team, Patch-For-Review, Analytics, Product-Analytics
mforns created T241734: Pages with + character do not have pageviews for May 2019.
Thu, Jan 2, 2:56 PM · Analytics
mforns added a comment to T241375: The guava error still persists in data quality bundles.

@elukey There was no task, because this was treated as part of the initial task to develop the data quality metrics.
The fix we applied was to bump the oozie_spark_lib property to spark-2.4.4, and it seemed to reduce the frequency of this problem!
But it turns out it's still there.

Thu, Jan 2, 9:36 AM · Patch-For-Review, Analytics-Kanban, Analytics

Dec 23 2019

mforns created T241375: The guava error still persists in data quality bundles.
Dec 23 2019, 5:53 PM · Patch-For-Review, Analytics-Kanban, Analytics

Dec 19 2019

mforns moved T219446: Terminate Wikimetrics from In Progress to Done on the Analytics-Kanban board.
Dec 19 2019, 5:10 PM · Analytics-Kanban, Operations, Analytics

Dec 18 2019

mforns moved T235486: Hive data quality alarms pipeline from In Code Review to In Progress on the Analytics-Kanban board.
Dec 18 2019, 9:34 PM · Analytics, Analytics-Kanban
mforns moved T234484: Add data quality metric: traffic variations per country from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Dec 18 2019, 9:34 PM · Patch-For-Review, Research, Analytics-Kanban, Analytics
mforns moved T240815: Webrequest text fails to refine regularly from Next Up to Paused on the Analytics-Kanban board.
Dec 18 2019, 7:19 PM · Analytics-Kanban, Analytics
mforns claimed T240815: Webrequest text fails to refine regularly.
Dec 18 2019, 7:18 PM · Analytics-Kanban, Analytics
mforns added a comment to T240815: Webrequest text fails to refine regularly.

I checked whether the entropy UDAF indicates any unexpected changes in the webrequest fields that are used in webrequest.load.

=== USER AGENT ===
hour	entropy
11	10.54753394241449
12	10.56885036456681
13	10.494309570202567
14	10.475783511181957
15	10.41000334885259
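
For readers unfamiliar with the metric, here is a minimal sketch of Shannon entropy and of a simple stability check over the hourly values listed above. It is not the UDAF itself: the log base, the function names and the 2% threshold are assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(values, base=2.0):
    """Shannon entropy of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())


def max_relative_deviation(series):
    """Largest relative distance of any point from the mean of the series."""
    mean = sum(series) / len(series)
    return max(abs(x - mean) / mean for x in series)


# Hourly user-agent entropy values copied from the table above (hours 11 to 15).
user_agent_entropy = [
    10.54753394241449,
    10.56885036456681,
    10.494309570202567,
    10.475783511181957,
    10.41000334885259,
]

# Hypothetical threshold: flag the field if any hour deviates more than 2% from the mean.
THRESHOLD = 0.02
print(max_relative_deviation(user_agent_entropy) > THRESHOLD)
```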
Dec 18 2019, 5:21 PM · Analytics-Kanban, Analytics

Dec 9 2019

mforns triaged T194058: Sesssion reconstruction - evaluate privacy threat as Medium priority.
Dec 9 2019, 5:50 PM · Analytics
mforns raised the priority of T194058: Sesssion reconstruction - evaluate privacy threat from Medium to Needs Triage.
Dec 9 2019, 5:50 PM · Analytics
mforns added a comment to T194058: Sesssion reconstruction - evaluate privacy threat.

@Nuria is this something we want to tackle next year?

Dec 9 2019, 5:49 PM · Analytics
mforns lowered the priority of T193650: Reindex mediawiki_history_reduced with lookups from Medium to Lowest.
Dec 9 2019, 5:47 PM · Analytics, Analytics-Wikistats
mforns lowered the priority of T193174: [reportupdater] consider not requiring date as a first colum of query/script results from Medium to Low.
Dec 9 2019, 5:47 PM · Analytics
mforns lowered the priority of T193171: [reportupdater] Allow defaults for all config parameters from Medium to Low.
Dec 9 2019, 5:46 PM · good first task, Analytics
mforns lowered the priority of T193170: [reportupdater] eliminate the funnel parameter from Medium to Lowest.
Dec 9 2019, 5:46 PM · Analytics
mforns triaged T193169: [reportupdater] Add a configurable hive client as Medium priority.
Dec 9 2019, 5:46 PM · Analytics
mforns raised the priority of T193169: [reportupdater] Add a configurable hive client from Medium to Needs Triage.
Dec 9 2019, 5:46 PM · Analytics
mforns triaged T193167: reportupdater TLC as Medium priority.
Dec 9 2019, 5:46 PM · Analytics
mforns raised the priority of T193167: reportupdater TLC from Medium to Needs Triage.
Dec 9 2019, 5:46 PM · Analytics
mforns raised the priority of T190700: Automate creation of sqoop list of wikis to import data for from sitematrix from Medium to High.
Dec 9 2019, 5:44 PM · Analytics, Analytics-Wikistats
mforns triaged T189623: AQS edits API should not allow queries without time bounds as Medium priority.
Dec 9 2019, 5:31 PM · Analytics
mforns raised the priority of T189623: AQS edits API should not allow queries without time bounds from Medium to Needs Triage.
Dec 9 2019, 5:31 PM · Analytics
mforns triaged T189044: Mediawiki History: moves counted twice in Revision as Medium priority.
Dec 9 2019, 5:30 PM · Analytics
mforns raised the priority of T189044: Mediawiki History: moves counted twice in Revision from Medium to Needs Triage.
Dec 9 2019, 5:30 PM · Analytics
mforns placed T188041: Generate pagecounts-ez data back to 2008 up for grabs.
Dec 9 2019, 5:28 PM · Analytics
mforns moved T188041: Generate pagecounts-ez data back to 2008 from Smart Tools for Better Data to Mentoring on the Analytics board.
Dec 9 2019, 5:28 PM · Analytics
mforns raised the priority of T178832: Investigate AQS cassandra schema hash warninga from Medium to Needs Triage.
Dec 9 2019, 5:27 PM · Analytics
mforns triaged T178832: Investigate AQS cassandra schema hash warninga as Medium priority.
Dec 9 2019, 5:27 PM · Analytics
mforns moved T178832: Investigate AQS cassandra schema hash warninga from Operational Excellence to Ops Week on the Analytics board.
Dec 9 2019, 5:26 PM · Analytics
mforns lowered the priority of T212928: [Spike] Spark job for digests-only mediawiki-history-reduced from High to Medium.
Dec 9 2019, 5:23 PM · Analytics
mforns added a comment to T239903: Kerberize Superset to allow Presto queries.

Oh! Cool.
Thanks for looking into this, @elukey!

Dec 9 2019, 3:58 PM · User-Elukey, Analytics-Kanban, Analytics
mforns added a comment to T232671: Use Reportupdater for WMCS edits queries.

The most unambiguous data-point labels we could use would be intervals, like: "2019-10-01T00:00:00 - 2019-10-31T23:59:59".
But that would be too long for charts and usually inconvenient, so we chose to use the start of the interval as the label.
So, 2019-10-01 means the data of that data-point belongs to October 2019 (because data is monthly), not that it was calculated on that date, which would be impossible!
A data-point labeled 2019-10-01 is calculated by reportupdater (in monthly granularity setup) once the corresponding month is over (plus any offset specified in the config), in this case (2019-11-01 + OFFSET).
I know it's not elegant, but reportupdater needs the full YYYY-MM-DD date to recognize present/missing time ranges and to work properly.
We could change that! But for now it's needed.
As for file names, I'm OK with the ones you prefer. You're the owner of those reports! :]
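
The labeling rule described above can be sketched as follows. This is not reportupdater code: the function names are hypothetical, and representing the configured offset as a `timedelta` is an assumption.

```python
from datetime import date, timedelta


def month_interval(label):
    """Return (start, end_exclusive) of the month that a label like 2019-10-01 stands for."""
    start = label.replace(day=1)
    if start.month == 12:
        end = date(start.year + 1, 1, 1)
    else:
        end = date(start.year, start.month + 1, 1)
    return start, end


def earliest_run(label, offset=timedelta(0)):
    """Earliest date on which the data-point can be computed: month end plus offset."""
    _, end = month_interval(label)
    return end + offset


label = date(2019, 10, 1)
print(month_interval(label))  # covers October 2019
print(earliest_run(label, offset=timedelta(days=2)))  # offset of 2 days is made up
```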

Dec 9 2019, 3:57 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services

Dec 5 2019

mforns reassigned T239685: Analytics: Some pages/page requests are not reflected in statistics from Nuria to Milimetric.
Dec 5 2019, 6:28 PM · Wikimedia Design Style Guide, Analytics
mforns moved T239685: Analytics: Some pages/page requests are not reflected in statistics from Incoming to Ops Week on the Analytics board.
Dec 5 2019, 6:28 PM · Wikimedia Design Style Guide, Analytics
mforns assigned T239685: Analytics: Some pages/page requests are not reflected in statistics to Nuria.
Dec 5 2019, 6:27 PM · Wikimedia Design Style Guide, Analytics
mforns lowered the priority of T233073: Test if Hue can run with Python3 from High to Medium.
Dec 5 2019, 6:20 PM · User-Elukey, Analytics
mforns raised the priority of T233073: Test if Hue can run with Python3 from Medium to High.
Dec 5 2019, 6:20 PM · User-Elukey, Analytics
mforns triaged T233073: Test if Hue can run with Python3 as Medium priority.
Dec 5 2019, 6:19 PM · User-Elukey, Analytics
mforns moved T233073: Test if Hue can run with Python3 from Incoming to Operational Excellence on the Analytics board.
Dec 5 2019, 6:19 PM · User-Elukey, Analytics
mforns added a comment to T233073: Test if Hue can run with Python3.

Grosking: We can try to see if it works with CDH5,
or we could deprecate Hue and use something else?

Dec 5 2019, 6:19 PM · User-Elukey, Analytics
mforns closed T239127: Import slots/slots_roles and wikibase.wbc_entity_usage through scoop , a subtask of T238878: Data about how many file pages on Commons contain at least one structured data element , as Resolved.
Dec 5 2019, 6:17 PM · Product-Analytics, SDC General, Analytics, Wikidata
mforns closed T239127: Import slots/slots_roles and wikibase.wbc_entity_usage through scoop as Resolved.
Dec 5 2019, 6:17 PM · Analytics-Kanban, Analytics
mforns moved T239127: Import slots/slots_roles and wikibase.wbc_entity_usage through scoop from Incoming to Smart Tools for Better Data on the Analytics board.
Dec 5 2019, 6:17 PM · Analytics-Kanban, Analytics
mforns moved T239130: Superset getting slower as usage increases from Operational Excellence to Radar on the Analytics board.
Dec 5 2019, 6:16 PM · Analytics
mforns moved T239130: Superset getting slower as usage increases from Incoming to Operational Excellence on the Analytics board.
Dec 5 2019, 6:15 PM · Analytics
mforns assigned T239130: Superset getting slower as usage increases to Nuria.
Dec 5 2019, 6:12 PM · Analytics
mforns added a comment to T239130: Superset getting slower as usage increases.

It could maybe be Druid as well.
Let's troubleshoot and determine the cause.

Dec 5 2019, 6:12 PM · Analytics
mforns triaged T239136: Revise wiki scoop list from labs once a quarter as Low priority.
Dec 5 2019, 6:10 PM · Analytics
mforns moved T239136: Revise wiki scoop list from labs once a quarter from Incoming to Ops Week on the Analytics board.
Dec 5 2019, 6:10 PM · Analytics
mforns triaged T239365: Degraded RAID on an-worker1089 as High priority.
Dec 5 2019, 6:10 PM · Analytics, ops-eqiad, Operations
mforns moved T239365: Degraded RAID on an-worker1089 from Incoming to Operational Excellence on the Analytics board.
Dec 5 2019, 6:10 PM · Analytics, ops-eqiad, Operations
mforns moved T239393: Public data set review for T237728 from Incoming to Radar on the Analytics board.
Dec 5 2019, 6:09 PM · Privacy, Analytics, WMDE-Analytics-Engineering, User-GoranSMilovanovic
mforns updated subscribers of T239393: Public data set review for T237728.

I think @JFishback_WMF can help you with this task.

Dec 5 2019, 6:09 PM · Privacy, Analytics, WMDE-Analytics-Engineering, User-GoranSMilovanovic
mforns triaged T239565: Create reportupdater reports that execute SDC requests as High priority.
Dec 5 2019, 6:06 PM · Analytics-Kanban, Product-Analytics, SDC General, Wikidata, Analytics
mforns moved T239565: Create reportupdater reports that execute SDC requests from Incoming to Smart Tools for Better Data on the Analytics board.
Dec 5 2019, 6:06 PM · Analytics-Kanban, Product-Analytics, SDC General, Wikidata, Analytics
mforns assigned T239571: Check home leftovers of dfoy to Milimetric.
Dec 5 2019, 6:04 PM · Product-Analytics, Analytics
mforns triaged T239571: Check home leftovers of dfoy as Medium priority.
Dec 5 2019, 6:03 PM · Product-Analytics, Analytics
mforns moved T239571: Check home leftovers of dfoy from Incoming to Ops Week on the Analytics board.
Dec 5 2019, 6:03 PM · Product-Analytics, Analytics
mforns triaged T239589: Change sqoop project list config so that content sqoop doesn't fail as High priority.
Dec 5 2019, 6:02 PM · Analytics
mforns moved T239589: Change sqoop project list config so that content sqoop doesn't fail from Incoming to Smart Tools for Better Data on the Analytics board.
Dec 5 2019, 6:02 PM · Analytics
mforns triaged T239591: Update mediawiki-history to use new Multi-Content-Revision tables as High priority.
Dec 5 2019, 6:01 PM · Core Platform Team, Analytics
mforns moved T239591: Update mediawiki-history to use new Multi-Content-Revision tables from Incoming to Smart Tools for Better Data on the Analytics board.
Dec 5 2019, 6:01 PM · Core Platform Team, Analytics
mforns moved T239655: Mediaviewer preloads should be marked as such via x-analytics tag from Incoming to Radar on the Analytics board.
Dec 5 2019, 6:01 PM · Multimedia, Analytics
mforns closed T239848: Delay cassandra mediarequest-per-file daily job one hour so that it doesn't colide with pageview-per-article as Resolved.
Dec 5 2019, 6:00 PM · Analytics-Kanban, Analytics
mforns triaged T239848: Delay cassandra mediarequest-per-file daily job one hour so that it doesn't colide with pageview-per-article as High priority.
Dec 5 2019, 6:00 PM · Analytics-Kanban, Analytics
mforns moved T239848: Delay cassandra mediarequest-per-file daily job one hour so that it doesn't colide with pageview-per-article from Incoming to Ops Week on the Analytics board.
Dec 5 2019, 6:00 PM · Analytics-Kanban, Analytics
mforns triaged T239852: Add pertinent wdqs_external_sparql_query metrics and wdqs_internal_sparql_query to a superset dashboard as High priority.
Dec 5 2019, 5:59 PM · Analytics
mforns moved T239852: Add pertinent wdqs_external_sparql_query metrics and wdqs_internal_sparql_query to a superset dashboard from Incoming to Smart Tools for Better Data on the Analytics board.
Dec 5 2019, 5:59 PM · Analytics
mforns added a comment to T239852: Add pertinent wdqs_external_sparql_query metrics and wdqs_internal_sparql_query to a superset dashboard .

We should ingest that data into Druid, so that it is queryable from Superset.

Dec 5 2019, 5:59 PM · Analytics
mforns moved T239885: Creating a wikipedia CDN caching trace from Incoming to Radar on the Analytics board.
Dec 5 2019, 5:58 PM · Analytics, MediaWiki-Cache, Research
mforns added a comment to T239885: Creating a wikipedia CDN caching trace .

Does this data set answer your needs? Not sure if you're asking for that.
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Caching

Dec 5 2019, 5:57 PM · Analytics, MediaWiki-Cache, Research
mforns triaged T239903: Kerberize Superset to allow Presto queries as High priority.
Dec 5 2019, 5:56 PM · User-Elukey, Analytics-Kanban, Analytics
mforns added a comment to T239903: Kerberize Superset to allow Presto queries.

We could ping dropbox, to see if they want to upgrade pypi?
We could also ping superset if they want to change lib? <-- Maybe better bet.

Dec 5 2019, 5:55 PM · User-Elukey, Analytics-Kanban, Analytics
mforns moved T239903: Kerberize Superset to allow Presto queries from Incoming to Operational Excellence on the Analytics board.
Dec 5 2019, 5:53 PM · User-Elukey, Analytics-Kanban, Analytics

Dec 4 2019

mforns added a comment to T232671: Use Reportupdater for WMCS edits queries.

@srishakatux I can see November data in the dashboard. It must be a caching issue, and should be over soon.

Dec 4 2019, 9:52 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
mforns added a comment to T237124: Growth: implement wider data purge window.

I will implement a deletion timer specific to those 3 schemas,
that will delete all their data from the event_sanitized database after 270 days of collection.
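
The retention rule can be sketched like this. The sketch assumes that "after 270 days" means strictly older than 270 days and that age is measured per event day; the actual timer may define both differently, and the names are hypothetical.

```python
from datetime import date, timedelta

RETENTION_DAYS = 270  # retention window stated in this task


def is_purgeable(event_day, today, retention_days=RETENTION_DAYS):
    """True if data collected on `event_day` is older than the retention window."""
    return (today - event_day) > timedelta(days=retention_days)


today = date(2019, 12, 4)
print(is_purgeable(date(2019, 1, 1), today))   # well past the window
print(is_purgeable(date(2019, 11, 1), today))  # still inside the window
```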

Dec 4 2019, 8:34 PM · Growth-Team, Patch-For-Review, Analytics, Product-Analytics
mforns added a comment to T232671: Use Reportupdater for WMCS edits queries.

@srishakatux I think the jobs have finally run and their results are as expected!
Please check the results!
They are not yet in https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/wmcs/ but will be soon, I hope.

Dec 4 2019, 7:25 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
mforns committed rARPQa0357ef562f0: Add funnel parameter to wmcs queries that return multiple rows (authored by mforns).
Add funnel parameter to wmcs queries that return multiple rows
Dec 4 2019, 6:25 PM
mforns added a comment to T232671: Use Reportupdater for WMCS edits queries.

@srishakatux
One of the queries failed again!
I had tested it before from the hive command line and it worked fine!
But as reportupdater executes it as a script, the ${wikis} hive var was being interpreted and replaced as a bash parameter, thus failing in hive.
I escaped the $ sign in the query and this will hopefully fix the problem, see gerrit changes.
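
To illustrate the kind of escaping involved, here is a small sketch. It is not the actual patch: the helper name and the sample query are hypothetical, and it assumes the query text ends up inside a double-quoted bash string.

```python
import re


def escape_dollars(query):
    """Prefix every unescaped `$` with a backslash so bash leaves it for hive."""
    # The lookbehind skips dollar signs that are already preceded by a backslash.
    return re.sub(r"(?<!\\)\$", r"\\$", query)


# Hypothetical query using the ${wikis} hive variable mentioned above.
query = "SELECT wiki, COUNT(*) FROM edits WHERE wiki IN (${wikis}) GROUP BY wiki"
print(escape_dollars(query))
```

With the backslash in place, bash passes `${wikis}` through literally and hive performs the substitution.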

Dec 4 2019, 4:48 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
mforns committed rARPQc19cf1b2cb4d: Escape dollar sign in hive script for wmcs (authored by mforns).
Escape dollar sign in hive script for wmcs
Dec 4 2019, 4:47 PM
mforns added a comment to T232671: Use Reportupdater for WMCS edits queries.

@srishakatux
There were a couple minor bugs in 2 of the queries.
That's why the reports weren't there.
Sorry for not having caught those in the code review.
I created another patch that fixes the problems and also removes some unnecessary code.
Merged the patch to unbreak production but left some comments there in case you want to look!
Hopefully, in a couple hours you should see the reports updated in https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/wmcs/.
Cheers!

Dec 4 2019, 3:20 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
mforns committed rARPQ53c9e262c43b: Correct minor details in wmcs queries (authored by mforns).
Correct minor details in wmcs queries
Dec 4 2019, 3:17 PM
mforns added a comment to T237124: Growth: implement wider data purge window.

@Nuria, I believe, for now, that would be OK for them.
@nettrom_WMF explained that they are aiming to run short-term analyses over 270 days,
and that so far they have no interest in keeping a fully-sanitized version of the data for longer.

Dec 4 2019, 12:59 PM · Growth-Team, Patch-For-Review, Analytics, Product-Analytics

Dec 3 2019

mforns added a comment to T232671: Use Reportupdater for WMCS edits queries.

@srishakatux I checked the reportupdater logs and it seems the queries have failed: they return no results for some reason.
I will troubleshoot this tomorrow and let you know!

Dec 3 2019, 11:21 PM · Patch-For-Review, Developer-Advocacy (Oct-Dec 2019), Cloud-Services
mforns added a comment to T237124: Growth: implement wider data purge window.

After discussing with @nettrom_WMF we concluded that the solution described above is not a fit.
The raw (unsanitized) data does not have the approval from legal to be kept for 270 days.
For the data to be kept for 270 days, some fields need to be sanitized.
It wouldn't be a "full" sanitization, meaning some fields would still be privacy-sensitive to a certain level,
but the "half-sanitization" would be enough to keep the data for 270 days.

Dec 3 2019, 10:38 PM · Growth-Team, Patch-For-Review, Analytics, Product-Analytics

Dec 2 2019

mforns updated subscribers of T239591: Update mediawiki-history to use new Multi-Content-Revision tables.

@WDoranWMF Hi!
We are trying to prioritize this task,
do you know when the changes to the revision table (move fields to content table through slots) are going to take place?
Thanks!

Dec 2 2019, 5:09 PM · Core Platform Team, Analytics
mforns triaged T239625: Improve quality of external referer data as High priority.
Dec 2 2019, 4:58 PM · Product-Analytics, Analytics-Kanban, Research, Analytics