
JAllemandou (joal)
Data Engineer

User Details

User Since
Feb 11 2015, 6:02 PM (298 w, 2 d)
Availability
Available
IRC Nick
joal
LDAP User
Unknown
MediaWiki User
JAllemandou (WMF) [ Global Accounts ]

Recent Activity

Yesterday

JAllemandou added a comment to T207171: Have a way to show the most popular pages per country.

My Takeaways:

  • If we want useful data by country by language with day granularity, then the pageview threshold should be on the order of 100(s), which seems relatively low in terms of privacy

I agree that a threshold of 100s of pageviews seems small for privacy.

Fri, Oct 30, 2:05 PM · Analytics-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews, Analytics

Tue, Oct 27

JAllemandou added a comment to T266578: Mysterious anonymous content page creations on English Wikipedia according to stats.wikimedia.org.

I did a quick check for month 2020-09:

spark.sql("""
SELECT
  (caused_by_user_id IS NULL) as by_anon,
  page_namespace_is_content, -- current value of the page_namespace for the page
  page_namespace_is_content_historical, -- page_namespace at the time of page creation
  COUNT(1)
FROM wmf.mediawiki_page_history
WHERE snapshot = '2020-09'
  AND caused_by_event_type = 'create'
  AND start_timestamp >= '2020-09-01'
  AND wiki_db = 'enwiki'
  AND not page_is_deleted
GROUP BY
  (caused_by_user_id IS NULL),
  page_namespace_is_content,
  page_namespace_is_content_historical
ORDER BY
  by_anon,
  page_namespace_is_content,
  page_namespace_is_content_historical
""").show(100, false)
Tue, Oct 27, 8:07 PM · Analytics-Data-Quality, Analytics-Dashiki, Analytics, Analytics-Wikistats, Product-Analytics
JAllemandou moved T266322: Possible issue between Maxmind and Hive 2.x libs in Refinery source from In Progress to Ready to Deploy on the Analytics-Kanban board.
Tue, Oct 27, 5:37 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T262261: Check whether mediawiki production event data is equivalent to mediawiki-history data over a month .

Final note.
I have found 2 problems with mediawiki-events (over simplewiki only, using mediawiki-history as a baseline):

  • Some events are lost (seen in revision-create, page-create, page-move) - documented in T215001.
  • Some events are duplicated (seen in revision-create only) - documented in T262203.

I have also identified new features that would need to be added to events in order to recreate mediawiki-history from events:

  • Add user-create and user-rename events - documented in T262205
  • Add logging events (corresponding to an addition to the logging table) and a reference to those events in the main related events. For instance every page-create, page-move, user-create etc events would point to their corresponding logging-event - Documented in T263055
Tue, Oct 27, 2:25 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T266322: Possible issue between Maxmind and Hive 2.x libs in Refinery source .

Following the stackoverflow link pasted in the task I have:

Tue, Oct 27, 2:08 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T262261: Check whether mediawiki production event data is equivalent to mediawiki-history data over a month from In Progress to Done on the Analytics-Kanban board.
Tue, Oct 27, 10:03 AM · Analytics-Kanban, Analytics
JAllemandou added a comment to T256050: Add dimensions to editors_daily dataset.

No problem @cchen - It's a shame if the patch stays stale while needed :)

Tue, Oct 27, 9:01 AM · Analytics-Kanban, Patch-For-Review, Product-Analytics, Analytics
JAllemandou moved T263736: Improve mediawiki-wikitext spark job repartitioning from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Tue, Oct 27, 8:50 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T263736: Improve mediawiki-wikitext spark job repartitioning from In Progress to In Code Review on the Analytics-Kanban board.
Tue, Oct 27, 8:45 AM · Patch-For-Review, Analytics-Kanban, Analytics

Mon, Oct 26

JAllemandou moved T263529: Prevent dumps-dependent jobs to wait indefinitely from Ready to Deploy to Done on the Analytics-Kanban board.
Mon, Oct 26, 5:00 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T256050: Add dimensions to editors_daily dataset.

Ping @cchen on the code review - The CR moves slowly, is that ok for you @cchen or should we try to find a better coordination?

Mon, Oct 26, 4:59 PM · Analytics-Kanban, Patch-For-Review, Product-Analytics, Analytics
JAllemandou moved T238400: Evaluate possible replacements for Camus: Gobblin, Marmaray, Kafka Connect HDFS, etc. from Paused to In Progress on the Analytics-Kanban board.
Mon, Oct 26, 4:40 PM · Analytics-Kanban, Event-Platform, Analytics
JAllemandou added a comment to T262256: Test hudi and Iceberg as an incremental update system using 2 mediawiki-history snapshots.

Next up: test Apache Iceberg when data mutation feature (Copy on Write) is released.

Mon, Oct 26, 4:39 PM · Analytics-Kanban, Analytics
JAllemandou renamed T262256: Test hudi and Iceberg as an incremental update system using 2 mediawiki-history snapshots from Test hudi as an incremental update system using 2 mediawiki-history snapshots to Test hudi and Iceberg as an incremental update system using 2 mediawiki-history snapshots.
Mon, Oct 26, 4:39 PM · Analytics-Kanban, Analytics
JAllemandou moved T262256: Test hudi and Iceberg as an incremental update system using 2 mediawiki-history snapshots from In Progress to Paused on the Analytics-Kanban board.
Mon, Oct 26, 4:38 PM · Analytics-Kanban, Analytics
JAllemandou merged task T264660: Wikistats - Add avk.wikipedia.or to scoop list into T258033: Stats for newer projects not available.
Mon, Oct 26, 4:32 PM · Analytics-Kanban, Analytics, Analytics-Wikistats
JAllemandou merged T264660: Wikistats - Add avk.wikipedia.or to scoop list into T258033: Stats for newer projects not available.
Mon, Oct 26, 4:32 PM · Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou moved T260409: Establish what data must be backed up before the HDFS upgrade from Next Up to Ready to Deploy on the Analytics-Kanban board.
Mon, Oct 26, 4:30 PM · Analytics-Kanban, Analytics
JAllemandou moved T266322: Possible issue between Maxmind and Hive 2.x libs in Refinery source from Next Up to In Progress on the Analytics-Kanban board.
Mon, Oct 26, 4:30 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T263529: Prevent dumps-dependent jobs to wait indefinitely from In Progress to Ready to Deploy on the Analytics-Kanban board.
Mon, Oct 26, 4:29 PM · Analytics-Kanban, Analytics
JAllemandou moved T261283: Review current usage of HDFS and establish what/if data can be dropped periodically from In Progress to Paused on the Analytics-Kanban board.
Mon, Oct 26, 4:29 PM · Analytics-Kanban, Patch-For-Review, Analytics
JAllemandou claimed T266322: Possible issue between Maxmind and Hive 2.x libs in Refinery source .
Mon, Oct 26, 3:48 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T264896: Fix the remaining bugs open on for Hue next.

I have experienced problems with jobs pagination as well:

Mon, Oct 26, 9:10 AM · Analytics
JAllemandou awarded T265487: Review recurrent Hadoop worker disk saturation events a Hungry Hippo token.
Mon, Oct 26, 8:45 AM · Analytics-Clusters
JAllemandou awarded T266374: Analyze differences between checksum-based and revert-tag based reverts in mediawiki_history a Grey Medal token.
Mon, Oct 26, 8:29 AM · Product-Analytics, Analytics
JAllemandou added a comment to T266374: Analyze differences between checksum-based and revert-tag based reverts in mediawiki_history.

The expected difference is to have more (possibly many more) tag-based reverts than checksum-based reverts.

Mon, Oct 26, 8:28 AM · Product-Analytics, Analytics
JAllemandou awarded T257412: Review an-coord1001's usage and failover plans a Mountain of Wealth token.
Mon, Oct 26, 8:23 AM · Patch-For-Review, Analytics-Clusters

Tue, Oct 20

JAllemandou added a comment to T265851: Sqoop problem on stat1004.

@GoranSMilovanovic the needed docs are updated (no sqoop page per se, but related sqoop usages in other pages).
Feel free to close the task. Thanks!

Tue, Oct 20, 12:23 PM · User-GoranSMilovanovic, WMDE-Analytics-Engineering, Analytics

Thu, Oct 15

JAllemandou added a comment to T236740: Remove postal code and longitude / latitude from geocoded data object on webrequest data.

I quickly reviewed refinery and refine-to-druid jobs and found that none uses either postal-code or lat/long. I think we're safe to remove them :)

Thu, Oct 15, 6:59 AM · Analytics-Kanban, Product-Analytics, Analytics

Tue, Oct 13

JAllemandou added a comment to T256050: Add dimensions to editors_daily dataset.

Hi @cchen - I have commented on your CR last week and wanted to be sure you noticed :)

Tue, Oct 13, 12:37 PM · Analytics-Kanban, Patch-For-Review, Product-Analytics, Analytics

Fri, Oct 9

JAllemandou added a comment to T261841: Tag WDQS query log with the source of the query (UI vs direct access).

Some more info on this aspect: I have done a quick analysis over September queries today and found that my assumption that long queries were made by users from UI is wrong.

Fri, Oct 9, 3:39 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
JAllemandou added a comment to T256050: Add dimensions to editors_daily dataset.

Thanks @cchen - Let's make the CR move for data to appear from next month :)

Fri, Oct 9, 6:51 AM · Analytics-Kanban, Patch-For-Review, Product-Analytics, Analytics

Thu, Oct 8

JAllemandou moved T263529: Prevent dumps-dependent jobs to wait indefinitely from Done to In Progress on the Analytics-Kanban board.
Thu, Oct 8, 5:14 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T207171: Have a way to show the most popular pages per country.

speaks of a bug

I disagree :)
The raw data is broadly available per language project on the API or in dumps. Therefore the top endpoint is just a convenience computation over data that is already available.
For country data it is different as it is a dimension we currently provide for pageviews at project level only, and with bucketing.

Thu, Oct 8, 3:50 PM · Analytics-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews, Analytics
JAllemandou added a comment to T256050: Add dimensions to editors_daily dataset.

Starting to work on this.
I have exchanged with @Milimetric about the platform field making the number of editors non-additive.

Thu, Oct 8, 1:24 PM · Analytics-Kanban, Patch-For-Review, Product-Analytics, Analytics
JAllemandou added a comment to T264945: Update Wikidata usage metric.

@Nuria: I disagree - The reason for which we can't have historical data for this metric is because wmf_raw.mediawiki_wbc_entity_usage is not historified. We could always use the last available dump of wmf.mediawiki_page_history and be able to get results very close to perfect based on historical data being present and mostly correct in that dataset. The work to be done to be able to reproduce historical results in this specific case (as in quite some others where the tables are not historified) is to enforce some way of historification, whether through content-parsing (there must be a way to rebuild wikidata-item usage in a page when parsing the revision content), or logging table.

Thu, Oct 8, 8:04 AM · Analytics-Kanban, Analytics

Wed, Oct 7

JAllemandou moved T262826: Purge raw webrequest_stats and webrequest_stats_hourly from Next Up to In Code Review on the Analytics-Kanban board.
Wed, Oct 7, 6:48 PM · Analytics-Kanban, Analytics
JAllemandou claimed T262826: Purge raw webrequest_stats and webrequest_stats_hourly.
Wed, Oct 7, 6:48 PM · Analytics-Kanban, Analytics
JAllemandou added a project to T262826: Purge raw webrequest_stats and webrequest_stats_hourly: Analytics-Kanban.
Wed, Oct 7, 6:48 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T260409: Establish what data must be backed up before the HDFS upgrade.

Ok after a quick chat with Dan it appears my way of presenting might have been confusing. Here are some clarifications (hopefully?)
We wish to backup everything except

  • raw data, meaning all of /wmf/data/raw except one folder (webrequest-sequence-stats). Data not backed up includes raw data from camus (webrequest, events, netflow), dumps copied from labstore, sqooped data.
  • 2 months of refined webrequest (oldest) as in /wmf/data/wmf/webrequest/*/year=2020/month=[78]
  • processed wikitext, as it is huge and can be regenerated (even if long)
Wed, Oct 7, 6:14 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T260409: Establish what data must be backed up before the HDFS upgrade.

You mean "unprocessed events" ? cause we need to backup the sanitized and unsanitized versions

"unprocessed" as in raw-from-camus (the /wmf/data/raw doesn't get backed-up). /wmf/data/event and /wmf/data/event_sanitized are both planned to be backed-up.

Wed, Oct 7, 5:10 PM · Analytics-Kanban, Analytics

Tue, Oct 6

JAllemandou added a comment to T264791: Rework how mediawiki-history differentiates fake page-create from real ones.

couldn't we just join based on the page_first_revision timestamp?

We'd need to pick which timestamp is used for the join based on event-type: use first-revision-timestamp for create events, otherwise use event-timestamp. We would also need to triple check that pages have a single and unique create event. This solution is effective and easy to implement, I like it :)
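A minimal Python sketch of that rule, to make the idea concrete. The field names (`event_type`, `first_revision_timestamp`, `event_timestamp`, `page_id`) are hypothetical and only illustrate the logic; they are not the actual mediawiki-history schema.

```python
from collections import Counter

def join_timestamp(event):
    """Pick the timestamp to join on, depending on the event type.

    Field names are hypothetical, for illustration only.
    """
    if event["event_type"] == "create":
        return event["first_revision_timestamp"]
    return event["event_timestamp"]

def pages_without_single_create(events):
    """Return page ids that do not have exactly one create event."""
    creates = Counter(
        e["page_id"] for e in events if e["event_type"] == "create"
    )
    page_ids = {e["page_id"] for e in events}
    return sorted(p for p in page_ids if creates[p] != 1)

# Hypothetical sample: page 1 is well-formed, page 2 has two create events.
events = [
    {"page_id": 1, "event_type": "create",
     "first_revision_timestamp": "2020-09-01T10:00:00",
     "event_timestamp": "2020-09-01T10:00:05"},
    {"page_id": 1, "event_type": "move",
     "first_revision_timestamp": "2020-09-01T10:00:00",
     "event_timestamp": "2020-09-02T08:00:00"},
    {"page_id": 2, "event_type": "create",
     "first_revision_timestamp": "2020-09-03T12:00:00",
     "event_timestamp": "2020-09-03T12:00:01"},
    {"page_id": 2, "event_type": "create",
     "first_revision_timestamp": "2020-09-03T12:00:00",
     "event_timestamp": "2020-09-04T09:00:00"},
]

print([join_timestamp(e) for e in events[:2]])
print(pages_without_single_create(events))
```

The second helper is the "triple check": any page id it returns would break the assumption that the join key is unique per page.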

Tue, Oct 6, 8:20 PM · Analytics
JAllemandou updated the task description for T264791: Rework how mediawiki-history differentiates fake page-create from real ones.
Tue, Oct 6, 8:17 PM · Analytics
JAllemandou updated the task description for T264791: Rework how mediawiki-history differentiates fake page-create from real ones.
Tue, Oct 6, 7:48 PM · Analytics
JAllemandou updated the task description for T264791: Rework how mediawiki-history differentiates fake page-create from real ones.
Tue, Oct 6, 7:36 PM · Analytics
JAllemandou created T264791: Rework how mediawiki-history differentiates fake page-create from real ones.
Tue, Oct 6, 7:36 PM · Analytics
JAllemandou added a comment to T262920: Indexing errors / malformed logs for aqs on cassandra timeout.

Thanks @colewhite for the explanation.

Tue, Oct 6, 4:22 PM · Analytics, observability
JAllemandou added a comment to T261841: Tag WDQS query log with the source of the query (UI vs direct access).

I continued my analysis today looking at top-100 parsed user-agents from both queries-with-referer subset, and queries-without-referer subset, over the month of September.
See https://phabricator.wikimedia.org/P12933

  • The queries-with-referer have a defined user-agent, meaning that the user-agent-parser we use to extract structured information from the user-agent line provides values for a lot of its fields. By looking at the top-100 user-agents we actually cover more than 90% of requests made with referer.
  • The queries-without-referer have either an undefined or Spider user-agent, meaning that the user-agent line is either not parseable or is parsed as a bot. I manually inspected the user-agent lines and confirm that most of them look like bots (particularly the ones making most requests). By looking at the top-100 user-agents we also cover more than 90% of requests made without referer.
Tue, Oct 6, 1:45 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
JAllemandou created P12933 WDQS Top-100 user-agent analysis 2020-09.
Tue, Oct 6, 1:44 PM
JAllemandou added a comment to T260409: Establish what data must be backed up before the HDFS upgrade.

After talking with the, we chose to backup all data except for logs, raw data (unprocessed webrequest, events, and dumps), 2 month of webrequest, and processed wikitext (heavy).
Here is the sizing I came up with (using useful size, not replicated size):

 Command                                                            | Size
--------------------------------------------------------------------+-------
 hdfs dfs -du -s -h /                                               | 670Tb
 hdfs dfs -du -s -h /var/log                                        | 34Tb
 hdfs dfs -du -s -h /wmf/data/raw                                   | 140Tb
 hdfs dfs -du -s -h /wmf/data/wmf/webrequest/*/year=2020/month=[78] | 83Tb
 hdfs dfs -du -s -h /wmf/data/wmf/mediawiki/wikitext/history        | 53Tb
 Total (670 - (34 + 140 + 83 + 53))                                 | 360Tb
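For anyone wanting to re-check the total, here is a small Python sketch of the subtraction, using only the sizes quoted in the listing above. The function name is made up for illustration.

```python
# Sizes in Tb as quoted above (useful size, not replicated size).
total_tb = 670
excluded_tb = {
    "/var/log": 34,
    "/wmf/data/raw": 140,
    "/wmf/data/wmf/webrequest/*/year=2020/month=[78]": 83,
    "/wmf/data/wmf/mediawiki/wikitext/history": 53,
}

def backup_size_tb(total, excluded):
    """Size left to back up once the excluded paths are subtracted."""
    return total - sum(excluded.values())

print(backup_size_tb(total_tb, excluded_tb))
```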
Tue, Oct 6, 12:40 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T264660: Wikistats - Add avk.wikipedia.or to scoop list .

avkwiki has been added in https://gerrit.wikimedia.org/r/c/analytics/refinery/+/628917
This patch however got deployed after this month's sqoop run, so data will be available from next month only.
Sorry for the delay.

Tue, Oct 6, 10:10 AM · Analytics-Kanban, Analytics, Analytics-Wikistats

Mon, Oct 5

JAllemandou moved T256050: Add dimensions to editors_daily dataset from Next Up to In Progress on the Analytics-Kanban board.
Mon, Oct 5, 7:56 PM · Analytics-Kanban, Patch-For-Review, Product-Analytics, Analytics
JAllemandou moved T263529: Prevent dumps-dependent jobs to wait indefinitely from In Code Review to Done on the Analytics-Kanban board.
Mon, Oct 5, 6:32 PM · Analytics-Kanban, Analytics

Fri, Oct 2

JAllemandou added a comment to T261841: Tag WDQS query log with the source of the query (UI vs direct access).

Heya - I'm sorry I completely missed the ping :S
Quick analysis:

spark.sql("SELECT (http.request_headers['referer'] IS NOT NULL) as defined_referer, count(1) as c from event.wdqs_external_sparql_query where year = 2020 and month = 9 group by (http.request_headers['referer'] IS NOT NULL) limit 100").show(100, false)
+---------------+---------+                                                     
|defined_referer|c        |
+---------------+---------+
|false          |165201676|
|true           |5613278  |
+---------------+---------+

--> 3.3% of requests have referer defined for September
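The percentage follows directly from the two counts in the table; a quick Python sketch of the computation (the function name is just for illustration):

```python
# Counts copied from the query result above.
with_referer = 5613278
without_referer = 165201676

def referer_share(with_ref, without_ref):
    """Fraction of requests that carry a referer header."""
    return with_ref / (with_ref + without_ref)

share = referer_share(with_referer, without_referer)
print(f"{share:.1%}")
```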

Fri, Oct 2, 6:29 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
JAllemandou moved T258047: Import page_props table to Hive from Ready to Deploy to Done on the Analytics-Kanban board.
Fri, Oct 2, 4:31 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T258047: Import page_props table to Hive.

@MMiller_WMF we missed this month's deploy of this change. Would it be OK to wait for the run of November 1st, or do you need it sooner?

Fri, Oct 2, 8:53 AM · Patch-For-Review, Analytics-Kanban, Analytics

Thu, Oct 1

JAllemandou created T264358: Investigate oozie banner monthly job timeouts.
Thu, Oct 1, 7:55 PM · Analytics

Sep 29 2020

JAllemandou added a comment to T264081: Increase in usage of /var/lib/mysql on an-coord1001 after Sept 21st.

My 2 cents on that one: Oozie has a setting about how long it keeps historical information for workflows/coords/bundles. I imagine we can manually tweak it to drop recent info, but that would mean losing possibly interesting data from other jobs.
Another approach is to manually query the Mysql oozie table for workflows and drop the finished ones from backfilling.

Sep 29 2020, 11:48 AM · Analytics

Sep 28 2020

JAllemandou added a comment to T264021: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface.

Idea: Could missing-revisions (T215001) be related to this?

Sep 28 2020, 5:31 PM · Analytics-Kanban, Traffic, Analytics, Operations
JKatzWMF awarded T258047: Import page_props table to Hive a Yellow Medal token.
Sep 28 2020, 4:29 PM · Patch-For-Review, Analytics-Kanban, Analytics

Sep 24 2020

JAllemandou added a comment to T263496: Augment NEL reports with GeoIP country code and network AS number.

Currently these reports are going to Logstash; I don't think there's any refinement possible there?

Not the refinement we do usually on the cluster indeed.

We also value near-realtime for this stream, which I'm not sure but I think complicates any Hive/refinement discussion?

Correct - We use hourly batch in spark, but also wait for late data. near real-time implies stream-processing :)

Sep 24 2020, 3:04 PM · Patch-For-Review, Analytics, Operations
JAllemandou added a comment to T263496: Augment NEL reports with GeoIP country code and network AS number.

Question on the need for data @CDanis : Is the data augmentation needed in stream, or would refinement on the cluster be sufficient?

Sep 24 2020, 2:03 PM · Patch-For-Review, Analytics, Operations
JAllemandou moved T263736: Improve mediawiki-wikitext spark job repartitioning from Next Up to In Code Review on the Analytics-Kanban board.
Sep 24 2020, 11:38 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou claimed T263736: Improve mediawiki-wikitext spark job repartitioning.
Sep 24 2020, 11:37 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou created T263736: Improve mediawiki-wikitext spark job repartitioning.
Sep 24 2020, 11:37 AM · Patch-For-Review, Analytics-Kanban, Analytics

Sep 22 2020

JAllemandou renamed T263529: Prevent dumps-dependent jobs to wait indefinitely from Prevent wikidata-entity jobs to wait indefinitely to Prevent dumps-dependent jobs to wait indefinitely.
Sep 22 2020, 9:39 AM · Analytics-Kanban, Analytics
JAllemandou moved T263529: Prevent dumps-dependent jobs to wait indefinitely from Next Up to In Code Review on the Analytics-Kanban board.
Sep 22 2020, 9:35 AM · Analytics-Kanban, Analytics
JAllemandou claimed T263529: Prevent dumps-dependent jobs to wait indefinitely.
Sep 22 2020, 9:35 AM · Analytics-Kanban, Analytics
JAllemandou created T263529: Prevent dumps-dependent jobs to wait indefinitely.
Sep 22 2020, 9:35 AM · Analytics-Kanban, Analytics

Sep 21 2020

JAllemandou moved T258047: Import page_props table to Hive from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Sep 21 2020, 7:28 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T258033: Stats for newer projects not available.

@The_Discoverer - Done :)

Sep 21 2020, 7:21 PM · Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou moved T258033: Stats for newer projects not available from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Sep 21 2020, 12:56 PM · Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou moved T258033: Stats for newer projects not available from In Progress to In Code Review on the Analytics-Kanban board.
Sep 21 2020, 12:49 PM · Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou added a comment to T258033: Stats for newer projects not available.

Just checked: those projects are now available on labsdb as well as the analytics replica.
Adding them to the sqoop-list.

Sep 21 2020, 12:44 PM · Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou moved T258033: Stats for newer projects not available from Next Up to In Progress on the Analytics-Kanban board.
Sep 21 2020, 12:40 PM · Analytics-Kanban, Analytics-Wikistats, Analytics
JAllemandou added a project to T258047: Import page_props table to Hive: Analytics-Kanban.
Sep 21 2020, 12:39 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T262184: Sort editors-by-country by descending editor-ceil value in cassandra from Ready to Deploy to Done on the Analytics-Kanban board.
Sep 21 2020, 12:39 PM · Analytics-Kanban, Analytics
JAllemandou moved T258047: Import page_props table to Hive from Next Up to In Code Review on the Analytics-Kanban board.
Sep 21 2020, 12:39 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou claimed T258047: Import page_props table to Hive.
Sep 21 2020, 12:39 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou updated subscribers of T262920: Indexing errors / malformed logs for aqs on cassandra timeout.

After talking with @elukey we're not sure if there is anything we want to do here.
The error happened when Luca roll-restarted the AQS hosts.
@Pchelolo : Do you have any idea if upgrading hyperswitch would change anything here?

Sep 21 2020, 10:53 AM · Analytics, observability
JAllemandou added a comment to T262261: Check whether mediawiki production event data is equivalent to mediawiki-history data over a month .

Last news on revision_create for simplewiki 2020-07:

  • All kafka-events match mediawiki-history except the ones with deleted-parts (mostly because performer_id is hidden, leading to a mismatch in row_hash_key) - Should be fixable with revision-visibility-change events.
  • Some events of mediawiki-history are missing kafka-events - see T215001
  • Some kafka-events are duplicated - see T262203
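To illustrate the kind of comparison behind these findings, here is a simplified Python sketch that uses mediawiki-history revision ids as the baseline and flags missing and duplicated events. It only compares ids (the real check also compared a row hash), and the function name and sample ids are hypothetical.

```python
from collections import Counter

def compare_to_baseline(history_rev_ids, event_rev_ids):
    """Return (missing, duplicated) revision ids.

    missing: present in the baseline but with no event.
    duplicated: revision ids seen in more than one event.
    """
    event_counts = Counter(event_rev_ids)
    missing = sorted(set(history_rev_ids) - set(event_counts))
    duplicated = sorted(r for r, c in event_counts.items() if c > 1)
    return missing, duplicated

# Hypothetical ids: revision 3 has no event, revision 2 has two events.
missing, duplicated = compare_to_baseline([1, 2, 3, 4], [1, 2, 2, 4])
print(missing, duplicated)
```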
Sep 21 2020, 9:44 AM · Analytics-Kanban, Analytics
JAllemandou updated subscribers of T263157: Process to check approximate correctness of analytics pipeline outputs.

I see 2 aspects that this task covers:

  • correctness of ETL algorithms (unit-testing using real data) - deequ fits under this category. AFAIK Analytics has not yet worked with any library in that space.
  • anomaly detection over data - the work done by @mforns and the research team on the data-quality pipeline and its RSVD algo for anomaly detection

@EBernhardson : How would you like us to help?

Sep 21 2020, 9:40 AM · Discovery-Search

Sep 17 2020

JAllemandou added a comment to T263055: Add log entry details to page and user events in EventBus.

While it is possible to join streams with Flink, we'll start by using log-events in hadoop in batch mode. I agree it makes more sense to add a log_id in events if there is a log stream :) That represents some more work (adding yet another new event-type), but I like it :)

Sep 17 2020, 3:32 PM · Platform Engineering, Analytics-Kanban, Analytics
JAllemandou added a comment to T215001: Revisions missing from mediawiki_revision_create.

@Jalemayehu I assume the last table is the missing revisions. Which wiki are these from?

I assume that was for me :) Indeed the table shows the first 100 missing revisions ordered by rev_id for simplewiki in July 2020.

Sep 17 2020, 3:28 PM · Analytics-Kanban, Growth-Team, Product-Analytics, Analytics
JAllemandou added a comment to T215001: Revisions missing from mediawiki_revision_create.

Ping @Milimetric , @Ottomata and @Pchelolo on that one please :)

Sep 17 2020, 8:02 AM · Analytics-Kanban, Growth-Team, Product-Analytics, Analytics
JAllemandou added a comment to T215001: Revisions missing from mediawiki_revision_create.

Things I have checked:

  • Similar ratio of revisions made by anonymous users vs registered users between missing revisions and non-missing ones. Doesn't seem anonymous/registered related
  • Similar ratio of namespaces - doesn't seem namespace related
  • Missing revisions usually happen in small batches, close in both revision_id and time (100 rows, can show all if needed):
+---------------------+----------------+                                        
|hudi_event_timestamp |hudi_revision_id|
+---------------------+----------------+
|2020-07-01 03:47:00.0|7013511         |
|2020-07-01 03:47:00.0|7013512         |
|2020-07-01 14:43:02.0|7025295         |
|2020-07-01 17:24:12.0|7023563         |
|2020-07-01 18:04:52.0|7014298         |
|2020-07-01 18:04:52.0|7014299         |
|2020-07-01 18:04:52.0|7014300         |
|2020-07-01 18:04:52.0|7014301         |
|2020-07-01 18:16:19.0|7014313         |
|2020-07-01 18:16:20.0|7014314         |
|2020-07-01 18:16:59.0|7014316         |
|2020-07-01 18:16:59.0|7014317         |
|2020-07-01 18:17:00.0|7014318         |
|2020-07-01 18:17:00.0|7014319         |
|2020-07-01 18:17:20.0|7014322         |
|2020-07-01 18:17:20.0|7014323         |
|2020-07-01 18:17:21.0|7014324         |
|2020-07-01 18:17:21.0|7014325         |
|2020-07-01 18:17:41.0|7014326         |
|2020-07-01 18:17:41.0|7014327         |
|2020-07-01 18:18:20.0|7014328         |
|2020-07-01 18:18:21.0|7014329         |
|2020-07-01 18:21:05.0|7014340         |
|2020-07-01 18:21:05.0|7014341         |
|2020-07-01 20:22:37.0|7014496         |
|2020-07-01 20:22:38.0|7014497         |
|2020-07-01 21:07:12.0|7014578         |
|2020-07-01 21:07:13.0|7014579         |
|2020-07-01 21:07:13.0|7014580         |
|2020-07-01 21:07:13.0|7014581         |
|2020-07-01 22:00:41.0|7014667         |
|2020-07-02 12:10:11.0|7015244         |
|2020-07-02 12:10:11.0|7015245         |
|2020-07-02 13:20:18.0|7015316         |
|2020-07-02 15:33:01.0|7015455         |
|2020-07-02 15:33:02.0|7015457         |
|2020-07-02 15:33:32.0|7015459         |
|2020-07-02 17:59:26.0|7015621         |
|2020-07-02 23:18:03.0|7021670         |
|2020-07-03 00:25:20.0|7016025         |
|2020-07-03 01:22:05.0|7021668         |
|2020-07-03 04:49:09.0|7016147         |
|2020-07-03 04:49:09.0|7016148         |
|2020-07-03 13:45:46.0|7016538         |
|2020-07-03 13:48:02.0|7016554         |
|2020-07-03 13:48:02.0|7016555         |
|2020-07-03 17:00:10.0|7016867         |
|2020-07-03 17:00:15.0|7016869         |
|2020-07-03 17:00:15.0|7016871         |
|2020-07-03 17:00:20.0|7016874         |
|2020-07-03 17:00:21.0|7016876         |
|2020-07-03 17:00:53.0|7016878         |
|2020-07-03 17:00:54.0|7016880         |
|2020-07-03 17:09:54.0|7016901         |
|2020-07-03 17:10:07.0|7016904         |
|2020-07-03 17:10:16.0|7016906         |
|2020-07-03 17:11:49.0|7016910         |
|2020-07-03 17:25:45.0|7016931         |
|2020-07-03 17:28:34.0|7016940         |
|2020-07-03 17:29:10.0|7016941         |
|2020-07-03 17:29:32.0|7016942         |
|2020-07-03 17:29:32.0|7016943         |
|2020-07-03 17:33:34.0|7016955         |
|2020-07-03 18:00:36.0|7021672         |
|2020-07-03 20:31:46.0|7032969         |
|2020-07-03 20:49:38.0|7033012         |
|2020-07-03 21:11:12.0|7017212         |
|2020-07-03 23:00:55.0|7017325         |
|2020-07-04 01:53:41.0|7017466         |
|2020-07-04 01:53:41.0|7017467         |
|2020-07-04 13:06:25.0|7018133         |
|2020-07-04 13:06:25.0|7018134         |
|2020-07-04 17:00:27.0|7018257         |
|2020-07-04 17:00:28.0|7018258         |
|2020-07-04 18:44:52.0|7018334         |
|2020-07-04 18:44:52.0|7018335         |
|2020-07-04 19:07:34.0|7018359         |
|2020-07-04 19:07:34.0|7018360         |
|2020-07-04 22:23:46.0|7018853         |
|2020-07-04 22:23:46.0|7018854         |
|2020-07-04 22:52:17.0|7018953         |
|2020-07-04 22:52:17.0|7018954         |
|2020-07-04 23:09:07.0|7019128         |
|2020-07-04 23:09:07.0|7019129         |
|2020-07-05 02:45:22.0|7019780         |
|2020-07-05 02:45:22.0|7019781         |
|2020-07-05 02:45:34.0|7019782         |
|2020-07-05 02:45:34.0|7019783         |
|2020-07-05 03:42:44.0|7019821         |
|2020-07-05 03:42:44.0|7019822         |
|2020-07-05 03:58:08.0|7019830         |
|2020-07-05 03:58:08.0|7019831         |
|2020-07-05 06:03:42.0|7021674         |
|2020-07-05 07:03:56.0|7019956         |
|2020-07-05 07:23:00.0|7019972         |
|2020-07-05 16:43:50.0|7020410         |
|2020-07-05 16:44:08.0|7020411         |
|2020-07-05 18:44:35.0|7020539         |
|2020-07-05 18:44:35.0|7020540         |
|2020-07-05 18:45:36.0|7020542         |
+---------------------+----------------+
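One way to make the "small batches" pattern visible is to group the missing revision ids into runs of consecutive ids. A minimal Python sketch, using the first few revision ids from the table above as sample input (the function name is made up for illustration):

```python
def consecutive_runs(rev_ids):
    """Group revision ids into runs of consecutive values."""
    runs = []
    for rev_id in sorted(rev_ids):
        if runs and rev_id == runs[-1][-1] + 1:
            runs[-1].append(rev_id)   # extends the current run
        else:
            runs.append([rev_id])     # starts a new run
    return runs

# First eight missing revision ids from the table above.
sample = [7013511, 7013512, 7025295, 7023563,
          7014298, 7014299, 7014300, 7014301]

for run in consecutive_runs(sample):
    print(len(run), run[0], run[-1])
```

Note this groups by revision id only; grouping by timestamp proximity as well would need a time-gap threshold, which is left out here.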
Sep 17 2020, 8:01 AM · Analytics-Kanban, Growth-Team, Product-Analytics, Analytics
JAllemandou added a comment to T215001: Revisions missing from mediawiki_revision_create.

Now is my time for this. Here is some data for simplewiki only in presto for July and August 2020:

  • July
select count(distinct rev_id) from event.mediawiki_revision_create where year = 2020 and ((month = 6 and day = 30) OR (month = 7) OR (month = 8 and day = 1)) and rev_timestamp like '2020-07%' and database = 'simplewiki';
 _col0 
-------
 38771 
(1 row)
Sep 17 2020, 7:43 AM · Analytics-Kanban, Growth-Team, Product-Analytics, Analytics
JAllemandou added a comment to T263055: Add log entry details to page and user events in EventBus.

My 2 cents on this.
For automated usage of data, we want to be able to join representations of the same actions in a (hopefully) non-fuzzy way. For revisions, the database + revision_id should be unique, good. For other events such as page-create, page-rename etc, the action representation in the mediawiki database is a row in the logging table. While I agree that in an event-centered world we'd possibly prefer to put the event-id in the logging table, pushing toward events as the source of truth, both implementations work for our purpose and I don't really mind one or the other.

Sep 17 2020, 6:32 AM · Platform Engineering, Analytics-Kanban, Analytics

Sep 15 2020

JAllemandou awarded T262942: PoC on anomaly detection with Flink a Love token.
Sep 15 2020, 6:58 PM · Discovery-Search (Current work), Analytics-Radar, Wikidata, Wikidata-Query-Service

Sep 14 2020

JAllemandou added a comment to T262203: Duplicated revision_create events.

Ping @Ottomata and @Pchelolo - I found something even more problematic: the duplication can happen with a change of performer!

presto> SELECT
     ->   database,
     ->   rev_id,
     ->   array_distinct(array_agg(performer.user_text)) as users,
     ->   COUNT(1) AS rev_id_count_gt_1
     -> FROM event.mediawiki_revision_create
     -> WHERE year = 2020 and month = 9
     -> GROUP BY database, rev_id
     -> HAVING COUNT(1) > 1 AND cardinality(array_distinct(array_agg(performer.user_text))) > 1
     -> LIMIT 10;
 database |  rev_id   |             users             | rev_id_count_gt_1 
----------+-----------+-------------------------------+-------------------
 enwiki   | 977034865 | [104.235.36.59, 96.33.68.122] |                 2 
(1 row)
Sep 14 2020, 6:02 PM · Platform Team Workboards (Clinic Duty Team), Analytics-Radar, Event-Platform
JAllemandou created T262826: Purge raw webrequest_stats and webrequest_stats_hourly.
Sep 14 2020, 3:13 PM · Analytics-Kanban, Analytics
JAllemandou added a comment to T262203: Duplicated revision_create events.

There are events with different timestamps. In July 2020 (when the number of dups peaked) I found a revision with up to 16 events, each with a different timestamp (1h difference between each).
Recent data:

presto> SELECT
     ->   database,
     ->   rev_id,
     ->   array_distinct(array_agg(meta.dt)) as timestamps,
     ->   COUNT(1) AS rev_id_count_gt_1
     -> FROM event.mediawiki_revision_create
     -> WHERE year = 2020 and month = 9
     -> GROUP BY database, rev_id
     -> HAVING COUNT(1) > 1 AND cardinality(array_distinct(array_agg(meta.dt))) > 1
     -> LIMIT 10;
  database   |  rev_id   |                  timestamps                  | rev_id_count_gt_1 
-------------+-----------+----------------------------------------------+-------------------
 commonswiki | 451735365 | [2020-09-08T06:25:50Z, 2020-09-08T06:25:51Z] |                 2 
 enwiki      | 977973886 | [2020-09-12T03:08:38Z, 2020-09-12T03:08:39Z] |                 2 
 arwiki      |  50310422 | [2020-09-09T22:56:19Z, 2020-09-09T22:56:20Z] |                 2 
 azwiki      |   5441991 | [2020-09-11T06:45:47Z, 2020-09-11T06:45:41Z] |                 2 
 dewiki      | 203613472 | [2020-09-12T18:41:03Z, 2020-09-12T18:41:04Z] |                 2 
 enwiki      | 977958479 | [2020-09-12T01:02:55Z, 2020-09-12T01:02:53Z] |                 2 
 enwiki      | 977080082 | [2020-09-06T20:17:26Z, 2020-09-06T20:17:59Z] |                 2 
 commonswiki | 454236590 | [2020-09-10T16:45:15Z, 2020-09-10T16:45:16Z] |                 2 
 enwiki      | 978150065 | [2020-09-13T05:54:09Z, 2020-09-13T05:54:08Z] |                 2 
 enwiki      | 977839636 | [2020-09-11T08:32:07Z, 2020-09-11T08:32:06Z] |                 2 
(10 rows)
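
A minimal Python sketch of how a consumer could tolerate such duplicates, assuming events are plain dicts with `database`, `rev_id` and `meta.dt` fields as in the query above. The function name `dedupe_revision_events` is hypothetical, and keeping the earliest event is an arbitrary choice made for illustration, not an established rule.

```python
def dedupe_revision_events(events):
    """Keep one event per (database, rev_id): the one with the earliest meta.dt."""
    kept = {}
    for event in events:
        key = (event["database"], event["rev_id"])
        current = kept.get(key)
        # ISO-8601 UTC timestamps in the same format compare correctly as strings
        if current is None or event["meta"]["dt"] < current["meta"]["dt"]:
            kept[key] = event
    return list(kept.values())


# Two of the duplicated revisions from the result above
events = [
    {"database": "azwiki", "rev_id": 5441991, "meta": {"dt": "2020-09-11T06:45:47Z"}},
    {"database": "azwiki", "rev_id": 5441991, "meta": {"dt": "2020-09-11T06:45:41Z"}},
    {"database": "dewiki", "rev_id": 203613472, "meta": {"dt": "2020-09-12T18:41:03Z"}},
    {"database": "dewiki", "rev_id": 203613472, "meta": {"dt": "2020-09-12T18:41:04Z"}},
]

deduped = dedupe_revision_events(events)
```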
Sep 14 2020, 2:00 PM · Platform Team Workboards (Clinic Duty Team), Analytics-Radar, Event-Platform
JAllemandou added a comment to T262141: pagecounts-ez of month 2020-08 is incomplete.
NOTE: Talking about pagecounts-ez folder below, not other pageview/pagecount folders.
Sep 14 2020, 8:07 AM · Analytics-Kanban, Analytics
JAllemandou updated the task description for T260409: Establish what data must be backed up before the HDFS upgrade.
Sep 14 2020, 7:23 AM · Analytics-Kanban, Analytics

Sep 13 2020

JAllemandou added a comment to T262742: REST API pageviews won't fetch / incorrectly fetching using URL.

Hi @Onedaytheywokemeup, thanks for reporting :)
I can't reproduce TypeError: Failed to fetch using the API doc.
I do however confirm that the returned result is not the expected one for the Dubstep page, due to the missing capital D in Dubstep in the query:

Sep 13 2020, 7:35 AM · Analytics, Pageviews-API

Sep 10 2020

JAllemandou added a comment to T262203: Duplicated revision_create events.

Wow, that looks like you found it @Pchelolo! I wish I could find bugs as fast as you do :)

Sep 10 2020, 4:24 PM · Platform Team Workboards (Clinic Duty Team), Analytics-Radar, Event-Platform
JAllemandou added a comment to T262203: Duplicated revision_create events.

duplicates should be tolerated.

Sep 10 2020, 4:16 PM · Platform Team Workboards (Clinic Duty Team), Analytics-Radar, Event-Platform
JAllemandou added a comment to T262203: Duplicated revision_create events.

so 0.006% of revision create events *might be* unnecessarily purged

Sep 10 2020, 4:11 PM · Platform Team Workboards (Clinic Duty Team), Analytics-Radar, Event-Platform
JAllemandou added a comment to T262203: Duplicated revision_create events.

More info on how frequently it happened:

Sep 10 2020, 4:07 PM · Platform Team Workboards (Clinic Duty Team), Analytics-Radar, Event-Platform
JAllemandou moved T262261: Check whether mediawiki production event data is equivalent to mediawiki-history data over a month from Next Up to In Progress on the Analytics-Kanban board.
Sep 10 2020, 11:57 AM · Analytics-Kanban, Analytics