JAllemandou (joal)
Data Engineer


User Details

User Since
Feb 11 2015, 6:02 PM (199 w, 5 d)
Availability
Available
IRC Nick
joal
LDAP User
Unknown
MediaWiki User
JAllemandou (WMF) [ Global Accounts ]

Recent Activity

Yesterday

JAllemandou added a comment to T211627: Mediawiki history has no data on IP blocks.

Hi @nettrom_WMF ,
Indeed the mediawiki_history table doesn't contain historical blocks (or actually group, but it's not relevant).
The approach taken when rebuilding user history through the logging table was to concentrate on registered users, as IPs are by nature multi-user, changing over time, etc.
The use-case you describe here makes a lot of sense, and while we currently have too many things on our plate, I'll talk with the team about prioritizing adding blocks for IPs.

Mon, Dec 10, 8:46 PM · Anti-Harassment, Product-Analytics, Analytics
JAllemandou added a comment to T202490: Automate XML-to-parquet transformation for XML dumps (oozie job).

OMG ! Done ... Sorry for having skipped that one :S

Mon, Dec 10, 7:23 PM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou reassigned T209031: Not able to scoop comment table in labs for mediawiki reconstruction process from JAllemandou to Milimetric.
Mon, Dec 10, 10:17 AM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou moved T211000: Failure while refining webrequest upload 2018-12-01-14 from In Progress to In Code Review on the Analytics-Kanban board.
Mon, Dec 10, 10:16 AM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T211535: Comments missing in mediawiki_history table.

Thanks for pointing this out @Tgr. This is indeed a known issue due to the comment-storage change.
Comments are unavailable for the 2018-10 and 2018-11 snapshots, and we are currently devising a solution so that 2018-12 should have them.

Mon, Dec 10, 8:32 AM · Analytics

Mon, Dec 3

JAllemandou added a comment to T211000: Failure while refining webrequest upload 2018-12-01-14.

Something to notice is that the problem with the rows listed above occurs throughout the hour, but only shows up at the boundary of the calendar hour, since that is when we finalize the batch. The following query counts how many rows have big discrepancies in sequence numbers (the next sequence being more than 100 higher instead of 1) when grouped into per-minute batches (instead of hour batches), for the same hour as above:

spark.sql("""
SELECT COUNT(1) FROM (
SELECT
  sequence,
  LAG(sequence) OVER minute_hostname_window AS prev_seq
FROM wmf_raw.webrequest
WHERE webrequest_source = 'upload'
  AND year = 2018
  AND month = 12
  AND day = 1
  AND hour = 14
  AND dt != '-'
-- Catch only requests having a big diff in seq
HAVING abs(sequence - prev_seq) > 100
WINDOW minute_hostname_window AS (PARTITION BY substr(dt, 15, 2), hostname ORDER by hour, sequence)
) t
""").show(1000)
Mon, Dec 3, 7:57 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T211000: Failure while refining webrequest upload 2018-12-01-14.

Problematic requests for end-of-hour upload 2018-12-01T14:

spark.sql("""
SELECT
  hour,
  hostname,
  sequence,
  LAG(sequence) OVER hour_hostname_window AS prev_seq,
  (sequence - LAG(sequence) OVER hour_hostname_window) AS diff_seq,
  dt,
  LAG(dt) OVER hour_hostname_window AS prev_dt
FROM wmf_raw.webrequest
WHERE webrequest_source = 'upload'
  AND year = 2018
  AND month = 12
  AND day = 1
  AND hour = 14
  AND dt != '-'
HAVING (sequence - prev_seq > 10)
WINDOW hour_hostname_window AS (PARTITION BY hour, hostname ORDER by hour, sequence)
""").show(1000)
Mon, Dec 3, 7:35 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou added a comment to T211000: Failure while refining webrequest upload 2018-12-01-14.

Operational problem solved: the refinement job has been restarted with higher error-acceptance rates, and the blocked jobs have caught up.
More analysis to come to try to root-cause this.

Mon, Dec 3, 7:08 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time from Next Up to Paused on the Analytics-Kanban board.
Mon, Dec 3, 5:54 PM · Analytics-Kanban, Analytics, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), User-Addshore, Patch-For-Review, WMDE-Analytics-Engineering, Wikidata
JAllemandou added a comment to T211000: Failure while refining webrequest upload 2018-12-01-14.

TL;DR: 83 requests in the failed hour are responsible for the failure. Their sequence numbers are much larger than the others in the current hour, while their timestamps are still assigned to that hour (this is not the dt = '-' case).

Mon, Dec 3, 2:48 PM · Patch-For-Review, Analytics-Kanban, Analytics

Fri, Nov 30

JAllemandou added a comment to T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time.

Info backfilled since the beginning of time: https://grafana.wikimedia.org/dashboard/db/wikidata-co-editors?orgId=1&from=now-8y&to=now
Will keep an eye on next month run.

Fri, Nov 30, 9:21 PM · Analytics-Kanban, Analytics, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), User-Addshore, Patch-For-Review, WMDE-Analytics-Engineering, Wikidata
JAllemandou added a project to T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time: Analytics.
Fri, Nov 30, 2:59 PM · Analytics-Kanban, Analytics, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), User-Addshore, Patch-For-Review, WMDE-Analytics-Engineering, Wikidata
JAllemandou added a comment to T193641: track number of editors from other Wikimedia projects who also edit on Wikidata over time.

Thanks for raising the issue. This is very bizarre.
The job for October was showing as successful on our side. I reran it, and the data showed up :(
I have the feeling this is not the first time this has happened; something must be wrong somewhere.
I'm also going to backfill the info.

Fri, Nov 30, 2:58 PM · Analytics-Kanban, Analytics, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), User-Addshore, Patch-For-Review, WMDE-Analytics-Engineering, Wikidata
JAllemandou moved T164201: AQS unique devices api should report offset/underestimate separately from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Fri, Nov 30, 2:11 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T210542: Update datasets definitions and oozie jobs for dual-sqoop of comments and actors from In Progress to In Code Review on the Analytics-Kanban board.
Fri, Nov 30, 2:10 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou renamed T210542: Update datasets definitions and oozie jobs for dual-sqoop of comments and actors from Update datasets definitions and oozie jobs to Update datasets definitions and oozie jobs for dual-sqoop of comments and actors.
Fri, Nov 30, 1:57 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T209031: Not able to scoop comment table in labs for mediawiki reconstruction process from Parent Tasks to In Progress on the Analytics-Kanban board.
Fri, Nov 30, 12:47 PM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou set the point value for T209031: Not able to scoop comment table in labs for mediawiki reconstruction process to 0.
Fri, Nov 30, 12:46 PM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou moved T209031: Not able to scoop comment table in labs for mediawiki reconstruction process from In Progress to Parent Tasks on the Analytics-Kanban board.
Fri, Nov 30, 12:46 PM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou moved T210110: [EventLogging Sanitization] Fix passing of input_path_regex params to Refine from Ready to Deploy to Done on the Analytics-Kanban board.
Fri, Nov 30, 12:45 PM · Patch-For-Review, Analytics-Kanban, Analytics
JAllemandou moved T209179: Update log_namespace, page_namespace from bigint to int from In Code Review to Done on the Analytics-Kanban board.
Fri, Nov 30, 12:45 PM · Patch-For-Review, Analytics-Kanban, Analytics

Thu, Nov 29

JAllemandou assigned T206279: Hive join fails when using a HiveServer2 client to elukey.
Thu, Nov 29, 2:33 PM · Analytics-Kanban, Analytics-Cluster, Analytics, Contributors-Analysis, Product-Analytics
JAllemandou moved T206279: Hive join fails when using a HiveServer2 client from Next Up to Done on the Analytics-Kanban board.
Thu, Nov 29, 2:33 PM · Analytics-Kanban, Analytics-Cluster, Analytics, Contributors-Analysis, Product-Analytics
JAllemandou added a comment to T206279: Hive join fails when using a HiveServer2 client.

This should be solved by https://gerrit.wikimedia.org/r/c/operations/puppet/cdh/+/474907

Thu, Nov 29, 2:32 PM · Analytics-Kanban, Analytics-Cluster, Analytics, Contributors-Analysis, Product-Analytics

Tue, Nov 27

JAllemandou added a comment to T209731: Choose HDFS paths and partitioning for ORES scores.

Another comment about folders that I hadn't thought of before reading your update in the description: I actually think the chosen layout is not the most efficient.
In terms of data retrieval, using /wmf/data/ores/score/wiki=enwiki/model=editquality/model_version=2.1/ or /wmf/data/ores/score/model=editquality/model_version=2.1/wiki=enwiki/ is very similar.
However, they differ at the deletion step: the latter is a lot easier, as it involves only a single delete for the whole model.
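To illustrate the deletion point (a plain-Python sketch with illustrative paths, not the actual HDFS tooling): with model and model version ahead of wiki in the partition hierarchy, one recursive delete removes a whole model version, while the wiki-first layout needs one delete per wiki.

```python
# Count how many recursive deletes (e.g. `hdfs dfs -rm -r <prefix>`) are
# needed to remove every partition belonging to one model version.
def deletes_needed(paths, model, version):
    marker = f"model={model}/model_version={version}/"
    # Each distinct prefix ending at the model-version directory is one
    # recursive delete.
    prefixes = {p[: p.index(marker) + len(marker)] for p in paths}
    return len(prefixes)

wikis = ("enwiki", "frwiki", "dewiki")
model_first = [
    f"/wmf/data/ores/score/model=editquality/model_version=2.1/wiki={w}/"
    for w in wikis
]
wiki_first = [
    f"/wmf/data/ores/score/wiki={w}/model=editquality/model_version=2.1/"
    for w in wikis
]

print(deletes_needed(model_first, "editquality", "2.1"))  # prints 1
print(deletes_needed(wiki_first, "editquality", "2.1"))   # prints 3
```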

Tue, Nov 27, 8:42 PM · Analytics, ORES, Scoring-platform-team
JAllemandou added a comment to T203669: Return to real time banner impressions in Druid.

For reference, here is the request sent to druid for realtime ingestion:

curl -L -X POST -H 'Content-Type: application/json' -d '{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "test_kafka_event_centralnoticeimpression",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "flattenSpec": {
          "useFieldDiscovery": false,
          "fields": [
            "dt",
            { "type": "path", "name": "event_anonymous", "expr": "$.event.anonymous" },
            { "type": "path", "name": "event_banner", "expr": "$.event.banner" },
            { "type": "path", "name": "event_bannerCategory", "expr": "$.event.bannerCategory" },
            { "type": "path", "name": "event_bucket", "expr": "$.event.bucket" },
            { "type": "path", "name": "event_campaign", "expr": "$.event.campaign" },
            { "type": "path", "name": "event_campaignCategory", "expr": "$.event.campaignCategory" },
            { "type": "path", "name": "event_campaignCategoryUsesLegacy", "expr": "$.event.campaignCategoryUsesLegacy" },
            { "type": "path", "name": "event_country", "expr": "$.event.country" },
            { "type": "path", "name": "event_db", "expr": "$.event.db" },
            { "type": "path", "name": "event_device", "expr": "$.event.device" },
            { "type": "path", "name": "event_impressionEventSampleRate", "expr": "$.event.impressionEventSampleRate" },
            { "type": "path", "name": "event_project", "expr": "$.event.project" },
            { "type": "path", "name": "event_recordImpressionSampleRate", "expr": "$.event.recordImpressionSampleRate" },
            { "type": "path", "name": "event_region", "expr": "$.event.region" },
            { "type": "path", "name": "event_result", "expr": "$.event.result" },
            { "type": "path", "name": "event_status", "expr": "$.event.status" },
            { "type": "path", "name": "event_statusCode", "expr": "$.event.statusCode" },
            { "type": "path", "name": "event_uselang", "expr": "$.event.uselang" },
            "recvFrom",
            { "type": "path", "name": "ua_browser_family", "expr": "$.userAgent.browser_family" },
            { "type": "path", "name": "ua_browser_major", "expr": "$.userAgent.browser_major" },
            { "type": "path", "name": "ua_device_family", "expr": "$.userAgent.device_family" },
            { "type": "path", "name": "ua_is_bot", "expr": "$.userAgent.is_bot" },
            { "type": "path", "name": "ua_is_mediawiki", "expr": "$.userAgent.is_mediawiki" },
            { "type": "path", "name": "ua_os_family", "expr": "$.userAgent.os_family" },
            { "type": "path", "name": "ua_os_major", "expr": "$.userAgent.os_major" },
            { "type": "path", "name": "ua_wmf_app_version", "expr": "$.userAgent.wmf_app_version" },
            "webHost",
            "wiki"
          ]
        },
        "timestampSpec": {
          "column": "dt",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": [
            "event_anonymous",
            "event_banner",
            "event_bannerCategory",
            "event_bucket",
            "event_campaign",
            "event_campaignCategory",
            "event_campaignCategoryUsesLegacy",
            "event_country",
            "event_db",
            "event_device",
            "event_impressionEventSampleRate",
            "event_project",
            "event_recordImpressionSampleRate",
            "event_region",
            "event_result",
            "event_status",
            "event_statusCode",
            "event_uselang",
            "recvFrom",
            "ua_browser_family",
            "ua_browser_major",
            "ua_device_family",
            "ua_is_bot",
            "ua_is_mediawiki",
            "ua_os_family",
            "ua_os_major",
            "ua_wmf_app_version",
            "webHost",
            "wiki"
          ]
        }
      }
    },
    "transformSpec": {
      "transforms": [
        {
          "type": "expression",
          "name": "event_inverseRecordImpressionSampleRate",
          "expression": "1 / event_recordImpressionSampleRate" }
      ]
    },
    "metricsSpec": [
      {
        "name": "event_count",
        "type": "count"
      },
      {
        "name": "event_normalized_count",
        "type": "doubleSum",
        "fieldName": "event_inverseRecordImpressionSampleRate"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "SECOND"
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "maxRowsPerSegment": 5000000
  },
  "ioConfig": {
    "topic": "eventlogging_CentralNoticeImpression",
    "consumerProperties": {
      "bootstrap.servers": "kafka-jumbo1001.eqiad.wmnet:9092"
    },
    "taskCount": 1,
    "replicas": 3,
    "taskDuration": "PT10M"
  }
}' http://druid1001.eqiad.wmnet:8090/druid/indexer/v1/supervisor
Tue, Nov 27, 10:45 AM · Analytics-Kanban, User-Elukey, Analytics
JAllemandou added a comment to T203669: Return to real time banner impressions in Druid.

I have launched a realtime job indexing values flowing in kafka. Data can be seen here (please notice the event normalized count metric :) :

Tue, Nov 27, 10:25 AM · Analytics-Kanban, User-Elukey, Analytics

Mon, Nov 26

JAllemandou added a comment to T209731: Choose HDFS paths and partitioning for ORES scores.

Will the order of partitions make a difference? For example, if consumers are more likely to get multiple models of scores for a single wiki, vs. multiple wikis for a single model, would we put the wiki partition before model partition in the HDFS path, or would this have no effect?

Mon, Nov 26, 8:07 PM · Analytics, ORES, Scoring-platform-team
JAllemandou added a comment to T209731: Choose HDFS paths and partitioning for ORES scores.

I support the idea of using model name and version as partitions. Wiki_db would possibly be another good fit if requests will most often be on singular projects. Finally, if we partition on model and name, time is not needed I assume.

Mon, Nov 26, 7:29 PM · Analytics, ORES, Scoring-platform-team
JAllemandou updated the task description for T209031: Not able to scoop comment table in labs for mediawiki reconstruction process.
Mon, Nov 26, 1:21 PM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou moved T208614: Presto on Cloud Platform Design Document from Next Up to In Progress on the Analytics-Kanban board.
Mon, Nov 26, 1:19 PM · Analytics-Kanban, Analytics

Fri, Nov 23

JAllemandou added a comment to T208569: Get Wikidata clickstream.

Hi @GoranSMilovanovic, the code we use to generate monthly data is here: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/ClickstreamBuilder.scala
As for the clickstream database in Hive, it's not used anymore; it's a left-over from Ellery's time.

Fri, Nov 23, 2:18 PM · User-GoranSMilovanovic, WMDE-Analytics-Engineering, Wikidata, Wikidata-Termbox-Hike

Thu, Nov 22

JAllemandou added a comment to T210006: Event counts from Mysql and Hive don't match. Refine is persisting data from crawlers.

@Nuria : Do we know why those events don't get inserted into MySQL?

Thu, Nov 22, 8:55 AM · Product-Analytics, Analytics

Wed, Nov 21

JAllemandou added a comment to T203669: Return to real time banner impressions in Druid.

We could add an event property with the value calculated per event--that is, the 1/sample rate. I imagine that float value could be put directly in Druid, which would then just have to sum all the values in the results of each query. Can Druid/Turnilo do that? That way, perhaps we could still have the metric with the new tool?

The solution you proposed might work though - if 1/sample-rate were available in the event itself, we could ingest it into Druid as a measure (and hence have it correctly typed as float/double) and possibly plot it in Turnilo without too many problems. @mforns what do you think?
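A toy sketch of the normalized-count idea (numbers are illustrative, and the field name mirrors the event schema discussed here): each sampled event contributes 1/sample_rate, and summing those contributions estimates the unsampled total, which is what a Druid doubleSum metric over such a field computes.

```python
# Each event carries its sampling rate; 1/rate is the number of real
# events this sampled event stands for.
events = [{"recordImpressionSampleRate": 0.01}] * 5

normalized_count = sum(1 / e["recordImpressionSampleRate"] for e in events)
print(normalized_count)  # prints 500.0 (5 events at 1% sampling)
```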

Wed, Nov 21, 12:06 PM · Analytics-Kanban, User-Elukey, Analytics

Tue, Nov 20

JAllemandou moved T209536: Hive query fails with local join from Next Up to Done on the Analytics-Kanban board.
Tue, Nov 20, 3:58 PM · Patch-For-Review, Product-Analytics, Analytics-Kanban, Analytics

Mon, Nov 19

JAllemandou added a comment to T209742: Purge ORES scores from Hadoop and begin backfill when model version changes.

Indeed, in Hadoop there is no such thing as 'in place'. The way to go could be to use the model version as a partition key. You'd backfill a new version while keeping the old ones (and therefore not breaking queries on old ones). Then the decision is how many versions to keep, and whether to keep them updated or not.
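A minimal stand-in sketch of the versioned-partition approach (a dict in place of HDFS directories; all names are illustrative): backfilling a new model version writes a new partition and leaves the old one untouched, so readers pinned to the old version keep working throughout.

```python
# Stand-in for HDFS: partition path -> stored score rows.
store = {}

def backfill(model, version, rows):
    """Write a full set of scores under a versioned partition path."""
    store[f"model={model}/model_version={version}"] = rows

def read(model, version):
    return store[f"model={model}/model_version={version}"]

backfill("editquality", "2.0", ["scores v2.0"])
backfill("editquality", "2.1", ["scores v2.1"])  # old version untouched

# Queries against the old version still succeed during/after backfill:
print(read("editquality", "2.0"))  # prints ['scores v2.0']
print(len(store))                  # prints 2 (both versions retained)
```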

Mon, Nov 19, 1:31 PM · Analytics, ORES, Scoring-platform-team
JAllemandou added a comment to T209731: Choose HDFS paths and partitioning for ORES scores.

It is worth looking at already-existing event data: if we want to reuse the logic that reads events and persists them to Hive, partitions cannot be schema-dependent.

Mon, Nov 19, 1:29 PM · Analytics, ORES, Scoring-platform-team
JAllemandou added a comment to T209611: [Epic] Make ORES scores available in Hadoop and as a dump.

Once the data is stored in Hadoop, it is easy to supply pageviews-like dump files.
As for how to compute the scores, hitting the ORES API is probably the shortest solution in terms of schedule; a next step might be to try to reproduce ORES scoring in PySpark. I already started a POC of that, but didn't get to the core of it: feature extraction from mediawiki text instead of the mediawiki API.

Mon, Nov 19, 1:25 PM · Dumps-Generation, Analytics, Epic, ORES, revscoring, artificial-intelligence, Scoring-platform-team
JAllemandou added a comment to T209536: Hive query fails with local join.

Hmm, the big uptick in errors I described above happened using HiveServer2. Do you mean that this happens on both, whereas T206279 happens on HiveServer2 only?

Mon, Nov 19, 1:21 PM · Patch-For-Review, Product-Analytics, Analytics-Kanban, Analytics
JAllemandou added a comment to T206883: mediawiki_history datasets have null user_text for IP edits.

I hear your point and it makes a lot of sense. I think our views differ on the notion of current name. In my world, a current name is associated with events only when we're sure those events were made by the same person/account; the event_user_text field references single users. I understand, however, that from a purely name-changing perspective, an IP has the same value then and now. My concern lies in misrepresentation of IP changes over time: if you use an IP as the current name for an edit, you might be tempted to consider other edits made by that IP as belonging to the same user - which is false (but true in the case of non-anonymous edits).

Mon, Nov 19, 1:19 PM · Product-Analytics, Analytics-Data-Quality, Analytics

Fri, Nov 16

JAllemandou added a comment to T206883: mediawiki_history datasets have null user_text for IP edits.

Hi @Neil_P._Quinn_WMF ,
While I understand the usage frustration, keeping the IPs in event_user_text_historical is for me a matter of data correctness.
Doing it would require almost no code change, but I'd rather not do it, to keep the semantics of event_user_text and event_user_text_historical valid and consistent for anonymous and non-anonymous edits.
Happy to continue the discussion and to get other opinions weighing in :)

Fri, Nov 16, 8:27 AM · Product-Analytics, Analytics-Data-Quality, Analytics
JAllemandou added a comment to T209536: Hive query fails with local join.

Is this related to T206279 ?

Fri, Nov 16, 8:23 AM · Patch-For-Review, Product-Analytics, Analytics-Kanban, Analytics
JAllemandou added a comment to T209655: Copy Wikidata dumps to HDFs.

Not sure when we'll be able to do that. However, there is a more recent dump available: /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001 :)

Fri, Nov 16, 8:17 AM · Wikidata, Research, Analytics

Thu, Nov 15

JAllemandou added a comment to T88775: Add mediacounts to pageview API.

Did some analysis in terms of data size and storage:

  • Our Cassandra instances (2 per host) each have 2.9 TB of usable space.
  • We currently use ~720 GB per instance (this accounts for all keyspaces, replication included).
  • 98% of those 720 GB is used for daily pageview-per-article data.
Thu, Nov 15, 8:42 PM · Multimedia, Analytics
JAllemandou updated subscribers of T209536: Hive query fails with local join.

Ping @elukey, @Ottomata: I think we should set hive.auto.convert.join = false in hive-site.xml so that map-side joins are never done automatically.
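For reference, the setting would live in hive-site.xml as a property block like the following (a sketch of the standard Hive property format, not an excerpt from our actual puppet-managed config):

```xml
<!-- Disable automatic conversion of common joins to map-side joins,
     so Hive never attempts a local (map-side) join on its own. -->
<property>
  <name>hive.auto.convert.join</name>
  <value>false</value>
</property>
```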

Thu, Nov 15, 5:59 PM · Patch-For-Review, Product-Analytics, Analytics-Kanban, Analytics
JAllemandou updated the task description for T209178: Refactor Mediawiki-Database ingestion.
Thu, Nov 15, 9:38 AM · Analytics-Kanban, Analytics
JAllemandou renamed T209178: Refactor Mediawiki-Database ingestion from Long term solution for sqooping comments to Refactor Mediawiki-Database ingestion.
Thu, Nov 15, 9:27 AM · Analytics-Kanban, Analytics

Tue, Nov 13

JAllemandou added a comment to T209031: Not able to scoop comment table in labs for mediawiki reconstruction process.

Thanks @chasemp for raising the point. We are aware of the 2 steps for sanitization.

Tue, Nov 13, 8:12 PM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou added a comment to T209031: Not able to scoop comment table in labs for mediawiki reconstruction process.

Thanks again a lot @Anomie, @daniel, @Bstorm for having chimed in, your questions and suggestions have helped us move forward.

  • Short-term solution: We have tested exporting data from the logging_compat view, and it's fast enough to get us by for now - a lot faster than getting the comment view on its own, or joining the logging and comment views. But we still can't export the revision table using the revision_compat view; it's too slow.
  • Long-term solution: Having invested more time in understanding how data sanitization happens for labs, we think a better solution would be to use the analytics-replica to get the data from MariaDB and then sanitize it in Hadoop. This will be a long project, and I will reuse the parent task to coordinate. Ideally, we'll just use your maintain-views.yaml logic directly, so we stay in sync with any changes made to the currently existing views' sanitization.
Tue, Nov 13, 7:52 PM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou added a comment to T209031: Not able to scoop comment table in labs for mediawiki reconstruction process.

@daniel: We were using the sanitized comment view (named comment), not the compat one (named revision_compat).
We tried two things:

  • export the comment view (select comment_id, comment_text from comment) - but this is very slow due to the way the comment view is built.
  • export a manually built equivalent of the logging_compat view - we were doing the join in the query, and it was also very slow as it took more than a day for the logging table.
Tue, Nov 13, 1:36 PM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou created P7798 labs-druid-query.
Tue, Nov 13, 1:15 PM
JAllemandou added a comment to T209031: Not able to scoop comment table in labs for mediawiki reconstruction process.
  • Access to underlying tables - We could query the underlying tables, and that would bypass any performance problems we have with the views. We would duplicate the sanitizing logic from the views, and maintain it to be always the same as it is in cloud db. This would require special permissions to the cloud db.

At that point you wouldn't be using the provided views so there'd be no point connecting out to the cloud DB replicas. And I don't think you'd get special permissions.

Tue, Nov 13, 10:06 AM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou added a comment to T177965: Wikistats 2 Backend: Resiliency, Rollback and Deployment of Data.

Ping @Nuria - Can we close this parent task?

Tue, Nov 13, 8:33 AM · Analytics-Kanban, Analytics-Wikistats
JAllemandou moved T209031: Not able to scoop comment table in labs for mediawiki reconstruction process from Next Up to In Progress on the Analytics-Kanban board.
Tue, Nov 13, 8:32 AM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou moved T209178: Refactor Mediawiki-Database ingestion from In Progress to Parent Tasks on the Analytics-Kanban board.
Tue, Nov 13, 8:32 AM · Analytics-Kanban, Analytics
JAllemandou set the point value for T209178: Refactor Mediawiki-Database ingestion to 0.
Tue, Nov 13, 8:32 AM · Analytics-Kanban, Analytics
JAllemandou updated subscribers of T209031: Not able to scoop comment table in labs for mediawiki reconstruction process.

Ping @Bstorm for the above comment, as she's the one who set up the views.
Thanks

Tue, Nov 13, 8:31 AM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou added a parent task for T209031: Not able to scoop comment table in labs for mediawiki reconstruction process: T209178: Refactor Mediawiki-Database ingestion.
Tue, Nov 13, 8:28 AM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou added a subtask for T209178: Refactor Mediawiki-Database ingestion: T209031: Not able to scoop comment table in labs for mediawiki reconstruction process.
Tue, Nov 13, 8:27 AM · Analytics-Kanban, Analytics
JAllemandou moved T202489: Copy monthly XML files from public-dumps to HDFS from Ready to Deploy to Done on the Analytics-Kanban board.
Tue, Nov 13, 8:21 AM · Patch-For-Review, Analytics-Kanban, Research, Analytics

Mon, Nov 12

JAllemandou added a comment to T202489: Copy monthly XML files from public-dumps to HDFS.

@Ottomata , @ArielGlenn - I'm ok with copying the dumps from a stat machine.
Let's see what @elukey thinks of it

Mon, Nov 12, 9:22 PM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou updated subscribers of T209031: Not able to scoop comment table in labs for mediawiki reconstruction process.

Thanks @Anomie. We (the analytics team) had also thought of a third potential solution. I list the 3 solutions below in increasing order of complexity (IMO):

Mon, Nov 12, 9:21 PM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou added a comment to T202489: Copy monthly XML files from public-dumps to HDFS.

Excuse me for butting in at this late date but these files are already available from labstore1006,7 to labs instances and on stats100? (I forget which one now). Do you need them to be available somewhere else?

Mon, Nov 12, 6:32 PM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou moved T209178: Refactor Mediawiki-Database ingestion from Next Up to In Progress on the Analytics-Kanban board.
Mon, Nov 12, 5:23 PM · Analytics-Kanban, Analytics

Nov 8 2018

JAllemandou moved T208550: bothersome output in hive when querying events database from In Code Review to Done on the Analytics-Kanban board.
Nov 8 2018, 5:01 PM · Patch-For-Review, Analytics-Kanban, Analytics
Krenair awarded T209031: Not able to scoop comment table in labs for mediawiki reconstruction process a Evil Spooky Haunted Tree token.
Nov 8 2018, 2:46 PM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou updated subscribers of T209031: Not able to scoop comment table in labs for mediawiki reconstruction process.
Nov 8 2018, 2:02 PM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou updated subscribers of T209031: Not able to scoop comment table in labs for mediawiki reconstruction process.

After a chat in the #wikimedia-cloud channel on IRC, here is what we found with @Krenair:
The comment view is defined by linking to all parent tables (the tables from which comments come), enforcing existence and non-masking of the comment (see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/labs/db/views/maintain-views.yaml#291).
This means that getting the min and max ids of that table requires a full table scan with 9 joins for each row, which explains the time taken (20 s for a very small wiki with fewer than 170k revisions).

Nov 8 2018, 2:00 PM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou updated the task description for T209031: Not able to scoop comment table in labs for mediawiki reconstruction process.
Nov 8 2018, 10:06 AM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou updated the task description for T209031: Not able to scoop comment table in labs for mediawiki reconstruction process.
Nov 8 2018, 10:06 AM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics
JAllemandou created T209031: Not able to scoop comment table in labs for mediawiki reconstruction process.
Nov 8 2018, 10:00 AM · Core Platform Team Backlog (Watching / External), Analytics-Kanban, DBA, Data-Services, Analytics

Nov 6 2018

JAllemandou moved T208752: webrequest data loss 2018-11-05 on upload partition from In Code Review to Done on the Analytics-Kanban board.
Nov 6 2018, 7:15 PM · Analytics-Kanban, Patch-For-Review, Operations, Traffic, Analytics
JAllemandou added a comment to T208752: webrequest data loss 2018-11-05 on upload partition .

To keep the archive happy: the failed refinement hours were 13->19 (inclusive) on webrequest_source = 'upload', 2018-11-05.
I refined them manually on 2018-11-06 and marked them as done manually as well.

Nov 6 2018, 7:13 PM · Analytics-Kanban, Patch-For-Review, Operations, Traffic, Analytics
JAllemandou moved T208550: bothersome output in hive when querying events database from Next Up to In Code Review on the Analytics-Kanban board.
Nov 6 2018, 9:56 AM · Patch-For-Review, Analytics-Kanban, Analytics

Nov 5 2018

JAllemandou moved T164020: Use spark to split webrequest on tags from Ready to Deploy to In Code Review on the Analytics-Kanban board.
Nov 5 2018, 1:08 PM · Analytics, Patch-For-Review, Analytics-Kanban
JAllemandou moved T202490: Automate XML-to-parquet transformation for XML dumps (oozie job) from Ready to Deploy to Done on the Analytics-Kanban board.
Nov 5 2018, 10:02 AM · Patch-For-Review, Analytics-Kanban, Research, Analytics
JAllemandou moved T197888: Drop old mediawiki_history_reduced snapshots from Ready to Deploy to Done on the Analytics-Kanban board.
Nov 5 2018, 9:48 AM · Analytics, Patch-For-Review, Analytics-Kanban
JAllemandou moved T208237: Rename insertion_ts to insertion_dt in pageview_whitelist tabler (convention) from Ready to Deploy to Done on the Analytics-Kanban board.
Nov 5 2018, 9:47 AM · Patch-For-Review, Analytics-Kanban
JAllemandou moved T208294: Update alert email address in oozie mediawiki-load job from Ready to Deploy to Done on the Analytics-Kanban board.
Nov 5 2018, 9:44 AM · Analytics-Kanban
JAllemandou moved T205940: Add change tag tables to monthly mediawiki_history sqoop from Ready to Deploy to Done on the Analytics-Kanban board.
Nov 5 2018, 9:40 AM · Patch-For-Review, Analytics-Kanban, Analytics, Analytics-Cluster, Contributors-Analysis, Product-Analytics
JAllemandou added a comment to T208550: bothersome output in hive when querying events database.

The only reason I can think of for this issue to happen is that you and someone else had Hive queries writing parquet logs at the same time.
I'd love to provide better naming for the log files (embedding the username at least), but the Java logging configuration doesn't easily allow that by default.
Maybe there are other ways to do it?
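A minimal sketch of what embedding the username could look like, assuming the parquet logging goes through log4j 1.x (which expands JVM system properties such as `${user.name}` in config values); the appender name and logger package are assumptions:

```properties
# Hypothetical log4j override: route parquet logs to a per-user file.
log4j.appender.parquet=org.apache.log4j.FileAppender
log4j.appender.parquet.File=/tmp/parquet-${user.name}.log
log4j.appender.parquet.layout=org.apache.log4j.PatternLayout
log4j.appender.parquet.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
log4j.logger.org.apache.parquet=INFO, parquet
```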

Nov 5 2018, 8:54 AM · Patch-For-Review, Analytics-Kanban, Analytics

Nov 2 2018

JAllemandou moved T208377: Fix refinery-source jenkins build/release jobs from In Progress to Done on the Analytics-Kanban board.
Nov 2 2018, 6:38 PM · Patch-For-Review, Analytics-Kanban
JAllemandou added a comment to T207321: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster.

TL;DR: I think 2-3 TB per host is just enough for a start, but it might quickly become too small.
Details:
In terms of storage, Hadoop has a default replication factor of 3, giving you (actual space / 3) of usable space, so roughly 5 TB.
As @Ottomata pointed out, the dataset we want to provide is less than 1 TB (currently 770 GB, growing). We will probably keep several snapshots of it (let's assume 3/4, taking the copying time period into account), and we will need enough space on the machine used as an entry point to hold the full dataset (copy locally from dumps, then to HDFS; this might be doable with streaming).
Lastly, we also need to consider space for logs. Hadoop generates a lot of logs, and they have proven very useful when debugging. I don't know enough yet about Presto, but we should keep some space for this.
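The back-of-the-envelope arithmetic behind the numbers above (the host count is an assumption chosen to match the rough totals in the comment, not the real cluster size):

```python
# Rough HDFS capacity estimate for the cloud-analytics cluster (illustrative).
hosts = 6                 # hypothetical host count
disk_per_host_tb = 2.5    # "2-3 TB per host", taking the middle
replication = 3           # Hadoop's default HDFS replication factor

raw_tb = hosts * disk_per_host_tb
usable_tb = raw_tb / replication  # each block is stored 3 times
print(usable_tb)  # roughly 5 TB of usable HDFS space
```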

Nov 2 2018, 9:43 AM · Analytics-Kanban, netops, Operations, Analytics
JAllemandou added a comment to T208550: bothersome output in hive when querying events database.

@Nuria: This is related to the fix we deployed to try to prevent the logging lines (https://gerrit.wikimedia.org/r/c/operations/puppet/cdh/+/469499)
Can you specify which machine you were using when you got those logs?

Nov 2 2018, 9:29 AM · Patch-For-Review, Analytics-Kanban, Analytics

Oct 31 2018

JAllemandou moved T208377: Fix refinery-source jenkins build/release jobs from Next Up to In Progress on the Analytics-Kanban board.
Oct 31 2018, 8:22 AM · Patch-For-Review, Analytics-Kanban
JAllemandou claimed T208377: Fix refinery-source jenkins build/release jobs.
Oct 31 2018, 8:22 AM · Patch-For-Review, Analytics-Kanban
JAllemandou created T208377: Fix refinery-source jenkins build/release jobs.
Oct 31 2018, 8:22 AM · Patch-For-Review, Analytics-Kanban

Oct 30 2018

JAllemandou added a comment to T178832: Investigate AQS cassandra schema hash warninga.

Here are my findings:

  • The warning message happens for 2 schemas of AQS: pageviews.per.article.flat and top.pageviews. The difference between the schema definitions is in the version field: the one stored in the database is 2 (verified in Cassandra), while the one coming from the AQS definition is 3 (verified in code).
  • In this commit we changed the version number of 3 schemas: pageviews.per.article.flat, top.pageviews and pageviews.per.project.v2. The 2 schemas raising warning messages are among them, as is pageviews.per.project.v2, which doesn't raise warnings. This is due to the database containing 3 for that schema's version, which matches the AQS-defined version number.
  • Looking at the restbase-mod-table-cassandra git repo, I found that schema migrations are driven by changes in code-defined configuration (as expected, a new schema row is stored in the Cassandra meta table for the targeted keyspace), and that this new row is added only if a real schema change happens (see this function). Since we updated the version number without actually changing the schema, no DB updates were made, hence the warnings (for more than a year :( Sadness).
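The migration behaviour described above can be sketched as follows (function and field names are hypothetical; the real implementation lives in restbase-mod-table-cassandra):

```python
def strip_version(schema: dict) -> dict:
    """Return the schema without its version field."""
    return {k: v for k, v in schema.items() if k != 'version'}

def needs_migration(db_schema: dict, code_schema: dict) -> bool:
    # A migration is triggered only by a real structural change;
    # a bump of the version field alone is not enough.
    return strip_version(db_schema) != strip_version(code_schema)

# Bumping only the version leaves the stored schema untouched, so the
# stored version (2) keeps mismatching the code version (3) and a
# warning is logged on every comparison.
db = {'version': 2, 'columns': ['article', 'views']}
code = {'version': 3, 'columns': ['article', 'views']}
print(needs_migration(db, code))  # False: no DB update, warning persists
```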
Oct 30 2018, 2:28 PM · Analytics
JAllemandou moved T205940: Add change tag tables to monthly mediawiki_history sqoop from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Oct 30 2018, 10:38 AM · Patch-For-Review, Analytics-Kanban, Analytics, Analytics-Cluster, Contributors-Analysis, Product-Analytics
JAllemandou moved T208294: Update alert email address in oozie mediawiki-load job from Next Up to Ready to Deploy on the Analytics-Kanban board.
Oct 30 2018, 8:45 AM · Analytics-Kanban
JAllemandou assigned T208294: Update alert email address in oozie mediawiki-load job to Milimetric.
Oct 30 2018, 8:44 AM · Analytics-Kanban
JAllemandou created T208294: Update alert email address in oozie mediawiki-load job.
Oct 30 2018, 8:43 AM · Analytics-Kanban

Oct 29 2018

JAllemandou moved T208237: Rename insertion_ts to insertion_dt in pageview_whitelist tabler (convention) from Next Up to Ready to Deploy on the Analytics-Kanban board.
Oct 29 2018, 4:24 PM · Patch-For-Review, Analytics-Kanban
JAllemandou claimed T208237: Rename insertion_ts to insertion_dt in pageview_whitelist tabler (convention).
Oct 29 2018, 4:23 PM · Patch-For-Review, Analytics-Kanban
JAllemandou created T208237: Rename insertion_ts to insertion_dt in pageview_whitelist tabler (convention).
Oct 29 2018, 4:23 PM · Patch-For-Review, Analytics-Kanban
JAllemandou moved T164020: Use spark to split webrequest on tags from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Oct 29 2018, 3:54 PM · Analytics, Patch-For-Review, Analytics-Kanban
JAllemandou added a comment to T206894: Set up automated email to report completion of mediawiki_history snapshot and Druid loading.

If we want an email to be sent once data is available on the cluster, there is no need to create a new oozie job. I think adding a step that sends the email to the currently existing job is easier.
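Such a step could use the standard Oozie email action; a sketch (the action name, recipient, transition targets and `${snapshot}` parameter are placeholders, not the real workflow's values):

```xml
<!-- Hypothetical extra action appended to the existing workflow -->
<action name="send_success_email">
    <email xmlns="uri:oozie:email-action:0.2">
        <to>analytics-alerts@example.org</to>
        <subject>mediawiki_history snapshot ${snapshot} is available</subject>
        <body>The ${snapshot} snapshot has been loaded and is ready to query.</body>
    </email>
    <ok to="end"/>
    <error to="kill"/>
</action>
```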

Oct 29 2018, 10:51 AM · Analytics, Contributors-Analysis, Product-Analytics
JAllemandou created T208208: Allow hadoop prod jobs to preempt resource over default queue.
Oct 29 2018, 10:38 AM · Analytics-Kanban, Patch-For-Review, Analytics

Oct 26 2018

JAllemandou added a comment to T189475: Identify common abuse filters that affect translations.

It works in the sense that it is queryable, but only for small data (it times out otherwise). The one thing I have not tested is building charts out of the data.

Oct 26 2018, 4:40 PM · Language-Team (Language-2018-October-December), CX-analytics

Oct 25 2018

JAllemandou added a comment to T206279: Hive join fails when using a HiveServer2 client.

Reopening because we have an idea:

We will set hive.auto.convert.join=false in hive-server2 only (not for Hive CLI clients). This makes the setting off by default for clients that go through hive-server2 (like notebooks).
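On the server side this would be a hive-site.xml override for the HiveServer2 instance; the property name is the real Hive setting, while where the file lives depends on the puppet setup:

```xml
<!-- hive-site.xml served to HiveServer2 only: disable automatic
     conversion to map-side joins, which was causing the failures. -->
<property>
    <name>hive.auto.convert.join</name>
    <value>false</value>
</property>
```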

Oct 25 2018, 7:38 PM · Analytics-Kanban, Analytics-Cluster, Analytics, Contributors-Analysis, Product-Analytics