
fkaelin
User

User Details

User Since
Nov 12 2020, 6:16 PM (49 w, 4 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
FKaelin (WMF) [ Global Accounts ]

Recent Activity

Wed, Oct 13

fkaelin closed T290170: Analysis for productionizing content gap metrics as Resolved.
Wed, Oct 13, 7:03 PM · Research (FY2021-22-Research-July-Sept)
fkaelin added a comment to T290170: Analysis for productionizing content gap metrics.

The exploratory work is completed; we are moving to implementing the first two gaps in T293273 for Q2.

Wed, Oct 13, 3:59 PM · Research (FY2021-22-Research-July-Sept)
fkaelin created T293273: Knowledge gap end-to-end pipelines.
Wed, Oct 13, 3:56 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin moved T290173: Orchestration of end-to-end machine learning workloads from FY2021-22-Research-July-Sept to FY2021-22-Research-Oct-Dec on the Research board.
Wed, Oct 13, 3:54 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin added a comment to T290173: Orchestration of end-to-end machine learning workloads.

update

  • skein based operator that deploys a key/value store on demand using tikv
Wed, Oct 13, 3:53 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin moved T290172: Wikimedia research code repository from FY2021-22-Research-July-Sept to FY2021-22-Research-Oct-Dec on the Research board.
Wed, Oct 13, 3:52 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin added a comment to T290172: Wikimedia research code repository.

updates:

  • generic spark job airflow dag template
  • configurable airflow dag to download media from swift
Wed, Oct 13, 3:52 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin moved T290171: Scheduling production pipelines from FY2021-22-Research-July-Sept to FY2021-22-Research-Oct-Dec on the Research board.
Wed, Oct 13, 3:52 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin added a comment to T290171: Scheduling production pipelines.

no updates for last week

Wed, Oct 13, 3:51 PM · Research (FY2021-22-Research-Oct-Dec)

Sun, Oct 10

fkaelin updated the task description for T291453: Outreachy Application Task: Develop an Image Similarity API.
Sun, Oct 10, 6:54 PM · Outreachy (Round 23)
fkaelin added a comment to T291453: Outreachy Application Task: Develop an Image Similarity API.

I updated the description with a link to the notebook. Please let us know if you run into problems.

Sun, Oct 10, 6:54 PM · Outreachy (Round 23)

Mon, Sep 27

fkaelin added a comment to T290170: Analysis for productionizing content gap metrics.
Mon, Sep 27, 2:44 PM · Research (FY2021-22-Research-July-Sept)
fkaelin added a comment to T290171: Scheduling production pipelines.
  • cross team discussions for how to package/distribute python dependencies, which will enable us to submit spark jobs in cluster mode
Mon, Sep 27, 2:42 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin added a comment to T290173: Orchestration of end-to-end machine learning workloads.
  • first wip ML pipeline that uses spark, bash and skein operators in single dag
Mon, Sep 27, 2:40 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin added a comment to T290172: Wikimedia research code repository.
  • added research-transform python package for reusable research focused code
  • added wip knowledge gap code
  • instructions for interactive development, either from jupyter notebook or ipython/VS code
Mon, Sep 27, 2:40 PM · Research (FY2021-22-Research-Oct-Dec)

Sep 21 2021

fkaelin updated the task description for T291071: Develop an Image Similarity Tool.
Sep 21 2021, 3:02 AM · Outreachy (Round 23), Outreach-Programs-Projects
fkaelin updated the task description for T291453: Outreachy Application Task: Develop an Image Similarity API.
Sep 21 2021, 2:59 AM · Outreachy (Round 23)
fkaelin created T291453: Outreachy Application Task: Develop an Image Similarity API.
Sep 21 2021, 2:58 AM · Outreachy (Round 23)

Sep 20 2021

fkaelin added a comment to T290170: Analysis for productionizing content gap metrics.

weekly updates:

  • continued analysis, implementing end-to-end pipeline for gender gap
  • deep dive with Marc scheduled
Sep 20 2021, 1:39 PM · Research (FY2021-22-Research-July-Sept)
fkaelin added a comment to T290171: Scheduling production pipelines.

updates

Sep 20 2021, 1:38 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin added a comment to T290172: Wikimedia research code repository.

updates:

Sep 20 2021, 1:38 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin added a comment to T290173: Orchestration of end-to-end machine learning workloads.

No weekly updates.

Sep 20 2021, 1:38 PM · Research (FY2021-22-Research-Oct-Dec)

Sep 13 2021

fkaelin added a comment to T284225: Create airflow instances for Platform Engineering and Research.

Submitting a spark job from an airflow instance results in a hadoop/hdfs permission error AccessControlException: Permission denied: user=analytics-research, access=WRITE, inode="/user":hdfs:hadoop:drwxrwxr-x.
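To read the error above: the inode string `"/user":hdfs:hadoop:drwxrwxr-x` gives the owner, group and mode of `/user`. The following is a minimal sketch of the POSIX-style check that HDFS applies, not the actual NameNode code. The function name `is_allowed` is hypothetical, and it is an assumption (consistent with the error, but not stated in it) that `analytics-research` is not a member of the `hadoop` group.

```python
from typing import Iterable


def is_allowed(user: str, user_groups: Iterable[str],
               owner: str, group: str, mode: str, access: str) -> bool:
    """POSIX-style check: owner bits, else group bits, else other bits."""
    perms = mode[1:]  # drop the leading file-type character ('d' for a directory)
    if user == owner:
        bits = perms[0:3]
    elif group in set(user_groups):
        bits = perms[3:6]
    else:
        bits = perms[6:9]
    flag = {"READ": "r", "WRITE": "w", "EXECUTE": "x"}[access]
    return flag in bits


# Values taken from the error message; the empty group list is an assumption.
denied = is_allowed("analytics-research", [], "hdfs", "hadoop", "drwxrwxr-x", "WRITE")
# Hypothetical comparison: the same user as a member of the hadoop group.
granted = is_allowed("analytics-research", ["hadoop"], "hdfs", "hadoop", "drwxrwxr-x", "WRITE")
```

Under that assumption the user falls into the "other" class (`r-x`), so a write directly under `/user` is refused.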

Sep 13 2021, 7:42 PM · Patch-For-Review, Analytics-Kanban, Research, Platform Engineering, Analytics
fkaelin added a comment to T290173: Orchestration of end-to-end machine learning workloads.

No weekly updates.

Sep 13 2021, 2:31 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin added a comment to T290171: Scheduling production pipelines.

Weekly update:

  • first dags run on the research airflow instance
  • wip support for deploying spark jobs
Sep 13 2021, 2:31 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin added a comment to T290172: Wikimedia research code repository.

Weekly update:

  • transform into python package for library use
  • wip deploy airflow dags
Sep 13 2021, 2:31 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin added a comment to T290170: Analysis for productionizing content gap metrics.

Weekly update:

Sep 13 2021, 2:31 PM · Research (FY2021-22-Research-July-Sept)

Sep 1 2021

fkaelin updated the task description for T290171: Scheduling production pipelines.
Sep 1 2021, 3:56 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin created T290173: Orchestration of end-to-end machine learning workloads.
Sep 1 2021, 3:55 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin created T290172: Wikimedia research code repository.
Sep 1 2021, 3:53 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin created T290171: Scheduling production pipelines.
Sep 1 2021, 3:52 PM · Research (FY2021-22-Research-Oct-Dec)
fkaelin created T290170: Analysis for productionizing content gap metrics.
Sep 1 2021, 3:51 PM · Research (FY2021-22-Research-July-Sept)

Jun 7 2021

fkaelin added a comment to T280385: Apache Beam go prototype code for DP evaluation.

One point of confusion that I have on a conceptual level: currently we apply what we loosely call a k-anonymity threshold, e.g. there would need to be at least k pageviews for a given tuple (time, wiki, page, country) for that tuple to be included in a public dataset. As @Nuria pointed out, we already release pageview counts without geographical dimension, and it would be great to find a way to include e.g. the country. One intuitive concern is the risk that a bad actor can identify users who read pages on a smaller wiki from a country that has few speakers of that language - e.g. a dissident who had to flee their home country. The k-anonymity approach forces us to define what an acceptable risk is (is it 100 page views for a given page from a given country? 1000?), which is difficult and rather hand-wavy. However, that risk is not based on plausible deniability - e.g. if a bad actor is looking to identify users who have read a particular page and there were 68 views from a given country, it doesn't seem sufficient to be able to say that 68 is not the "true number" because we applied DP.
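The threshold described above can be sketched as follows. This is an illustration only, not the production pipeline: the names `PageviewKey` and `apply_threshold`, the sample wiki and country codes, and the choice of k=100 are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class PageviewKey(NamedTuple):
    """The (time, wiki, page, country) tuple from the comment above."""
    time: str
    wiki: str
    page: str
    country: str


def apply_threshold(views: Iterable[PageviewKey], k: int) -> Dict[PageviewKey, int]:
    """Count views per tuple and keep only tuples with at least k views."""
    counts = Counter(views)
    return {key: n for key, n in counts.items() if n >= k}


# Hypothetical sample data: one tuple above the threshold, one below it.
large = PageviewKey("2021-06-07T16", "en.wikipedia", "Page_A", "US")
small = PageviewKey("2021-06-07T16", "small.wikipedia", "Page_B", "XX")
views = [large] * 120 + [small] * 68

released = apply_threshold(views, k=100)
```

With k=100 the tuple with 68 views is dropped from the release entirely, which is the behaviour the threshold is meant to provide and which a noisy count alone does not.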

Jun 7 2021, 4:02 PM · Analytics, Research, Privacy Engineering, Privacy, Data-release

May 25 2021

fkaelin closed T278451: NullPointerException at beginning of spark job as Resolved.

Apologies for the delay - I haven't been running larger avro based jobs recently, and I wasn't able to find a minimal example when I created this task. I am closing this since it is a single exception per job which didn't cause them to fail.

May 25 2021, 2:14 PM · Analytics

Apr 8 2021

fkaelin added a comment to T278217: Release image data for training.

Update on the competition images dataset:

  • downloaded 300px thumbnails from swift
  • total 250gb of avro files on hdfs /user/fab/images/competition/all/pixels/
  • 6711755 images
  • 32200 images couldn't be downloaded (0.48%), /user/fab/images/competition/all/swift_errors/
Apr 8 2021, 3:20 AM · Research (FY2020-21-Research-April-June)

Mar 31 2021

fkaelin added a comment to T272313: Newpytyer python spark kernels.

Thanks @Ottomata, I can also confirm that the certificates work now too, ie a request with verify=False now fails on the workers as well.

Mar 31 2021, 7:37 PM · Patch-For-Review, Analytics

Mar 25 2021

fkaelin updated the task description for T278441: Memory errors in Spark.
Mar 25 2021, 5:51 PM · Analytics-Kanban, Analytics
fkaelin created T278451: NullPointerException at beginning of spark job.
Mar 25 2021, 4:07 PM · Analytics
fkaelin created T278441: Memory errors in Spark.
Mar 25 2021, 3:08 PM · Analytics-Kanban, Analytics

Mar 24 2021

fkaelin added a comment to T215001: Revisions missing from mediawiki_revision_create.

Thanks for the background on where the special:tag info is stored @Milimetric on irc, though since it is in the refined revisions and not the raw revisions table used so far, pulling that in is a bit more work. That said, just looking at the actual revisions that are missing does seem to indicate that multiple specific scenarios don't result in kafka events being created.

Mar 24 2021, 4:41 PM · Patch-For-Review, MW-1.37-notes (1.37.0-wmf.5; 2021-05-11), Event-Platform, Growth-Team-Filtering, Analytics-Kanban, Growth-Team, Product-Analytics, Analytics
fkaelin added a comment to T215001: Revisions missing from mediawiki_revision_create.

After going a little overboard still no easy answers. I did slice and dice the data based on the query that @Milimetric provided above.

Mar 24 2021, 4:53 AM · Patch-For-Review, MW-1.37-notes (1.37.0-wmf.5; 2021-05-11), Event-Platform, Growth-Team-Filtering, Analytics-Kanban, Growth-Team, Product-Analytics, Analytics

Mar 16 2021

fkaelin added a comment to T184744: Improve access to Commons image data for research and development.

I picked this up again last week, and ran a more substantial test job using 50 workers downloading ~1 million commons images (400px thumbnails) using a spark job. Some more questions before I run a job on the full datasets (~53M image files). Looking at the grafana dashboard,

  • what explains the increase in put 201 in the object state-changing graph? cache misses for the thumbnails that get filled?

Possible but hard to say from that graph, when did the job start/finish ? I'm assuming ~23:30 to ~1:40 but best to confirm
Something else to check for thumbnailing activity is the Thumbor dashboard (for the same timeframe):
https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&from=1615330800000&to=1615341600000

Mar 16 2021, 7:00 PM · User-ArielGlenn
fkaelin added a comment to T184744: Improve access to Commons image data for research and development.

messed up the link to the grafana above (edited) and adding it as a screenshot

swift_grafana.png (1×3 px, 534 KB)

Mar 16 2021, 4:37 PM · User-ArielGlenn
fkaelin added a comment to T184744: Improve access to Commons image data for research and development.

I picked this up again last week, and ran a more substantial test job using 50 workers downloading ~1 million commons images (400px thumbnails) using a spark job. Some more questions before I run a job on the full datasets (~53M image files). Looking at the grafana dashboard,

Mar 16 2021, 4:33 PM · User-ArielGlenn

Mar 15 2021

fkaelin added a comment to T276791: Configure the Hadoop cluster to use the GPUs available on some workers.

I don't think splitting the GPU machines from the yarn cluster is a far-fetched idea, especially given the hurdles of making this work with yarn - though I am not familiar with Alluxio. Another option is to create a kubernetes cluster that could make use of these GPUs, which would be in line with the technology stack used by other ML infra projects currently being built (ML platform, search infra). These GPUs are a good example of the gap that I perceive with regard to the analytics infrastructure and ongoing efforts to build ML infrastructure. I created a doc to discuss the larger question, from the perspective of the research team as a user of ML infrastructure.

Mar 15 2021, 3:37 PM · Analytics, Machine-Learning-Team
fkaelin added a comment to T275551: Kubeflow on stat machines.

I created a separate document to discuss some of the bigger questions around orchestration within analytics that arise from discussing the very specific use case of 'kubeflow on stat machines'; any input is much appreciated. On this phab, I would like to continue discussing our short/near term options.

Mar 15 2021, 2:12 PM · Analytics-Radar, Machine-Learning-Team, SRE

Mar 2 2021

fkaelin added a comment to T272313: Newpytyer python spark kernels.

I second Isaac's comment. I reviewed the gh PR and tested successfully.

Mar 2 2021, 4:49 PM · Patch-For-Review, Analytics

Mar 1 2021

fkaelin added a comment to T224658: Newpyter - SWAP Juypter Rewrite.

Another observation: I attempted to use wmfdata to avoid replicating spark session code. The wmf base conda env contains an older version, and upgrading it fails with

Mar 1 2021, 5:51 PM · Analytics-Kanban, Patch-For-Review, Analytics

Feb 23 2021

fkaelin created T275551: Kubeflow on stat machines.
Feb 23 2021, 7:55 PM · Analytics-Radar, Machine-Learning-Team, SRE

Feb 11 2021

fkaelin added a comment to T182351: Make HTML dumps available.

To summarize my understanding:

  • for research, the html history is interesting because it expands templates and lua modules
  • for a revision of page p created at time t, we prefer to store the html that a reader was served at that time (ie what WikiHist does), rather than the html using the version of the templates at some time in the future (ie by calling the mediawiki api during a batch export)
  • however, if at time t+1 a template that is used by page p changed, then the reader was served a different html on wikipedia but there is not any revision for page p. Only once there is a new revision for page p at time t+2 will the change of the template at time t+1 be reflected in the history. In fact, if page p is not edited after time t, the template change will never be reflected in the html history of the page.
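
The timeline in the last point can be made concrete with a toy replay. This is a sketch under simplifying assumptions (one page, one template, integer timestamps, a hypothetical `replay` helper), not a model of how MediaWiki or WikiHist store data.

```python
from typing import List, Tuple

Event = Tuple[int, str]   # (time, "page_edit" | "template_edit")
State = Tuple[int, int]   # (time of current page revision, template version)


def replay(events: List[Event]) -> Tuple[List[State], List[State]]:
    """Return (states served to readers, states stored in the html history).

    HTML is stored only when page p gets a new revision, while a reader is
    served new HTML whenever either the page or the template changes.
    """
    template_version = 0
    page_rev = None
    served: List[State] = []
    stored: List[State] = []
    for time, kind in events:
        if kind == "template_edit":
            template_version += 1
        elif kind == "page_edit":
            page_rev = time
            stored.append((page_rev, template_version))
        if page_rev is not None:
            served.append((page_rev, template_version))
    return served, stored


# Page revision at t=0, template change at t+1, next page revision at t+2.
served, stored = replay([(0, "page_edit"), (1, "template_edit"), (2, "page_edit")])
missing = [s for s in served if s not in stored]

# Same timeline, but the page is never edited again after t=0.
served_no_edit, stored_no_edit = replay([(0, "page_edit"), (1, "template_edit")])
```

Here `missing` holds the state readers saw between t+1 and t+2 (the old page revision rendered with the new template), which never appears in the stored history; in the second timeline the template change is not stored at all.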
Feb 11 2021, 9:54 PM · Research, Analytics-Radar, Datasets-Archiving

Feb 10 2021

fkaelin added a comment to T182351: Make HTML dumps available.

@ArielGlenn, the dataset should contain the rendered html for all revisions, rendered with the mediawiki version at the time the revision was created. The motivation for this is described in @tizianopiccardi's paper.

Feb 10 2021, 11:54 PM · Research, Analytics-Radar, Datasets-Archiving
fkaelin claimed T182351: Make HTML dumps available.
Feb 10 2021, 7:34 PM · Research, Analytics-Radar, Datasets-Archiving

Feb 1 2021

fkaelin updated subscribers of T272973: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance.
Feb 1 2021, 1:42 PM · Patch-For-Review, Analytics-Kanban, Analytics

Jan 22 2021

fkaelin added a comment to T184744: Improve access to Commons image data for research and development.

Thanks for the pointers @fgiunchedi and @Miriam. This approach works well, including scaling the download on spark.

Jan 22 2021, 3:11 PM · User-ArielGlenn

Jan 18 2021

fkaelin created T272313: Newpytyer python spark kernels.
Jan 18 2021, 4:52 PM · Patch-For-Review, Analytics

Jan 14 2021

fkaelin added a comment to T184744: Improve access to Commons image data for research and development.

Thanks for the information @fgiunchedi.

Jan 14 2021, 6:22 AM · User-ArielGlenn

Jan 13 2021

fkaelin added a comment to T267817: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin.

@elukey, thanks for the background and for adding my user to Hue - I was able to log in.

Jan 13 2021, 2:05 PM · SRE, SRE-Access-Requests

Jan 12 2021

fkaelin updated subscribers of T267817: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin.

I am trying to access Hue, and after looking at these tasks requesting access for Hue T271602 and T252703 I am not sure if there is a template task for requesting access? If I understand @elukey's comment, there is a newish way to request UI credentials only, instead of the more involved ssh access. However, I would imagine that most people requesting ssh access will also end up using the UI, so would it make sense to create the UI-based creds as part of this task as well?

Jan 12 2021, 10:07 PM · SRE, SRE-Access-Requests

Jan 5 2021

fkaelin added a comment to T184744: Improve access to Commons image data for research and development.

Hi, revisiting this subject! With T220081 the swift cluster is reachable from analytics, does this allow us to proceed with one or both of the options described?

Jan 5 2021, 9:32 PM · User-ArielGlenn

Nov 20 2020

fkaelin created T268365: Kerberos identity for fkaelin.
Nov 20 2020, 6:45 PM · Analytics

Nov 18 2020

fkaelin closed T267817: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin as Resolved.

Thanks for the explanation, it makes sense now.

Nov 18 2020, 8:57 PM · SRE, SRE-Access-Requests
fkaelin reopened T267817: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin as "Open".

Thanks for the assistance! I am trying to connect, and the key seems to have propagated, but I see the following error:

$ ssh -v bast1002.eqiad.wmnet
[snip]
debug1: Next authentication method: publickey
debug1: Offering public key: /home/fab/.ssh/wmf_prod ED25519 SHA256:iIFh8ZfJOewuqKKZgStkfmPejgsYEgZC0a9FutV860M explicit agent
debug1: Server accepts key: /home/fab/.ssh/wmf_prod ED25519 SHA256:iIFh8ZfJOewuqKKZgStkfmPejgsYEgZC0a9FutV860M explicit agent
debug1: Authentication succeeded (publickey).
Authenticated to bast1002.wikimedia.org ([208.80.154.86]:22).
debug1: channel_connect_stdio_fwd bast1002.eqiad.wmnet:22
debug1: channel 0: new [stdio-forward]
debug1: getpeername failed: Bad file descriptor
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: pledge: network
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
channel 0: open failed: administratively prohibited: open failed
stdio forwarding failed
kex_exchange_identification: Connection closed by remote host
Nov 18 2020, 6:25 PM · SRE, SRE-Access-Requests

Nov 16 2020

fkaelin added a comment to T267817: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin.

Also, I noticed that there is a previous, outdated entry for me in that yaml file. https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/admin/data/data.yaml$1733

Nov 16 2020, 2:25 PM · SRE, SRE-Access-Requests
fkaelin added a comment to T267817: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin.

Thanks. I did create a separate task for the analytics-privatedata-users group, which seemingly wasn't necessary. https://phabricator.wikimedia.org/T267816

Nov 16 2020, 2:17 PM · SRE, SRE-Access-Requests

Nov 13 2020

fkaelin updated subscribers of T267817: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin.

@Ottomata and @leila, I think your approvals are needed for this review, thank you!

Nov 13 2020, 1:46 AM · SRE, SRE-Access-Requests
fkaelin updated subscribers of T267816: Requesting access to analytics-privatedata-users for fkaelin.

@Ottomata and @leila, I think your approvals are needed for this review, thank you!

Nov 13 2020, 1:46 AM · SRE, SRE-Access-Requests
fkaelin created T267817: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin.
Nov 13 2020, 1:35 AM · SRE, SRE-Access-Requests
fkaelin created T267816: Requesting access to analytics-privatedata-users for fkaelin.
Nov 13 2020, 1:35 AM · SRE, SRE-Access-Requests