Wed, Oct 13
The exploratory work is complete; we are moving on to implementing the first two gaps in T293273 for Q2:
- skein-based operator that deploys a key/value store on demand using TiKV
- generic spark job airflow dag template (a rough sketch follows this list)
- configurable airflow dag to download media from swift
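For illustration only, a minimal sketch of what the generic spark job airflow dag template could look like, assuming the apache-spark Airflow provider is available; the dag id, paths and connection id are placeholders, not our actual setup.

```
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

default_args = {
    "owner": "research",
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

# a parameterised dag that submits one pyspark application to yarn
with DAG(
    dag_id="generic_spark_job",                 # placeholder name
    start_date=datetime(2021, 10, 1),
    schedule_interval="@weekly",
    default_args=default_args,
    catchup=False,
) as dag:
    SparkSubmitOperator(
        task_id="run_spark_job",
        conn_id="spark_default",                            # placeholder connection
        application="/srv/airflow/dags/jobs/my_job.py",     # placeholder job file
        application_args=["--date", "{{ ds }}"],
        conf={"spark.dynamicAllocation.maxExecutors": "32"},
    )
```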
no updates for last week
Sun, Oct 10
I updated the description with a link to the notebook. Please let us know if you run into problems.
Mon, Sep 27
- code consolidated in https://gitlab.wikimedia.org/FabianKaelin/research-ml/-/tree/fab/knowledge_gaps/research-transform/research_transform/knowledge_gaps
- promising initial comparison between the metrics generated by the spark pipeline and the existing code
- cross-team discussions on how to package/distribute python dependencies, which will enable us to submit spark jobs in cluster mode (see the sketch after this list)
- first wip ML pipeline that uses spark, bash and skein operators in a single dag
- added research-transform python package for reusable research focused code
- added wip knowledge gap code
- instructions for interactive development, either from a jupyter notebook or ipython/VS Code
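As a rough illustration of the dependency packaging discussion above (not the actual team setup; the env name and paths are made up), one common approach is to pack the conda env and ship it with the job:

```
# once, on a stat machine: pack the conda env so yarn can ship it
#   conda pack -n research -o research_env.tgz && hdfs dfs -put research_env.tgz /user/fab/
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .appName("knowledge_gaps")                                        # placeholder
    # unpack the archive on every executor under the alias "env"
    .config("spark.yarn.dist.archives", "hdfs:///user/fab/research_env.tgz#env")
    # point the executors at the python interpreter inside the archive
    .config("spark.executorEnv.PYSPARK_PYTHON", "env/bin/python")
    .getOrCreate()
)
# the same archives/conf settings can be passed to spark-submit for cluster mode
```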
Sep 21 2021
Sep 20 2021
- continued analysis, implementing end-to-end pipeline for gender gap
- deep dive with Marc scheduled
- submit spark jobs to yarn via airflow
- submit skein jobs to yarn via airflow (a minimal skein example follows this list)
- planning for how to manage airflow code at wmf, https://phabricator.wikimedia.org/T290664
- created wikimedia gitlab repo for ml related research code https://gitlab.wikimedia.org/FabianKaelin/research-ml
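For context on the skein item above, a minimal sketch of submitting a yarn application with the skein python API, which an airflow operator would wrap; the application name, resources and script are placeholders.

```
import skein

spec = skein.ApplicationSpec(
    name="research-skein-demo",                         # placeholder name
    master=skein.Master(
        resources=skein.Resources(memory="2 GiB", vcores=1),
        script="echo 'replace with the real service command'",
    ),
)

client = skein.Client()          # starts a local skein driver talking to yarn
app_id = client.submit(spec)     # returns the yarn application id
print(f"submitted {app_id}")
client.close()
```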
No weekly updates.
Sep 13 2021
Submitting a spark job from an airflow instance results in a hadoop/hdfs permission error AccessControlException: Permission denied: user=analytics-research, access=WRITE, inode="/user":hdfs:hadoop:drwxrwxr-x.
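For reference only (an assumption about the cause, not a confirmed fix): the default staging location is /user/&lt;user&gt;/.sparkStaging, so if the analytics-research user has no HDFS home directory, one possible workaround is to point the staging dir somewhere writable, e.g.:

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .appName("airflow_submitted_job")        # placeholder
    # hypothetical workaround: stage job files somewhere the submitting user can write
    .config("spark.yarn.stagingDir", "hdfs:///tmp/analytics-research")
    .getOrCreate()
)
```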
No weekly updates.
- first dags run on the research airflow instance
- wip support for deploying spark jobs
- transform into python package for library use
- wip deploy airflow dags
- continued analysis comparing to the existing pipeline, https://docs.google.com/document/d/1tdU6xHEnkmTffVcaAlHGyGWUeLWuoNlexZ8F6btCN7M/edit#
- spark end-to-end skeleton pipeline (rough sketch below)
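A bare-bones version of the spark end-to-end skeleton mentioned above, for illustration only; the source table, metric and output path are placeholders rather than the real pipeline.

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("yarn").appName("skeleton_pipeline").getOrCreate()

snapshot = "2021-08"                                            # hypothetical parameter
history = spark.table("wmf.mediawiki_history").where(F.col("snapshot") == snapshot)

# toy metric: distinct pages per wiki
metrics = history.groupBy("wiki_db").agg(F.countDistinct("page_id").alias("pages"))

metrics.write.mode("overwrite").parquet("/user/fab/knowledge_gaps/metrics")   # placeholder output
```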
Sep 1 2021
Jun 7 2021
One point of confusion that I have on a conceptual level: currently we apply what we loosely call a k-anonymity threshold, e.g. there would need to be at least k pageviews for a given tuple (time, wiki, page, country) for that tuple to be included in a public dataset. As @Nuria pointed out, we already release pageview counts without the geographical dimension, and it would be great to find a way to include e.g. the country. One intuitive concern is the risk that a bad actor could identify users who read pages on a smaller wiki from a country that has few speakers of that language - e.g. a dissident who had to flee their home country. The k-anonymity approach forces us to define what an acceptable risk is (is it 100 page views for a given page from a given country? 1000?), which is difficult and rather handwavey. However, that risk is not addressed by plausible deniability - e.g. if a bad actor is looking to identify users who have read a particular page and there were 68 views from a given country, it doesn't seem sufficient to be able to say that 68 is not the "true number" because we applied DP.
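To make the threshold concrete, a toy sketch of the filtering described above; the source table, column names and k are illustrative, not the actual pipeline.

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
k = 100   # the threshold whose value is hard to justify, as discussed above

counts = (
    spark.table("pageviews_with_country")                 # hypothetical source table
    .groupBy("hour", "wiki", "page_title", "country")     # illustrative dimensions
    .count()
)

# only tuples observed at least k times are released; everything below the
# threshold is suppressed outright, which is the k-anonymity-style guarantee
# rather than DP-style noise on the counts
releasable = counts.where(F.col("count") >= k)
```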
May 25 2021
Apologies for the delay - I haven't been running larger avro-based jobs recently, and I wasn't able to find a minimal example when I created this task. I am closing this since it is a single exception per job and didn't cause the jobs to fail.
Apr 8 2021
Update on the competition images dataset (a rough sketch of the download job follows the list):
- downloaded 300px thumbnails from swift
- total of 250GB of avro files on hdfs, /user/fab/images/competition/all/pixels/
- 6711755 images
- 32200 images couldn't be downloaded (0.48%), /user/fab/images/competition/all/swift_errors/
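For illustration, a rough sketch of how such a download job could be structured with spark; the input listing, url column and avro output are assumptions, not the actual job.

```
import requests
from pyspark.sql import SparkSession
from pyspark.sql.types import BinaryType, StringType, StructField, StructType

spark = SparkSession.builder.master("yarn").appName("thumbnail_download").getOrCreate()

schema = StructType([
    StructField("image", StringType()),
    StructField("pixels", BinaryType()),
    StructField("error", StringType()),
])

def fetch(row):
    """Download one thumbnail; keep failures so they can be inspected later."""
    try:
        resp = requests.get(row.thumbnail_url, timeout=30)
        resp.raise_for_status()
        return (row.image, bytearray(resp.content), None)
    except Exception as exc:
        return (row.image, None, str(exc))

urls = spark.read.parquet("/user/fab/images/competition/urls")   # hypothetical input listing
result = spark.createDataFrame(urls.rdd.map(fetch), schema)

# assuming the spark-avro package is on the classpath, write successes and
# errors to separate locations, mirroring the paths mentioned above
result.where("error IS NULL").write.format("avro").save("/user/fab/images/competition/all/pixels")
result.where("error IS NOT NULL").drop("pixels").write.format("avro").save("/user/fab/images/competition/all/swift_errors")
```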
Mar 31 2021
Thanks @Ottomata, I can also confirm that the certificates work now too, ie a request with verify=False now fails on the workers as well.
Mar 25 2021
Mar 24 2021
Thanks @Milimetric for the background on irc about where the special:tag info is stored; since it is in the refined revisions table and not the raw revisions table used so far, pulling that in is a bit more work. That said, just looking at the actual revisions that are missing does seem to indicate that multiple specific scenarios don't result in kafka events being created.
After going a little overboard, there are still no easy answers. I sliced and diced the data based on the query that @Milimetric provided above.
Mar 16 2021
I messed up the grafana link above (edited) and am adding it as a screenshot instead.
I picked this up again last week and ran a more substantial test job: 50 workers downloading ~1 million commons images (400px thumbnails) using a spark job. Some more questions before I run a job on the full dataset (~53M image files). Looking at the grafana dashboard,
Mar 15 2021
I don't think splitting the GPU machines from the yarn cluster is a far-fetched idea, especially given the hurdles of making this work with yarn - though I am not familiar with Alluxio. Another option is to create a kubernetes cluster that could make use of these GPUs, which would be in line with the technology stack used by other ML infra projects currently being built (ML platform, search infra). These GPUs are a good example of the gap that I perceive between the analytics infrastructure and the ongoing efforts to build ML infrastructure. I created a doc to discuss the larger question, from the perspective of the research team as a user of ML infrastructure.
I created a separate document to discuss some of the bigger questions around orchestration within analytics that arise from the very specific use case of 'kubeflow on stat machines'; any input is much appreciated. On this phab, I would like to continue discussing our short/near-term options.
Mar 2 2021
I second Isaac's comment. I reviewed the gh PR and tested it successfully.
Mar 1 2021
Another observation: I attempted to use wmfdata to avoid replicating spark session code. The wmf base conda env contains an older version, and upgrading it fails with
Feb 23 2021
Feb 11 2021
To summarize my understanding:
- for research, the html history is interesting because it expands templates and lua modules
- for a revision of page p created at time t, we prefer to store the html that a reader was served at that time (ie what WikiHist does), rather than the html using the version of the templates at some time in the future (ie by calling the mediawiki api during a batch export)
- however, if at time t+1 a template that is used by page p changed, then the reader was served a different html on wikipedia even though there is no new revision for page p. Only once there is a new revision for page p at time t+2 will the change of the template at time t+1 be reflected in the history. In fact, if page p is never edited after time t, the template change will never be reflected in the html history of the page.
Feb 10 2021
@ArielGlenn, the dataset should contain the rendered html for all revisions, rendered with the mediawiki version at the time the revision was created. The motivation for this is described in @tizianopiccardi's paper.
Feb 1 2021
Jan 22 2021
Jan 18 2021
Jan 14 2021
Thanks for the information @fgiunchedi.
Jan 13 2021
@elukey, thanks for the background and for adding my user to Hue - I was able to login.
Jan 12 2021
I am trying to access Hue, and after looking at these tasks requesting access for Hue, T271602 and T252703, I am not sure whether there is a template task for requesting access. If I understand @elukey 's comment, there is a newish way to request UI credentials only, instead of the more involved ssh access. However, I would imagine that most people requesting ssh access will also end up using the UI, so would it make sense to create the UI-based creds as part of this task as well?
Jan 5 2021
Hi, revisiting this subject! With T220081 the swift cluster is reachable from analytics; does this allow us to proceed with one or both of the options described?
Nov 20 2020
Nov 18 2020
Thanks for the explanation, it makes sense now.
Thanks for the assistance! I am trying to connect; the key seems to have propagated, but I see the following error:
$ ssh -v bast1002.eqiad.wmnet
[snip]
debug1: Next authentication method: publickey
debug1: Offering public key: /home/fab/.ssh/wmf_prod ED25519 SHA256:iIFh8ZfJOewuqKKZgStkfmPejgsYEgZC0a9FutV860M explicit agent
debug1: Server accepts key: /home/fab/.ssh/wmf_prod ED25519 SHA256:iIFh8ZfJOewuqKKZgStkfmPejgsYEgZC0a9FutV860M explicit agent
debug1: Authentication succeeded (publickey).
Authenticated to bast1002.wikimedia.org ([184.108.40.206]:22).
debug1: channel_connect_stdio_fwd bast1002.eqiad.wmnet:22
debug1: channel 0: new [stdio-forward]
debug1: getpeername failed: Bad file descriptor
debug1: Requesting firstname.lastname@example.org
debug1: Entering interactive session.
debug1: pledge: network
debug1: client_input_global_request: rtype email@example.com want_reply 0
channel 0: open failed: administratively prohibited: open failed
stdio forwarding failed
kex_exchange_identification: Connection closed by remote host
Nov 16 2020
Also, I noticed that there is a previous, outdated entry for me in that yaml file: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/admin/data/data.yaml$1733
Thanks. I did create a separate task for the analytics-privatedata-users group, which seemingly wasn't necessary. https://phabricator.wikimedia.org/T267816