Tue, Jul 19
Jun 6 2022
We went ahead and manually created a knowledge_gaps database in hive. We haven't verified it, but assuming that the analytics-research user can read from and write to this database, I will close this ticket and reopen if necessary.
May 5 2022
@diego is this issue resolved? Instances not upgraded by 2022-06-01 may be subject to deletion unless prior arrangements for an extended deadline have been approved by the Cloud VPS administration team.
Feb 16 2022
@MatthewVernon - ping regarding the question above about how to access file objects publicly
Feb 9 2022
This would affect the research team, especially if the stat machines are also included in this restriction. For example:
- pip / conda install (python packages)
- github / gitlab / gerrit (code)
- downloading/uploading datasets, e.g. figshare, zenodo, and a long tail of others
- libraries that depend on the web, e.g. when working with pre-trained models in tensorflow hub/huggingface
- APIs (mediawiki / toolforge / cloud VPS)
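Most of the use cases above assume outbound HTTP access, which on restricted hosts is typically routed through a forward proxy. A minimal sketch, assuming a standard webproxy endpoint (the host/port below is an assumption, not a confirmed value for the stat machines):

```python
import os

# Hypothetical analytics webproxy endpoint; adjust to the actual one.
PROXY = "http://webproxy.eqiad.wmnet:8080"

def proxy_env(no_proxy=("localhost", "127.0.0.1", ".wmnet")):
    """Return env vars that make pip/conda/requests route through the proxy."""
    return {
        "http_proxy": PROXY,
        "https_proxy": PROXY,
        "no_proxy": ",".join(no_proxy),  # keep internal traffic off the proxy
    }

os.environ.update(proxy_env())
# With these set, e.g. `pip install some-package` or requests.get(...)
# would go through the proxy instead of connecting directly.
```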
Feb 1 2022
Thank you @brennen for setting this up so quickly.
Jan 5 2022
Dec 6 2021
And one more question:
Dec 2 2021
At first we would like to use the swift credentials from yarn containers, both from spark- and skein-based applications. This will mostly be used for write operations via the S3 protocol.
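For the spark side, this boils down to pointing the stock Hadoop S3A client at swift's S3 API. A sketch of the expected session config, assuming the standard `fs.s3a.*` keys; the endpoint and credential values are placeholders:

```python
# Sketch: spark session settings for writing to swift via its S3 API.
# The fs.s3a.* keys are the stock Hadoop S3A ones; values are placeholders.
def s3a_conf(endpoint, access_key, secret_key):
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # swift-style buckets usually need path-style addressing
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }

# Usage sketch (inside a yarn container):
#   builder = SparkSession.builder
#   for k, v in s3a_conf("https://swift.example.wmnet", "KEY", "SECRET").items():
#       builder = builder.config(k, v)
#   df.write.parquet("s3a://some-bucket/some/path")
```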
Nov 30 2021
Thank you for the updates.
Nov 18 2021
That is interesting - this would be a stopgap until data eng upgrades to a puppetized spark3? I imagine so, since there are non-conda prod use cases of spark.
Nov 10 2021
Thanks for getting this started, no worries about the delay for review.
Nov 8 2021
That sounds good to me. Thank you!
Nov 4 2021
Thank you for the reply!
Nov 2 2021
A lot of thoughtful comments have been made, and like others I find it difficult to separate the technical options/trade-offs from the problem statement itself. There seems to be a consensus that a reliable events architecture is desirable, but not on whether we are willing to pay a price in reliability for it. Some points we could include to make the problem statement more "decidable":
- a commitment to create a core event infrastructure that is a source of truth, for which we are willing to pay a price in reliability (and engineering resources). That price is determined by many factors yet to be determined/discussed, and should be "configurable" by SRE. I.e. there will be some sort of coupling between the transactions to the MW DB and the event system; if one transaction fails, the other gets a chance to react to it. This is difficult to discuss without getting technical, but without coupling there are no events with guarantees, and we can't have coupling without a price in reliability.
- explicitly exclude, as a goal, making events the source of truth for MW itself. This seems too risky/ambitious; it could still be done incrementally in the future once events are a source of truth.
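One common shape for the kind of coupling described above is a transactional outbox: the domain write and the event record commit in the same DB transaction, and a separate relay ships outbox rows to the event system. A toy sketch under that assumption; all names here are hypothetical and this is one possible shape, not a proposal for MW internals:

```python
import json
import sqlite3

# Toy transactional-outbox sketch: both rows commit atomically, and the
# relay (which can retry) is where the reliability price gets paid.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, sent INTEGER DEFAULT 0)")

def save_page_with_event(conn, title):
    with conn:  # one transaction: both inserts commit, or neither does
        cur = conn.execute("INSERT INTO page (title) VALUES (?)", (title,))
        event = {"type": "page-create", "page_id": cur.lastrowid, "title": title}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))

def relay_once(conn, publish):
    """Ship unsent outbox rows to the event system (stubbed by `publish`)."""
    rows = conn.execute("SELECT id, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    conn.commit()

save_page_with_event(db, "Example")
published = []
relay_once(db, published.append)
```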
Nov 1 2021
Oct 26 2021
Oct 13 2021
The exploratory work is completed; we are moving to implementing the first two gaps in T293273 for Q2.
- skein based operator that deploys a key/value store on demand using tikv
- generic spark job airflow dag template
- configurable airflow dag to download media from swift
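The "generic spark job" template above essentially parameterizes a spark-submit invocation. A hedged sketch of that core, with illustrative parameter names only; in the real dag this would be wrapped in an airflow operator:

```python
# Sketch of a configurable spark-submit command builder, the core of a
# generic spark job template. Parameter names are illustrative.
def spark_submit_cmd(app, master="yarn", deploy_mode="cluster", conf=None, app_args=()):
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    return cmd + [app, *app_args]

cmd = spark_submit_cmd(
    "gender_gap.py",  # hypothetical job script
    conf={"spark.executor.memory": "8g"},
    app_args=["--snapshot", "2021-09"],
)
```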
No updates for last week.
Oct 10 2021
I updated the description with a link to the notebook. Please let us know if you run into problems.
Sep 27 2021
- code consolidated in https://gitlab.wikimedia.org/FabianKaelin/research-ml/-/tree/fab/knowledge_gaps/research-transform/research_transform/knowledge_gaps
- promising initial comparison between metrics generated by the spark pipeline vs existing code
- cross team discussions for how to package/distribute python dependencies, which will enable us to submit spark jobs in cluster mode
- first wip ML pipeline that uses spark, bash and skein operators in single dag
- added research-transform python package for reusable research focused code
- added wip knowledge gap code
- instructions for interactive development, either from a jupyter notebook or ipython/VS code
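On packaging/distributing python dependencies for cluster mode: one approach under discussion elsewhere is to pack the conda env and ship it with the job. A sketch of that idea, with assumed archive names and paths:

```python
# Sketch of the conda-pack approach for cluster-mode spark: pack the env
# (e.g. `conda pack -o env.tar.gz`), ship it with the job, and point the
# python executables at the unpacked copy on the workers. Names are
# assumptions for illustration.
def conda_env_conf(archive="env.tar.gz", alias="env"):
    return {
        # ship the packed env to yarn; '#alias' is the unpack dir name
        "spark.yarn.dist.archives": f"{archive}#{alias}",
        # driver/executor python inside the unpacked archive
        "spark.pyspark.python": f"{alias}/bin/python",
    }
```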
Sep 21 2021
Sep 20 2021
- continued analysis, implementing end-to-end pipeline for gender gap
- deep dive with Marc scheduled
- submit spark jobs to yarn via airflow
- submit skein jobs to yarn via airflow
- planning for how to manage airflow code at wmf, https://phabricator.wikimedia.org/T290664
- created wikimedia gitlab repo for ml related research code https://gitlab.wikimedia.org/FabianKaelin/research-ml
No weekly updates.
Sep 13 2021
Submitting a spark job from an airflow instance results in a hadoop/hdfs permission error: AccessControlException: Permission denied: user=analytics-research, access=WRITE, inode="/user":hdfs:hadoop:drwxrwxr-x.
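My guess is that the submitting user has no hdfs home directory, so spark falls back to creating its staging dir under /user and is denied. A sketch of the two usual fixes (paths below are assumptions):

```python
# 1) Have an admin create the hdfs home dir for the user, roughly:
#      sudo -u hdfs hdfs dfs -mkdir -p /user/analytics-research
#      sudo -u hdfs hdfs dfs -chown analytics-research /user/analytics-research
#
# 2) Or point the staging dir at a writable location when submitting,
#    via the spark.yarn.stagingDir setting:
def staging_conf(user="analytics-research", base="/tmp"):
    return {"spark.yarn.stagingDir": f"{base}/{user}/.sparkStaging"}
```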
No weekly updates.
- first dags run on the research airflow instance
- wip support for deploying spark jobs
- transform into python package for library use
- wip deploy airflow dags
- continuation of analysis to existing pipeline, https://docs.google.com/document/d/1tdU6xHEnkmTffVcaAlHGyGWUeLWuoNlexZ8F6btCN7M/edit#
- spark end-to-end skeleton pipeline
Sep 1 2021
Jun 7 2021
One point of confusion that I have on a conceptual level: currently we apply what we loosely call a k-anonymity threshold, e.g. there would need to be at least k pageviews for a given tuple (time, wiki, page, country) for that tuple to be included in a public dataset. As @Nuria pointed out, we already release pageview counts without the geographical dimension, and it would be great to find a way to include e.g. the country. One intuitive concern is the risk that a bad actor can identify users who read pages on a smaller wiki from a country that has few speakers of that language - e.g. a dissident who had to flee their home country. The k-anonymity approach forces us to define what an acceptable risk is (is it 100 page views for a given page from a given country? 1000?), which is difficult and rather handwavey. However, that risk is not addressed by plausible deniability - e.g. if a bad actor is looking to identify users who have read a particular page and there were 68 views from a given country, it doesn't seem sufficient to be able to say that 68 is not the "true number" because we applied DP.
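For concreteness, the thresholding described above amounts to releasing only tuples whose count reaches k. A minimal sketch with made-up data and a made-up k:

```python
from collections import Counter

# Minimal sketch of the k-anonymity style threshold: a
# (time, wiki, page, country) tuple is released only if it has at least
# k pageviews. Data and k are made up for illustration.
def release(pageviews, k):
    counts = Counter(pageviews)
    return {t: n for t, n in counts.items() if n >= k}

views = [("2021-06", "aswiki", "PageX", "CountryA")] * 68 \
      + [("2021-06", "enwiki", "PageY", "CountryB")] * 150
released = release(views, 100)  # the 68-view tuple is suppressed
```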
May 25 2021
Apologies for the delay - I haven't been running larger avro-based jobs recently, and I wasn't able to find a minimal example when I created this task. I am closing this since it is a single exception per job which didn't cause the jobs to fail.
Apr 8 2021
Update on the competition images dataset:
- downloaded 300px thumbnails from swift
- total 250GB of avro files on hdfs /user/fab/images/competition/all/pixels/
- 6,711,755 images
- 32,200 images couldn't be downloaded (0.48%), /user/fab/images/competition/all/swift_errors/
Mar 31 2021
Thanks @Ottomata, I can also confirm that the certificates work now too, i.e. a request with verify=False now fails on the workers as well.
Mar 25 2021
Mar 24 2021
Thanks @Milimetric for the background on irc about where the special:tag info is stored. Though since it is in the refined revisions table and not the raw revisions table used so far, pulling that in is a bit more work. That said, just looking at the actual revisions that are missing does seem to indicate that multiple specific scenarios don't result in kafka events being created.
After going a little overboard, there are still no easy answers. I did slice and dice the data based on the query that @Milimetric provided above.
Mar 16 2021
I messed up the grafana link above (edited) and am adding it as a screenshot.
I picked this up again last week, and ran a more substantial test job using 50 workers to download ~1 million commons images (400px thumbnails) with a spark job. Some more questions before I run a job on the full dataset (~53M image files), looking at the grafana dashboard:
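The shape of such a download job: map over image names, fetch each thumbnail, and record failures instead of failing the task (matching the swift_errors output mentioned for the competition dataset). A sketch with an injected fetcher so the same logic works against swift, plain http, or a stub; all names are illustrative:

```python
# Sketch of a bulk thumbnail download step: collect successes and record
# per-image errors rather than failing the whole job. `fetch` is injected
# (swift client, http, or a stub); names are illustrative.
def download_all(image_names, fetch):
    ok, errors = [], []
    for name in image_names:
        try:
            ok.append((name, fetch(name)))  # e.g. bytes of the 400px thumbnail
        except Exception as e:
            errors.append((name, str(e)))   # kept for a swift_errors-style output
    return ok, errors

# In a spark job this would run per partition, roughly:
#   rdd.mapPartitions(lambda part: download_all(part, swift_fetch)[0])
```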
Mar 15 2021
I don't think splitting the GPU machines from the yarn cluster is a far-fetched idea, especially given the hurdles of making this work with yarn - though I am not familiar with Alluxio. Another option is to create a kubernetes cluster that could make use of these GPUs, which would be in line with the technology stack used by other ML infra projects currently being built (ML platform, search infra). These GPUs are a good example of the gap that I perceive between the analytics infrastructure and the ongoing efforts to build ML infrastructure. I created a doc to discuss the larger question, from the perspective of the research team as a user of ML infrastructure.
I created a separate document to discuss some of the bigger questions around orchestration within analytics that arise from discussing the very specific use case of 'kubeflow on stat machines', any input is much appreciated. On this phab, I would like to continue discussing our short/near term options.
Mar 2 2021
I second Isaac's comment. I reviewed the gh PR and tested it successfully.
Mar 1 2021
Another observation: I attempted to use wmfdata to avoid replicating spark session code. The wmf base conda env contains an older version, and upgrading it fails with:
Feb 23 2021
Feb 11 2021
To summarize my understanding:
- for research, the html history is interesting because it expands templates and lua modules
- for a revision of page p created at time t, we prefer to store the html that a reader was served at that time (i.e. what WikiHist does), rather than the html using the version of the templates at some time in the future (i.e. by calling the mediawiki api during a batch export)
- however, if at time t+1 a template that is used by page p changes, then readers are served different html on wikipedia but there is no new revision for page p. Only once there is a new revision for page p at time t+2 will the change to the template at time t+1 be reflected in the history. In fact, if page p is not edited after time t, the template change will never be reflected in the html history of the page.
Feb 10 2021
@ArielGlenn, the dataset should contain the rendered html for all revisions, rendered with the mediawiki version at the time the revision was created. The motivation for this is described in @tizianopiccardi's paper.
Feb 1 2021
Jan 22 2021
Jan 18 2021
Jan 14 2021
Thanks for the information @fgiunchedi.
Jan 13 2021
@elukey, thanks for the background and for adding my user to Hue - I was able to login.
Jan 12 2021
I am trying to access Hue, and after looking at these tasks requesting access for Hue, T271602 and T252703, I am not sure if there is a template task for requesting access. If I understand @elukey's comment, there is a new'ish way to request UI credentials only, instead of the more involved ssh access. However, I would imagine that most people requesting ssh access will also end up using the UI, so would it make sense to create the UI-based creds as part of this task as well?
Jan 5 2021
Hi, revisiting this subject! With T220081 the swift cluster is reachable from analytics, does this allow us to proceed with one or both of the options described?
Nov 20 2020
Nov 18 2020
Thanks for the explanation, it makes sense now.
Thanks for the assistance! I am trying to connect, and the key seems to have propagated, but I see the following error:
$ ssh -v bast1002.eqiad.wmnet
[snip]
debug1: Next authentication method: publickey
debug1: Offering public key: /home/fab/.ssh/wmf_prod ED25519 SHA256:iIFh8ZfJOewuqKKZgStkfmPejgsYEgZC0a9FutV860M explicit agent
debug1: Server accepts key: /home/fab/.ssh/wmf_prod ED25519 SHA256:iIFh8ZfJOewuqKKZgStkfmPejgsYEgZC0a9FutV860M explicit agent
debug1: Authentication succeeded (publickey).
Authenticated to bast1002.wikimedia.org ([184.108.40.206]:22).
debug1: channel_connect_stdio_fwd bast1002.eqiad.wmnet:22
debug1: channel 0: new [stdio-forward]
debug1: getpeername failed: Bad file descriptor
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: pledge: network
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
channel 0: open failed: administratively prohibited: open failed
stdio forwarding failed
kex_exchange_identification: Connection closed by remote host
Nov 16 2020
Also, I noticed that there is a previous, outdated entry for me in that yaml file. https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/admin/data/data.yaml$1733
Thanks. I did create a separate task for the analytics-privatedata-users group, which seemingly wasn't necessary. https://phabricator.wikimedia.org/T267816