Thu, Apr 8
Update on the competition images dataset:
- downloaded 300px thumbnails from swift
- total of 250 GB of Avro files on HDFS at /user/fab/images/competition/all/pixels/
- 6,711,755 images
- 32,200 images couldn't be downloaded (0.48%); errors recorded at /user/fab/images/competition/all/swift_errors/
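For reference, the 0.48% figure can be sanity-checked in a couple of lines. The counts are taken from the bullets above; treating the percentage as failures over total attempted downloads is my assumption:

```python
# Sanity check of the 0.48% failure rate quoted above.
# Assumption: the percentage is failures over total attempted downloads.
downloaded = 6_711_755
failed = 32_200
attempted = downloaded + failed

failure_rate = failed / attempted
print(f"{failure_rate:.2%}")  # rounds to 0.48%
```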
Wed, Mar 31
Thanks @Ottomata, I can also confirm that the certificates work now too, i.e. a request with verify=False now fails on the workers as well.
Thu, Mar 25
Wed, Mar 24
Thanks for the background on where the special:tag info is stored, @Milimetric (via IRC). However, since it lives in the refined revisions table and not in the raw revisions table used so far, pulling it in is a bit more work. That said, just looking at the actual revisions that are missing does seem to indicate that multiple specific scenarios don't result in Kafka events being created.
After going a little overboard, I still have no easy answers. I sliced and diced the data based on the query that @Milimetric provided above.
Tue, Mar 16
I messed up the link to the Grafana dashboard above (edited), so I am adding it as a screenshot instead.
I picked this up again last week and ran a more substantial test job: 50 workers downloading ~1 million Commons images (400px thumbnails) via a Spark job. Some more questions before I run a job on the full dataset (~53M image files). Looking at the Grafana dashboard,
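For context, the per-partition download step of such a Spark job can be sketched roughly as below (e.g. what would be applied via `rdd.mapPartitions`). This is only an illustration: `thumb_url`, `download_partition`, the thread-pool size, and the Special:FilePath URL scheme are my own choices, not the actual job's code.

```python
# Rough sketch of the per-partition download step such a Spark job might run.
# All names here are illustrative, not the actual job's code.
from concurrent.futures import ThreadPoolExecutor

THUMB_WIDTH = 400  # px, matching the test job described above

def thumb_url(title, width=THUMB_WIDTH):
    # One way to request a resized thumbnail from Commons.
    return (f"https://commons.wikimedia.org/wiki/Special:FilePath/"
            f"{title}?width={width}")

def download_partition(titles, fetch):
    # `fetch` is injected (e.g. a requests.get wrapper that returns bytes or
    # an error marker) so per-image failures don't kill the whole partition.
    urls = [thumb_url(t) for t in titles]
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(zip(titles, pool.map(fetch, urls)))
```

Injecting `fetch` also makes it easy to record failed images (like the swift_errors output above) instead of aborting the partition on the first error.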
Mon, Mar 15
I don't think splitting the GPU machines from the yarn cluster is a far-fetched idea, especially given the hurdles of making this work with yarn - though I am not familiar with Alluxio. Another option is to create a kubernetes cluster that could make use of these GPUs, which would be in line with the technology stack used by other ML infra projects currently being built (ML platform, search infra). These GPUs are a good example of the gap that I perceive between the analytics infrastructure and the ongoing efforts to build ML infrastructure. I created a doc to discuss the larger question from the perspective of the research team as a user of ML infrastructure.
I created a separate document to discuss some of the bigger questions around orchestration within analytics that arise from discussing the very specific use case of 'kubeflow on stat machines', any input is much appreciated. On this phab, I would like to continue discussing our short/near term options.
Mar 2 2021
I second Isaac's comment. I reviewed the GitHub PR and tested it successfully.
Mar 1 2021
Another observation: I attempted to use wmfdata to avoid replicating the Spark session setup code. The wmf base conda env contains an older version, and upgrading it fails with
Feb 23 2021
Feb 11 2021
To summarize my understanding:
- for research, the html history is interesting because it expands templates and lua modules
- for a revision of page p created at time t, we prefer to store the html that a reader was served at that time (i.e. what WikiHist does), rather than the html using the version of the templates at some time in the future (i.e. as produced by calling the mediawiki api during a batch export)
- however, if at time t+1 a template used by page p changed, then readers were served different html on wikipedia, but there is no new revision for page p. Only once there is a new revision for page p at time t+2 will the change of the template at time t+1 be reflected in the history. In fact, if page p is never edited after time t, the template change will never be reflected in the html history of the page.
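The last point can be illustrated with a toy model (all names here are mine, purely for illustration): the stored html is keyed to page-revision timestamps, so a template edit that falls between two page revisions, or after the last one, never shows up in the stored history.

```python
# Toy illustration: stored html is keyed to page-revision timestamps, so
# template edits between (or after) page revisions are invisible.
def template_at(template_history, t):
    """Return the template text in effect at time t.

    template_history: list of (timestamp, text) pairs, sorted by timestamp.
    """
    current = None
    for ts, text in template_history:
        if ts <= t:
            current = text
    return current

def stored_html(page_revisions, template_history):
    # One stored rendering per page revision, using the template state
    # at that revision's timestamp.
    return {t: f"{text} [{template_at(template_history, t)}]"
            for t, text in page_revisions}
```

With page revisions at t=1 and t=2 and a template edit at t=1.5, the edit only appears in the t=2 rendering; if the page were never edited after t=1, it would never appear at all.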
Feb 10 2021
@ArielGlenn, the dataset should contain the rendered html for all revisions, rendered with the mediawiki version at the time the revision was created. The motivation for this is described in @tizianopiccardi's paper.
Feb 1 2021
Jan 22 2021
Jan 18 2021
Jan 14 2021
Thanks for the information @fgiunchedi.
Jan 13 2021
@elukey, thanks for the background and for adding my user to Hue - I was able to login.
Jan 12 2021
I am trying to access Hue, and after looking at these tasks requesting access for Hue (T271602 and T252703) I am not sure whether there is a template task for requesting access. If I understand @elukey's comment correctly, there is a newish way to request UI credentials only, instead of the more involved ssh access. However, I would imagine that most people requesting ssh access will also end up using the UI, so would it make sense to create the UI-based creds as part of this task as well?
Jan 5 2021
Hi, revisiting this subject! With T220081 the swift cluster is reachable from analytics, does this allow us to proceed with one or both of the options described?
Nov 20 2020
Nov 18 2020
Thanks for the explanation, it makes sense now.
Thanks for the assistance! I am trying to connect; the key seems to have propagated, but I see the following error:
$ ssh -v bast1002.eqiad.wmnet
[snip]
debug1: Next authentication method: publickey
debug1: Offering public key: /home/fab/.ssh/wmf_prod ED25519 SHA256:iIFh8ZfJOewuqKKZgStkfmPejgsYEgZC0a9FutV860M explicit agent
debug1: Server accepts key: /home/fab/.ssh/wmf_prod ED25519 SHA256:iIFh8ZfJOewuqKKZgStkfmPejgsYEgZC0a9FutV860M explicit agent
debug1: Authentication succeeded (publickey).
Authenticated to bast1002.wikimedia.org ([220.127.116.11]:22).
debug1: channel_connect_stdio_fwd bast1002.eqiad.wmnet:22
debug1: channel 0: new [stdio-forward]
debug1: getpeername failed: Bad file descriptor
debug1: Requesting email@example.com
debug1: Entering interactive session.
debug1: pledge: network
debug1: client_input_global_request: rtype firstname.lastname@example.org want_reply 0
channel 0: open failed: administratively prohibited: open failed
stdio forwarding failed
kex_exchange_identification: Connection closed by remote host
Nov 16 2020
Also, I noticed that there is a previous, outdated entry for me in that YAML file: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/admin/data/data.yaml$1733
Thanks. I did create a separate task for the analytics-privatedata-users group, which seemingly wasn't necessary. https://phabricator.wikimedia.org/T267816