User Details
- User Since: Jun 16 2020, 4:12 PM (154 w, 6 d)
- Availability: Available
- IRC Nick: balloons
- LDAP User: Nskaggs
- MediaWiki User: NSkaggs (WMF)
Yesterday
I can't find a tool with the specific name wmr-bot. @wmr, do you know the name of your tool within Toolforge?
Thu, Jun 1
Note that chart data can be exported as JSON, Excel, image, or CSV. SQL Lab export seems limited to CSV.
The documentation page for PAWS might serve as an inspiration: https://wikitech.wikimedia.org/wiki/PAWS
Wed, May 31
I presume resolving T337848: WMCS-roots access would unblock the situation you described, yes? And if I'm reading correctly, your desire is to help in WMCS spaces longer term, yes? That would point to wmcs-roots as the correct level of access. On an ongoing basis, it would also be helpful to understand other situations where wmcs-roots access isn't enough, to help ensure a WMCS root can manage all WMCS hosts.
For apt support, T336669: Decision request - How to provide a way to install system dependencies for buildpack-based images decided on using https://github.com/heroku/heroku-buildpack-apt. Would this unlock the migration?
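For context, and only as a hedged sketch: upstream heroku-buildpack-apt reads an Aptfile at the repository root and installs the listed Debian packages at image build time. If the Toolforge build service wires it up the same way, declaring system dependencies would look roughly like this (the package names are placeholders):

```
# Aptfile -- one Debian package per line; the buildpack installs each
# entry with apt during the image build (package names are illustrative)
libicu-dev
imagemagick
```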
Fri, May 26
I added a description and created this sub-project here: https://phabricator.wikimedia.org/project/view/6568/
It's possible you may wish to force both tags to have prefixes or suffixes. Updating the tags to have public and NDA/restricted prefixes or suffixes might make more sense in the long term than attaching them to a specific team name or prefix.
@Aklapper https://wikitech.wikimedia.org/wiki/Superset has a disambiguation on it, if that helps. Simply put, Data Engineering runs a restricted Superset instance, superset.wikimedia.org, while WMCS has launched a public instance at superset.wmcloud.org. The existing tags and documentation only spoke about Data Engineering's restricted instance. With the addition of the public instance, we need to keep them separate and clear. I might update the existing project to say something like
Design, implementation and maintenance of the NDA-restricted Wikimedia Superset instance(s), administered by Data Engineering....
Thu, May 25
Of the options listed, I agree with going with option 2. I would be curious to hear about any other concerns with enabling it unrestricted for anyone using buildpacks. I don't feel the mentioned issue of potential TOU violations warrants the extra work required for option 1. Any other concerns beyond image size?
+1
Wed, May 24
+1 to the request.
+1 to the request.
Mon, May 22
Wed, May 17
I've created https://wikitech.wikimedia.org/w/index.php?title=Superset to provide some more details and potentially hold some basic user documentation for Superset. Edits welcome!
WMCS treats every request distinctly, so future exceptions are possible and will be evaluated on their own merit. Sorry if that seems like a boilerplate answer, but, for example, if you wish to personally test a non-OSS model in the future, please do ask, with context. Don't assume this specific answer invalidates all future requests as well.
+1 to having a broader discussion. It would be helpful for me to understand more about what the "code" versus "model" looks like.
This is a much longer debate than should happen on this ticket, but, for instance, a model has at least three separate pieces that are often treated independently: the final model artifact (essentially a bunch of numbers), the code used to train the model, and the data that was fed into the model. Right now, the question of AI licenses seems to focus heavily on the final model artifact. The question of whether the training code is open-source is pretty trivial (the code itself is often very basic) but obviously an easy thing to expect. Where we'd get most of the traditional benefits of open (many eyes, ability for anyone to contribute/fork, etc.) is around the underlying data. That's the best way to understand how the model was built, where it might fail, etc., and to make contributions to improving it. For example, for the stableLM example you gave:
- They release the base model artifact as CC-BY-SA 4.0, which would meet our expectations around content licenses.
- They say the underlying dataset is based on The Pile (which is open in the sense that it is freely downloadable, though the data within it is variably licensed -- some explicitly open-source, other parts incorporated under fair use, such as Common Crawl). They don't share what their particular dataset is yet, though, so I wouldn't consider it open.
- I didn't see their training code shared -- they just say "An upcoming technical report will document the model specifications and the training settings." The code in that repo, which is Apache 2.0, is just examples of how to use the model.
So from my standpoint, it may seem like stableLM is a better fit because its model artifact (what I would download and use on Cloud VPS) is openly licensed, but in reality I know far, far less about that model than BLOOM or some of the Responsible AI Licensed models that have much more transparency about their process.
So by the Pile, it seems you are referring to https://pile.eleuther.ai/. They don't give any licensing information at all. However, https://github.com/EleutherAI/the-pile has some of the details you are talking about. The code to process the datasets is MIT, but clearly the source data is a myriad of things. The paper published about the dataset answers more technical questions; see https://arxiv.org/pdf/2201.07311.pdf. Quoting from there: "All data contained in the Pile has been heavily processed to aid in language modeling research and no copyrighted text is contained in the Pile in its original form. As far as we are aware, under U.S. copyright law use of copyrighted texts in the Pile falls under the 'fair use' doctrine, which allows for the unlicensed use of copyright-protected works in some circumstances." I understand you could remix the Pile to include or exclude datasets, for example by basing the dataset only on Wikipedia and other CC-BY-SA licensed content. Based on the description, however, I'm not sure what "heavily processed" and "no copyrighted text" really mean.
As a random aside, I saw the announcement today about StableLM, which is using CC-BY-SA for base models: https://github.com/stability-AI/stableLM/#licenses. Specific to the hackathon, can you use an OSI-compatible model?
Thanks for mentioning this. Building on the above, the reason I requested the exception is not because there aren't open-source models available for some of these tasks, but because there's been community interest in models like BLOOM given their collaborative approach to training the model and care around data governance, carbon costs, etc. (i.e. ethics beyond just the question of licensing model artifacts).
Ethical licensing isn't compatible with CC-BY-SA, which is the issue in question here. Given what you've shared, though, I'm curious how anyone can take a mix of data sources and license it under any terms, really. They aren't the creators of the source content used to create the dataset.
Would GPU access help?
I'd definitely be curious if Cloud VPS is considering incorporating GPUs. For many modern ML models, they are essentially required for reasonable performance.
Let's discuss how many GPUs would be needed in order to be meaningful and useful.
Mon, May 15
For docs, upstream has a few. Dashboards are a new idea, and can be collaborative, which I think is a useful feature that goes beyond what Quarry can offer. I believe people will be able to reason about a shared set of queries and data, and collaboratively edit and view said data via dashboards.
Fri, May 12
Seeing as Heroku doesn't seem to have a dotnet buildpack, I'm guessing it would be https://github.com/paketo-buildpacks/dotnet-core.
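For what it's worth, here's a rough sketch of how that buildpack is typically invoked with the upstream pack CLI; the app and builder names are illustrative, and the Toolforge build service may select buildpacks differently:

```
# Build an image with the Paketo .NET Core buildpack (names are examples)
pack build my-dotnet-tool \
  --builder paketobuildpacks/builder-jammy-base \
  --buildpack paketo-buildpacks/dotnet-core
```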
Thu, May 11
Wed, May 10
For now, marking as stalled/blocked on T279023 as noted. I'm doing so because I don't believe any advocates would move forward without the ability to automatically subscribe folks to the correct mailing list.
May 2 2023
@Jclark-ctr These should be set up with software RAID, just like last time. See @Andrew's comment: https://phabricator.wikimedia.org/T294972#8029219. @Andrew, feel free to correct or jump in if needed.
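Once the hosts are imaged, a quick sanity check that the software RAID assembled as expected could look like the following (standard mdadm tooling; the md device name here is an assumption and may differ on these hosts):

```
# List assembled md arrays and their member disks
cat /proc/mdstat
# Detailed state of a specific array (device name may differ)
sudo mdadm --detail /dev/md0
```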
Apr 28 2023
Apr 26 2023
Done. Updated CPU and deployment limits to 5 each. If you need more resources, don't hesitate to ask. My apologies for the delay in processing. Thanks for helping with the migration from Grid Engine!
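If it helps to double-check the new limits from the Kubernetes side, the resource quota for the tool's namespace can be inspected directly; this is a generic kubectl sketch, and the tool-example namespace is a placeholder following the usual tool-<name> convention:

```
# Show the resource quota and current usage for the tool's namespace
# (replace tool-example with the actual tool name)
kubectl describe resourcequota -n tool-example
```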
Apr 24 2023
This should now be created, with qgil as the sole admin. Feel free to add more as needed.
Apr 20 2023
Apr 19 2023
Apr 14 2023
Thanks for opening this ticket! I've added this to the agenda for next week's team meeting for consideration. In the meantime, comments welcome!
Apr 11 2023
Apr 7 2023
@whym Thanks for migrating! I will resolve this task for you given the tool is now running on k8s.
Thanks for sharing this idea, Taavi. If we allow a tool to be restarted as part of a transition to a different cluster, does that change the requirements mentioned? In general, are there other things we can do to simplify the list of requirements? For example, if some downtime were OK, does that change anything? If users weren't blocked on using kubectl, but also had no support if a k8s upgrade broke them, does that lessen the requirements in a useful way? What if we deployed a clone of Toolforge today, but did so with a current version of k8s? Could tools be migrated?
I think marking for deletion is enough. Thanks for your help!
Apr 5 2023
Note, WMCS would like to migrate directly to bookworm if possible.
As an update, this is now blocked on T297596: have cloud hardware servers in the cloud realm using a dedicated LB layer. The previous implementation discussion led to the finalization of guidelines, which are now published at https://wikitech.wikimedia.org/wiki/Cross-Realm_traffic_guidelines#Case_4:_cloud-dedicated_hardware. This implementation will conform to those guidelines, which requires T297596 (cloud-based load balancers) to exist. The network requirements are likely to shift to a new cloud-private + cloud-hosts VLAN setup, but otherwise I don't anticipate any changes required related to network or racking.
Apr 4 2023
+1