
Cloud VPS open exception request
Closed, DeclinedPublic

Description

As part of the upcoming May hackathon, we'd like to showcase some AI models that tool developers might think about using in their work. While some models are unequivocally open (the code is open-source; the model is also explicitly open-source), other models that we are curious about come with use restrictions -- e.g., non-commercial clauses (example) or ethical use restrictions such as not using the model for the purposes of creating disinformation (example). These models do not currently meet the OSI Open Source Definition, but OSI's recent report on this indicates there is active discussion about what "open" should look like for AI. While I don't think we're ready to take a stance on whether these use restrictions are acceptable for our systems, I do think they qualify as a reasonable temporary exception for the purpose of testing / exploring for the hackathon.

Why I think it qualifies for an exception:

  • We still would only be using OSI-licensed code (MIT, Apache, etc.) -- this is just about the model weights, which are the more debatable area.
  • This is a temporary usage for the purposes of the hackathon, and I'd be happy to discuss a mandatory removal date after the hackathon.
  • The usage is not for supporting a tool currently used on-wiki but for exploring model outputs -- i.e., akin to testing compatibility with developer needs.


Event Timeline

Thanks for opening this ticket! I've added this to the agenda for next week's team meeting for consideration. In the meantime, comments welcome!

Non-Commercial and ethical use restrictions violate the "No discrimination against fields of endeavor" clause of The Open Source Definition.

> Thanks for opening this ticket! I've added this to the agenda for next week's team meeting for consideration.

Thanks @nskaggs -- don't hesitate to let me know if any additional details would be useful for folks to know. FYI I'll be out the latter half of this week so if you have any clarifying questions, I might not get back to you until next week.

> Non-Commercial and ethical use restrictions violate the "No discrimination against fields of endeavor" clause of The Open Source Definition.

Thanks @bd808 for pulling up the specific clause that I'm asking for an exemption from.

We met about this today and are not going to approve this exception -- you'll need to find another cloud if you need to run this stuff.

There are a few reasons why we're declining this, but the main one is wanting to ensure that everyone (both in the WMF and outside) has the same experience on our platform. If someone wants to open up a discussion about varying our FOSS policies to include these newfangled 'Open Source except for XXX' licenses, I'd be interested in having that discussion, but it will need to happen in the general case rather than around one particular product.

Thank you for asking :)

+1 to having a broader discussion. It would be helpful for me to understand more about what the "code" versus "model" looks like.

However, I looked quickly at https://bigscience.huggingface.co/blog/the-bigscience-rail-license. This ethical licensing feels a bit like the same understanding that Creative Commons, BSD, and even the Wikimedia community have had to come to: someone might reuse your work in ways you don't like, be it commercially or otherwise. Attempting to restrict behavior around ethics is likely both a losing endeavor and a short-sighted one for licensing. It defeats the purpose of being open, and ultimately prevents collaboration. Speaking only for myself, as "nice" as the intention might seem, I don't think it fits with the CC BY-SA and open-source licensing methodology the movement has adopted as a whole.

As a random aside, I saw the announcement today about StableLM, which uses CC BY-SA for base models: https://github.com/stability-AI/stableLM/#licenses. Specific to the hackathon, can you use an OSI-compatible model? Would GPU access help?

Thanks for the response and additional engagement. I don't expect the conclusion to change, but some additional context / thoughts:

> There are a few reasons why we're declining this, but the main one is wanting to ensure that everyone (both in the WMF and outside) has the same experience on our platform.

I'll admit that I'm a bit confused about this reasoning. The purpose of an exception is to carve out a clear instance where it's okay for the experience to vary, such as testing compatibility with proprietary software as mentioned in the current Terms of use. I thought I had crafted this request to match that suggested exception pretty closely (time-bound; for the purposes of experimentation/testing). What I'm hearing though is that exceptions to the current open-source definition are not being considered without a broader discussion?

> +1 to having a broader discussion. It would be helpful for me to understand more about what the "code" versus "model" looks like.

Much longer debate than should happen on this ticket but, for instance, a model has at least three separate pieces that are often treated independently: the final model artifact (a bunch of numbers, essentially), the code used to train the model, and the data that was fed into the model. Right now, the question of AI licenses seems to focus heavily on the final model artifact. The question of whether the training code is open-source is pretty trivial (the code itself is often very basic) but obviously an easy thing to expect. Where we'd get most of the traditional benefits of open (many eyes, ability for anyone to contribute/fork, etc.) is around the underlying data. That's the best way to understand how the model was built, where it might fail, etc., and to make contributions to improving it. For example, for the StableLM example you gave:

  • They release the base model artifact as CC-BY-SA 4.0, which would meet our expectations around content licenses.
  • They say the underlying dataset is based on The Pile (which is open in the sense that it is freely downloadable, though the data within it is variably licensed -- some explicitly open-source, others incorporated under fair use, such as Common Crawl). They don't share what their particular dataset is yet, though, so I wouldn't consider it open.
  • I didn't see their training code shared -- they just say "An upcoming technical report will document the model specifications and the training settings." The code in that repo (which is Apache 2.0) is just examples of how to use the model.

So from my standpoint, it may seem like StableLM is a better fit because its model artifact (what I would download and use on Cloud VPS) is openly-licensed, but in reality I know far, far less about that model than BLOOM or some of the Responsible AI Licensed models that have much more transparency about their process.
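To make the "three pieces" distinction above concrete, here is a minimal, purely illustrative NumPy sketch (the model, data, and file handling are all invented for illustration, not taken from any real model): the artifact is just serialized numbers, separable from the training code and the training data.

```python
import io
import numpy as np

# (1) Training code: often short and trivially open-sourceable.
def train(X, y, lr=0.1, steps=200):
    """Fit a linear model with plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

# (2) Training data: where most of the transparency questions live.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0])

# (3) Model artifact: "a bunch of numbers" that can be shipped -- and
# licensed -- independently of the code and data above.
weights = train(X, y)
buf = io.BytesIO()          # stand-in for a weights file on disk
np.save(buf, weights)
buf.seek(0)
reloaded = np.load(buf)     # usable with no access to the code or the data
```

The point of the sketch: whoever receives only the saved array can run the model without ever seeing pieces (1) or (2), which is why the licensing debate splits along these lines.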

Some additional reading about other aspects here and here.

> As a random aside, I saw the announcement today about StableLM, which uses CC BY-SA for base models: https://github.com/stability-AI/stableLM/#licenses. Specific to the hackathon, can you use an OSI-compatible model?

Thanks for mentioning this. Building on the above, the reason I requested the exception is not that there aren't open-source models available for some of these tasks but that there's been community interest in models like BLOOM, given their collaborative approach to training the model and care around data governance, carbon costs, etc. (i.e., ethics beyond just the question of licensing model artifacts).

> Would GPU access help?

I'd definitely be curious if Cloud VPS is considering incorporating GPUs. For many modern ML models, they are essentially required for reasonable performance.
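As context for the GPU point, the common pattern in PyTorch-style code is to probe for a GPU and fall back to CPU, where large models still run but far more slowly. This is a hedged sketch: it assumes PyTorch purely as an example framework and guards the import so it also runs where torch isn't installed.

```python
# Sketch: pick a GPU when one is available, otherwise fall back to CPU.
# PyTorch is used only as an illustrative framework here.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"  # no torch in this environment

print(f"model would run on: {device}")
```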

> Thanks for the response and additional engagement. I don't expect the conclusion to change, but some additional context / thoughts:

>> There are a few reasons why we're declining this, but the main one is wanting to ensure that everyone (both in the WMF and outside) has the same experience on our platform.

> I'll admit that I'm a bit confused about this reasoning. The purpose of an exception is to carve out a clear instance where it's okay for the experience to vary, such as testing compatibility with proprietary software as mentioned in the current Terms of use. I thought I had crafted this request to match that suggested exception pretty closely (time-bound; for the purposes of experimentation/testing). What I'm hearing though is that exceptions to the current open-source definition are not being considered without a broader discussion?

WMCS treats every request distinctly, so future exceptions are possible and will be evaluated on their own merits. Sorry if that seems like a boilerplate answer, but, for example, if you wish to personally test a non-OSS model in the future, please do ask, with context. Don't assume this specific answer invalidates all future requests as well.

>> +1 to having a broader discussion. It would be helpful for me to understand more about what the "code" versus "model" looks like.

> Much longer debate than should happen on this ticket but, for instance, a model has at least three separate pieces that are often treated independently: the final model artifact (a bunch of numbers, essentially), the code used to train the model, and the data that was fed into the model. Right now, the question of AI licenses seems to focus heavily on the final model artifact. The question of whether the training code is open-source is pretty trivial (the code itself is often very basic) but obviously an easy thing to expect. Where we'd get most of the traditional benefits of open (many eyes, ability for anyone to contribute/fork, etc.) is around the underlying data. That's the best way to understand how the model was built, where it might fail, etc., and to make contributions to improving it. For example, for the StableLM example you gave:

>   • They release the base model artifact as CC-BY-SA 4.0, which would meet our expectations around content licenses.
>   • They say the underlying dataset is based on The Pile (which is open in the sense that it is freely downloadable, though the data within it is variably licensed -- some explicitly open-source, others incorporated under fair use, such as Common Crawl). They don't share what their particular dataset is yet, though, so I wouldn't consider it open.
>   • I didn't see their training code shared -- they just say "An upcoming technical report will document the model specifications and the training settings." The code in that repo (which is Apache 2.0) is just examples of how to use the model.

> So from my standpoint, it may seem like StableLM is a better fit because its model artifact (what I would download and use on Cloud VPS) is openly-licensed, but in reality I know far, far less about that model than BLOOM or some of the Responsible AI Licensed models that have much more transparency about their process.

> Some additional reading about other aspects here and here.

So by "the Pile," it seems you are referring to https://pile.eleuther.ai/. They don't give any licensing information at all. However, https://github.com/EleutherAI/the-pile has some of the details you are talking about. The code to process the datasets is MIT, but clearly the source data is a myriad of things. Looking further into the paper published about the dataset answers more of the technical questions; see https://arxiv.org/pdf/2201.07311.pdf. Quoting from there: "All data contained in the Pile has been heavily processed to aid in language modeling research and no copyrighted text is contained in the Pile in its original form. As far as we are aware, under U.S. copyright law use of copyrighted texts in the Pile falls under the “fair use” doctrine, which allows for the unlicensed use of copyright-protected works in some circumstances." I understand you could remix the Pile to include / exclude datasets -- for example, basing the dataset only on Wikipedia and other CC BY-SA licensed content. Based on the description, however, I'm not sure what "heavily processed" and "no copyrighted text" really mean.
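The remixing idea above can be sketched as a simple license filter over a component manifest. The component names and license strings below are hypothetical placeholders, not taken from the actual Pile manifest:

```python
# Hypothetical corpus manifest -- names and licenses are illustrative only.
components = [
    {"name": "Wikipedia (en)",    "license": "CC-BY-SA-3.0"},
    {"name": "Project Gutenberg", "license": "Public Domain"},
    {"name": "Web crawl subset",  "license": "unclear/fair-use"},
]

# Keep only the components whose licenses we are prepared to accept.
ACCEPTED = {"CC-BY-SA-3.0", "CC-BY-SA-4.0", "Public Domain"}
remix = [c["name"] for c in components if c["license"] in ACCEPTED]
```

Under this kind of filter, a "CC BY-SA only" remix would drop anything whose license is unclear or merely fair-use-based, which is exactly the trade-off between corpus size and license clarity.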

>> As a random aside, I saw the announcement today about StableLM, which uses CC BY-SA for base models: https://github.com/stability-AI/stableLM/#licenses. Specific to the hackathon, can you use an OSI-compatible model?

> Thanks for mentioning this. Building on the above, the reason I requested the exception is not that there aren't open-source models available for some of these tasks but that there's been community interest in models like BLOOM, given their collaborative approach to training the model and care around data governance, carbon costs, etc. (i.e., ethics beyond just the question of licensing model artifacts).

Ethical licensing isn't compatible with the existing WMCS policy (which dictates OSI-approved licensing), which is the issue in question here. Given what you've shared, though, I'm curious how anyone can take a mix of data sources and license it under any terms, really. They aren't the creators of the source content used to create the dataset.

>> Would GPU access help?

> I'd definitely be curious if Cloud VPS is considering incorporating GPUs. For many modern ML models, they are essentially required for reasonable performance.

Let's discuss more about how many GPUs would be needed in order to be meaningful and useful.

I want to continue discussing, but let's open a new ticket and discuss how best to support you and the community's needs for AI, LLMs, etc. on Cloud VPS. I personally want to support and encourage these workloads if possible.