
Supporting AI, LLM, and data models on WMCS
Open, LowPublic

Description

Given the concerns around licensing of data models raised in T333856: Cloud VPS open exception request, let's discuss how best to support and enable work in this space, in support of the Wikimedia movement, on Wikimedia Cloud Services (WMCS). I believe it is in the interest and health of the Wikimedia movement to support this emerging field, and to utilize our existing platforms and resources, namely WMCS, to do so.

However, some questions remain:

Licensing

  • Given that many models lack clear licensing and/or carry licenses incompatible with OSI licensing, what can users run on WMCS? Is any work being done in the data community to create OSI-compatible models? Can something like The Pile, which is freely available with no other apparent license, be used on WMCS?
  • According to @Isaac, T333856#8805764, "a model has at least three separate pieces that are often treated independently: the final model artifact (a bunch of numbers essentially), the code used to train the model, and the data that was fed into the model". Interpreting this, what requirements are imposed by WMCS upon each piece of the model? Must each piece used also be OSI-compatible? If not, which pieces must be? Is this scenario similar to utilizing non-free hardware and software to create an image which is then openly licensed?
  • What requirements does WMCS impose on non-code objects stored in WMCS? Are models code? If not, then what requirements should we place on them?

Hardware Requirements

  • How much disk, RAM, CPU might be needed? Can we meet those needs with our existing hardware?
  • Are GPUs required? If so, how many? How would access be controlled?

Non-blocking questions

  • What projects exist that wish to explore these fields? What goals / outcomes do they have?
  • Are multiple independent projects needed, created on request by any party? Or could the collective work be consolidated into a few primary projects?

The goal of this ticket is to discuss and collect feedback on the listed questions, and then to update Wikitech, WMCS policies, etc. as required in accordance with any decisions made.

Event Timeline

Sorry for the delay in commenting on this, but thanks for putting together the task @nskaggs! Adding @calbon too so he's aware and can give thoughts on how to possibly balance the allowances of LiftWing and Cloud Services in this space.

Around licensing: there are some ongoing discussions at OSI that I'm aware of around what "open" means for AI models, and I look forward to seeing their outcome. We'll then have to decide whether their decision meets our needs / values, but I don't want to guess at that too much just yet. cc @SSpalding-WMF so you're aware of this need for clarifying AI licensing too.

Around hardware requirements: one outcome from the Hackathon and discussions around AI on Cloud Services is that the main value Cloud Services can likely offer in this arena is making it easy to prototype new models. Training models is pretty computationally intense and hard to imagine being well supported. Even hosting some of the mid-sized language models purely for inference/serving has gotten to a stage where you pretty much need GPUs for them to be reasonably performant. But Chris will have much more informed thoughts on what this could look like.
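As a rough, hedged illustration of why serving mid-sized models pushes toward GPUs, here is a back-of-envelope sketch (my own illustrative numbers, not an official sizing) of the memory needed just to hold model weights at inference time:

```python
# Back-of-envelope estimate of the RAM/VRAM needed merely to hold model
# weights for inference, at different numeric precisions. Illustrative
# sizes only; real memory use adds activations, KV-cache, and runtime
# overhead on top of this lower bound.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(n_params: float, precision: str) -> float:
    """GiB needed just to store n_params weights at the given precision."""
    return n_params * BYTES_PER_PARAM[precision] / 2**30

for n_params, label in [(125e6, "125M (small BERT-class)"),
                        (7e9, "7B (mid-sized LLM)")]:
    for precision in ("fp16", "int8"):
        print(f"{label} @ {precision}: "
              f"~{weight_memory_gb(n_params, precision):.1f} GiB")
```

Weights alone put a 7B-parameter model around 13 GiB at fp16, before activations or caching, which is already beyond a comfortable CPU-only Cloud VPS flavor.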

Around use-cases: we're working to gather some common use-cases for AI models so that we can provide them as services (rather than expecting each person who wants to use them to use an external API or figure out how to self-host). The first one is around text summarization (T342614) led by @MGerlach. Obviously not everything can be covered in this way but surfacing these use-cases and building general purpose APIs for them is a nice potential complement to providing more individual hardware.

Can something like The Pile, which is free with no other apparent license, be used on WMCS?

The Pile contains the dataset "books3", which is a collection of out-of-copyright books from Project Gutenberg, but also a huge number of pirated ebooks. So at least the books3 part cannot be used.

The next question is: can we use any of the models that were trained using The Pile / books3?

fnegri subscribed.

There's no pending discussion at the moment, so I'm moving this task out of "Needs discussion" column and back to the inbox. Feel free to leave a comment if you would like this to be prioritized.

How much disk, RAM, CPU might be needed? Can we meet those needs with our existing hardware?
Are GPUs required? If so, how many? How would access be controlled?

Maybe FYI @mfossati?

Given the lack of clear licensing and/or incompatible with OSI licensing, what can users run on WMCS?

The OSI has published The Open Source AI Definition – 1.0 (OSAID). Their FAQ includes a section on known compliant systems. It lists 5 that have passed their Validation phase of analysis:

The FAQ also states:

These results should be seen as part of the definitional process, a learning moment, they're not certifications of any kind. OSI will continue to validate only legal documents, and will not validate or review individual AI systems, just as it does not validate or review software projects.

I take this to mean that there are currently no systems which are certified to meet the OSAID.

@bd808 for posterity, a larger list is also available at https://github.com/eugeneyan/open-llms and maybe one of them is, or at some point will be, OSI-compatible.

Some of the so-called open LLMs have a questionable license clause about non-competition. For instance, the Llama license https://github.com/meta-llama/llama/blob/main/LICENSE states: "You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof)." I think such clauses are show-stoppers in a Wikimedia context. Much of Wikimedia content is used to train LLMs, and output from a "Wikimedia-LLM" should have no such restrictions.

@Isaac I know you are following this space quite closely – any new thoughts since your comment from 2023?

How much disk, RAM, CPU might be needed? Can we meet those needs with our existing hardware?
Are GPUs required? If so, how many? How would access be controlled?

Maybe FYI @mfossati?

Off the top of my head: hardware requirements strongly depend on which models we'd like to train here. The lower bound, for the simplest models, is a few MBs of disk space, a CPU, and less than one GB of RAM. On the other hand, I don't think it's realistic to train full-fledged LLMs within the Cloud Services infrastructure, as they need huge computational resources.
These thoughts apply to training models; serving them is a totally different story.
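To make the low end of that range concrete, here is a toy sketch (made-up data, not a real Wikimedia use-case) of one of those "simplest models": a logistic regression trained from scratch in pure Python, whose entire trained state is three floats:

```python
# A model at the very bottom of the hardware-requirements range: logistic
# regression trained with plain stochastic gradient descent. The model is
# three floats (two weights and a bias) and trains in well under a second
# on one CPU core, with negligible disk and RAM. Toy data for illustration.

import math
import random

def train_logreg(xs, ys, lr=0.1, epochs=200):
    """Fit weights w and bias b on 2-feature data via gradient descent."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
            g = p - y                      # gradient of the log-loss
            w[0] -= lr * g * x[0]
            w[1] -= lr * g * x[1]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b))) > 0.5

# Linearly separable toy data: class 1 iff x0 + x1 > 1.
random.seed(0)
xs = [(random.random(), random.random()) for _ in range(200)]
ys = [1 if x[0] + x[1] > 1 else 0 for x in xs]
w, b = train_logreg(xs, ys)
acc = sum(predict(w, b, x) == y for x, y in zip(xs, ys)) / len(xs)
print(f"training accuracy: {acc:.2f}")
```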

Their FAQ includes a section on known compliant systems. It lists 5 that have passed their Validation phase of analysis:
...
I take this to mean that there are currently no systems which are certified to meet the OSAID.

@bd808 thanks for surfacing this, and @Slst2020 for the ping. Adding some of my thoughts: I don't think they'll ever "certify" models as meeting OSAID, because it's not as simple as adding an appropriate license; it requires meeting a much larger set of transparency requirements. Those five models are a reasonable starting point, and I would hope acceptable for WMCS, but folks will reasonably want to use newer/different models depending on their use-cases. For example, out of the five, only T5 is multilingual, and even then it covers just English, French, Romanian, and German. The open requirement on WMCS is particularly challenging (not necessarily unreasonable, just calling out the difficulty of following it at this stage): while it's easy to say what doesn't meet the requirements (trivially, their licenses are not open, as mentioned by a few folks above), it's much harder to say which models with Apache, MIT, etc. licenses also meet the remaining OSAID requirements. For instance, Mixtral has an Apache license but is explicitly called out in that FAQ as not meeting OSAID requirements, presumably because its training data isn't disclosed. I think if WMCS is serious about supporting AI models, some easy process for certifying new models will likely be needed, as I'm not aware of any official external list.

Off the top of my head: hardware requirements strongly depend on which models we'd like to train here. The lower bound, for the simplest models, is a few MBs of disk space, a CPU, and less than one GB of RAM. On the other hand, I don't think it's realistic to train full-fledged LLMs within the Cloud Services infrastructure, as they need huge computational resources.
These thoughts apply to training models; serving them is a totally different story.

+1 to what @mfossati says. I think training opportunities will unfortunately be extremely limited, probably to fine-tuning the smaller models (low hundreds of millions of parameters), because the memory requirements for training are generally much higher than for serving, though that space is developing rapidly. As technology progresses, serving pretty good models with relatively normal amounts of CPU is becoming more feasible too. Already, many of the sentence-transformer models are reasonable to run for inference on WMCS and great for a variety of tasks related to search (though I'm not sure how many would meet OSAID's requirements). @Slst2020 and I demonstrated how to create a natural-language search interface for Wikitech at the 2023 Hackathon (T333853) with one of these models, and similar tools could be created for, e.g., Wikipedia Policy/Help documentation, Quarry queries, SPARQL example queries, or I'm sure other aspects of the Wikimedia ecosystem. I haven't tried, but I suspect classification models like the multilingual revert-risk model (vandalism detection) could also be reasonably hosted on WMCS, though the latency likely won't be good enough for real-time use-cases. FWIW, my guess is that the underlying multilingual BERT model used by revert-risk does meet OSAID requirements, as it seems to be trained just on Wikimedia dumps and they release a ton of supporting code, but I didn't verify each requirement.
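The retrieval step behind such a natural-language search interface can be sketched as below. The embedding model is stubbed out with toy 3-d vectors (a real deployment would obtain them from a sentence-transformer's encode step), so only the cosine-similarity ranking logic is shown:

```python
# Sketch of semantic search: documents and query are represented as
# embedding vectors and ranked by cosine similarity. The vectors here are
# hand-made stand-ins so the ranking logic runs without ML dependencies.

import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query_vec, doc_vecs, top_k=2):
    """Return indices of the top_k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:top_k]

# Toy embeddings; in practice these would come from an embedding model.
docs = [(1.0, 0.1, 0.0),   # 0: "how to create a Cloud VPS project"
        (0.0, 1.0, 0.2),   # 1: "Quarry query examples"
        (0.9, 0.2, 0.1)]   # 2: "requesting quota on Cloud VPS"
query = (1.0, 0.0, 0.1)    # "cloud vps quota"
print(search(query, docs))  # → [0, 2]
```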

May I offer a different perspective? While it is pretty clear that we want "programs" run on WMCS to meet OSI requirements, it doesn't have to be the case that the AI model itself runs on WMCS. We already use capabilities in Wikimedia projects that rely on external resources that are not OSI-compatible. For instance, we use Google Images and TinEye to perform reverse image searches. The piece of code that refers the user to them lives on WM projects (a JS script in the MediaWiki namespace on Commons) and meets OSI requirements, but the underlying service doesn't.

We also purchase commercial data and use it in our projects. For instance, the geodata for IP addresses is retrieved by querying MaxMind data, which is not free data.

So, is there a reason to think that WM could not, for instance, purchase a subscription to Azure OpenAI, write OSI-compatible code to create a "wrapper" for it on WMCS, and allow WMCS users a certain amount of use of such an API for performing inference tasks? This would of course be limited and throttled, and larger use cases would require approval.
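A minimal sketch of what such a throttled wrapper could look like. The token-bucket limiter is a standard technique; `call_external_api` is a hypothetical stand-in for whichever vendor SDK would actually be used, and nothing here reflects an existing WMCS service:

```python
# Sketch of a per-user throttled proxy for an external inference API,
# using a token bucket: each user gets a burst allowance (capacity) that
# refills at a steady rate, and requests beyond it are rejected.

import time

class TokenBucket:
    """Allow up to `rate` requests/second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def call_external_api(prompt: str) -> str:
    """Placeholder for the real vendor call (hypothetical)."""
    return f"response to: {prompt}"

buckets: dict[str, TokenBucket] = {}

def inference(user: str, prompt: str) -> str:
    bucket = buckets.setdefault(user, TokenBucket(rate=1.0, capacity=3))
    if not bucket.allow():
        raise RuntimeError("quota exceeded; try again later")
    return call_external_api(prompt)
```

Larger use cases could then be handled by granting specific users a bigger `rate`/`capacity` after approval.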

I keep occasionally getting pinged about this general topic on fawiki. Various users there envision a lot of value in having LLMs help with translation, template editing, etc. Has there been any progress here? Has the OSI certified any of the open models?

I found this repo with a list of LLMs and their licenses: https://github.com/eugeneyan/open-llms/blob/main/README.md
I don't believe we should wait for the top-of-the-line open-ish models to become OSI-certified. We can easily pick some of the weaker models that use OSI-certified licenses and work with those. It doesn't have to be the best; it just needs to be useful. And we all agree the community wants this.