
[WE1.2.4] Detecting Peacock behavior with LLMs
Closed, ResolvedPublic

Description

Hypothesis: "If we train an LLM on detecting peacock behavior, then we can learn if it can detect this policy violation with at least >70% precision and >50% recall and ultimately, decide if said LLM is effective enough to power a new Edit Check and/or Suggested Edit."

Conclusion

Confirm if the hypothesis was supported or contradicted

The hypothesis was contradicted because the LLMs were not able to provide the expected precision and recall. With other AI-based models (i.e. smaller language models) we were close to reaching the target precision & recall, but the numbers were still below the thresholds.

Briefly describe what was accomplished over the course of the hypothesis work.

  • We tested two LLMs (AYA23 and Gemma) on a "peacock detection task", trying a set of prompts and learning strategies (zero-shot, few-shot).
  • Specifically, we tested on 522 examples (261 positive and 261 negative).
  • The best LLM result was AYA23 with a zero-shot approach, obtaining a precision of 0.54 and a recall of 0.24.
  • We saw that although the LLMs capture a signal (they work better than random), their precision & recall were low compared with simpler models such as smaller language models (BERT-like).
  • The baseline model, a fine-tuned BERT model, was tested with the same data and got a precision of 0.72 and a recall of 0.4, which is much better than the LLM approach but still below our target.

Major lessons

  • LLMs are a promising technology that makes it simple to build an AI-based model: with just a simple prompt, the LLM was able to capture a signal of "promotional language" in Wikipedia articles.
  • However, the precision and recall on this type of task are still not good enough to build a reliable product.
  • On the other hand, more established technologies, such as BERT, require more work to train a model but give better results.
  • In the future, when considering AI-based models for policy-violation detection, both approaches should be tested.
  • LLMs offer a lot of flexibility and can work in scenarios without much training data, and potentially one LLM could be used to cover several policies/tasks.
  • On the other hand, smaller models offer higher accuracy and need significantly fewer computational resources, but they require training data and may imply greater maintenance effort because they are task/policy specific.

Event Timeline

leila renamed this task from Detecting Peacook behavior with LLMs to [W.E.1.2.4] Detecting Peacook behavior with LLMs. Jul 2 2024, 10:00 PM
leila triaged this task as High priority.

Based on our previous research, we have created a dataset containing 9276 articles affected by peacock and other related policy violations on English Wikipedia. For each of them we have negative (no policy violations) and positive examples:

  • autobiography: 1472
  • fanpov: 350
  • peacock: 2587
  • weasel: 805
  • advert: 4062
  • Total: 9276

Also, reviewing the latest literature in the field, we found a recent study on detecting violations of the Neutral Point of View (NPOV) policy in Wikipedia. Researchers from the University of Michigan tested ChatGPT 3.5, Mistral-Medium, and GPT-4 for detecting NPOV violations and found poor performance: even after testing different prompt-engineering strategies, they were only able to reach 64% accuracy.

  • These results show the limitations of LLMs for detecting Wikipedia policy violations.
  • Nonetheless, it is important to highlight that our focus is on a (potentially) simpler policy, peacock behavior.
  • Note that the experiments in that paper were done using prompt engineering, while in our case we should also explore fine-tuning an LLM.
diego renamed this task from [W.E.1.2.4] Detecting Peacook behavior with LLMs to [W.E.1.2.4] Detecting Peacock behavior with LLMs. Jul 12 2024, 3:53 PM
  • Studied how to create prompts for Gemma2 and noticed the importance of using its special tokens and formatting (a sketch of the prompt format follows this list).
  • Designed a zero-shot experiment for detecting peacock behavior.
  • Wrote code for testing the Gemma2 instance hosted by the ML-team.
    • The instance took more than 5 seconds per query.
    • After a few requests (around 200) the instance stopped responding.
    • I've reported this issue to the ML-Team; my understanding is that they will be working on fixing it during the next week (cc: Chris Albon).
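
For reference, the zero-shot prompt looked roughly like the minimal sketch below. It assumes a plain text-generation endpoint for Gemma2; the endpoint URL, payload fields, and helper names are hypothetical placeholders, while the <start_of_turn>/<end_of_turn> markers follow Gemma's chat format.

```python
# Minimal sketch (hypothetical endpoint URL, payload fields, and helper names):
# build a zero-shot prompt for Gemma2 using its chat special tokens and ask
# for a binary YES/NO label.
import requests

GEMMA_ENDPOINT = "https://example.invalid/gemma2/generate"  # placeholder, not the real ML-team endpoint

def build_zero_shot_prompt(paragraph: str) -> str:
    instruction = (
        "You are reviewing Wikipedia text. Does the following paragraph use "
        "promotional 'peacock' language (puffery) instead of neutral, "
        "verifiable statements? Answer only YES or NO.\n\n"
        f"Paragraph:\n{paragraph}"
    )
    # Gemma-style chat formatting: a user turn, then open the model turn.
    return f"<start_of_turn>user\n{instruction}<end_of_turn>\n<start_of_turn>model\n"

def classify(paragraph: str) -> str:
    # Assumed payload/response shape; the real service may differ.
    payload = {"prompt": build_zero_shot_prompt(paragraph), "max_new_tokens": 4}
    response = requests.post(GEMMA_ENDPOINT, json=payload, timeout=30)
    response.raise_for_status()
    answer = response.json().get("generated_text", "").strip().upper()
    return "peacock" if answer.startswith("YES") else "no-peacock"
```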

Progress update

  • I've been coordinating with the ML-team, sharing code examples that make their (experimental) infrastructure fail. They will use this code in their use-case studies when testing new LLM infrastructure.
  • In the meantime I've been working on writing code to fine-tune smaller language models; this requires:
    • Data preprocessing and cleaning (done)
    • Experimental design (done)
    • Running experiments on the stat machines (in progress)
  • Met with the KR owner (Peter Pelberg) and explained the progress and next steps for this hypothesis.

Any new metrics related to the hypothesis

  • No

Any emerging blockers or risks

  • There is some "congestion" on the stat machines' GPUs (many users for few GPUs), which means waiting until a GPU is free. Having access to additional GPUs would help us work faster on this front.

Any unresolved dependencies - do you depend on another team that hasn’t already given you what you need? Are you on the hook to give another team something you aren’t able to give right now?

  • No

Have there been any new lessons from the hypothesis?

  • Not this week

Have there been any changes to the hypothesis scope or timeline?

  • No

Progress update

  • Fine-tuned model:
    • I've tested the fine-tuning approach, creating a classifier based on a smaller language model. I used BERT because we already have other products hosted on Liftwing based on this model, and it has shown that it is fast enough and scales well on our existing infrastructure (a minimal sketch of the setup follows this update).
    • I've run several experiments, testing different datasets and model configurations. The (best) results, expressed as precision and recall on a balanced dataset (same number of articles with and without peacock behavior), were:
      • Precision: 0.67
      • Recall: 0.15
    • These results are below our target, but should be considered a baseline to compare against the LLM experiments.
    • Depending on the results of those experiments, we should consider trying to improve the fine-tuning approach, because these numbers show that the model is learning (finding a signal in the data), and with some tweaks we could probably (significantly) improve its performance.
  • The ML-team is experimenting with a new LLM called AYA23. I've done a quick test, and the service seems fast and robust enough to run experiments on.
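
A minimal sketch of this fine-tuning setup, using the Hugging Face transformers and datasets libraries, is shown below. The file names, column names, and hyperparameters are illustrative assumptions, not the exact configuration used in the experiments.

```python
# Minimal sketch (illustrative file names, columns, and hyperparameters) of
# fine-tuning a BERT classifier for peacock detection with Hugging Face.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Assumed CSV columns: "text" (article snippet) and "label" (1 = peacock, 0 = clean).
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "validation.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="peacock-bert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```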

Any new metrics related to the hypothesis

  • Fine-tuned (BERT) model performance on detecting peacock behavior:
    • Training data: 9213 cases (unbalanced)
    • Validation data: 522 cases (balanced)
    • Precision: 0.67
    • Recall: 0.15

Any emerging blockers or risks

No

Any unresolved dependencies - do you depend on another team that hasn’t already given you what you need? Are you on the hook to give another team something you aren’t able to give right now?

No

Have there been any new lessons from the hypothesis?

No

Have there been any changes to the hypothesis scope or timeline?

No

Next steps

Run zero-shot experiments using the AYA23 LLM hosted by the ML-team.

Progress update

  • I've been working on the few-shot approach without good results. I've tried a set of prompts, changing the format, number, and distribution of examples, but the LLM used (AYA23) is not processing these examples correctly and is overfitting to one class (a sketch of the prompt structure follows this list).
  • In parallel I've been working on improving the fine-tuning approach by refining the hyperparameters. Currently I'm reaching 0.69 precision and 0.23 recall.
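
For context, the few-shot prompts were variations on the structure sketched below; the example sentences and labels are illustrative placeholders, not items from the dataset. Despite reordering and rebalancing the in-context examples, AYA23 kept collapsing to a single class.

```python
# Minimal sketch of how a few-shot prompt was assembled: a handful of
# labelled in-context examples followed by the paragraph to classify.
# The example sentences below are illustrative placeholders, not dataset items.
FEW_SHOT_EXAMPLES = [
    ("The company is a world-renowned, award-winning leader in innovation.", "YES"),
    ("The company was founded in 1998 and is headquartered in Oslo.", "NO"),
    ("She is widely regarded as one of the greatest artists of her generation.", "YES"),
    ("The album was released on 3 March 2001 by Columbia Records.", "NO"),
]

def build_few_shot_prompt(paragraph: str) -> str:
    header = ("Decide whether each paragraph uses promotional 'peacock' "
              "language. Answer YES or NO.\n\n")
    shots = "".join(f"Paragraph: {text}\nAnswer: {label}\n\n"
                    for text, label in FEW_SHOT_EXAMPLES)
    return header + shots + f"Paragraph: {paragraph}\nAnswer:"

print(build_few_shot_prompt("He is the most brilliant and celebrated inventor alive."))
```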

Have there been any new lessons from the hypothesis?

  • There are no well-established procedures for creating a successful few-shot prompt. After reviewing the literature, studying examples, and trying several prompts, this approach does not look like a good solution for detecting peacock behavior.

Have there been any changes to the hypothesis scope or timeline?

  • No

Progress update

  • This week I worked on improving the few-shot and fine-tuning experiments. Unfortunately, the few-shot approach didn't show relevant improvements, so I decided to discard it.
  • It is important to say that few-shot learning is a new technique, still under development, and there might be several reasons why it didn't work for this task. In any case, it might be worth exploring again in the future, once clearer procedures are established.
  • On the other hand, after some tweaks, the fine-tuned BERT model improved significantly, reaching 0.72 precision and 0.4 recall on balanced data.

Have there been any new lessons from the hypothesis?

  • The current results suggest that if we aim to detect at least 40% of the cases of peacock behavior, the model would be wrong in 28% of its assessments. This is below, but not far from, our target. I think there is a product decision to be made: whether we want to focus on precision (avoiding wrong classifications that could disturb editors' workflows) or on recall (trying to detect as many cases of peacock behavior as possible, even if that implies showing more false positives). A sketch of how this trade-off can be tuned follows this list.
  • So far, I've been focusing on the model's precision, without considering other factors such as serving time (how long it takes to get an answer from the model). This will depend on the resources we end up having and on the length of the revision being processed. If we decide to proceed with this project, I think we should have that conversation, to determine what a reasonable processing time is and whether it is achievable with our current resources.
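
Because the classifier outputs a score rather than a hard label, the precision/recall balance can be adjusted by moving the decision threshold. Below is a minimal sketch with scikit-learn; the labels and scores are small placeholder arrays standing in for the validation set.

```python
# Minimal sketch of the precision/recall trade-off: sweep the decision
# threshold over the classifier's predicted probabilities and read off the
# operating points. The arrays below are placeholders standing in for the
# 522-case validation set.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])                          # placeholder labels
y_scores = np.array([0.9, 0.2, 0.6, 0.4, 0.3, 0.7, 0.8, 0.1, 0.55, 0.35])  # placeholder P(peacock)

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)
for p, r, t in zip(precisions, recalls, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```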

Have there been any changes to the hypothesis scope or timeline?

  • No

Progress update

  • Experiments:
    • As planned, I studied the ability of the model fine-tuned for peacock behavior to detect other promotion-related content issues described in this dataset.
    • I ran the model on 4 other datasets: {{fanpov}}, {{advert}}, {{autobiography}}, {{weasel}}
    • The results (see below) show behavior similar to the peacock detection task: good precision and low recall (lower for templates other than peacock). This suggests that the model can detect information about promotional tone, and depending on the setup it could favor precision or recall.
  • Coordination:
    • We had a meeting with Peter Pelberg, Nicola Ayub, and Megan Neisler to discuss next steps.
    • First, we decided that the model needs to be tested against a simple baseline, which can be just a string-matching approach looking for common peacock keywords. I'll be working on this during the next week(s) (note that I'll be OoO a few days during the next two weeks).
    • Peter is going to decide whether we want to go deeper on this specific task and analyze the other factors involved in turning this model into a product (serving time, UX, etc.), or whether we should work on other tasks that involve ML and user experiences.

Any new metrics related to the hypothesis

  • Here are the results for the model fine-tuned to detect peacock behavior on the other, similar {{templates}}. Remember that we report precision & recall for simplicity and explainability; the model is trained using the F1 score (the harmonic mean of precision and recall). A quick F1 computation for these rows follows the table.
  • Also remember that there is a trade-off between precision and recall: we could improve recall by decreasing precision.
                  Precision   Recall
  peacock         0.72        0.40
  fanpov          0.75        0.08
  advert          0.78        0.10
  autobiography   0.71        0.07
  weasel          0.71        0.06
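
For reference, the F1 score mentioned above is the harmonic mean of precision and recall; the short snippet below computes it for the rows reported in the table.

```python
# F1 is the harmonic mean of precision and recall: F1 = 2 * P * R / (P + R).
# Computed here for the precision/recall figures reported in the table above.
results = {
    "peacock": (0.72, 0.40),
    "fanpov": (0.75, 0.08),
    "advert": (0.78, 0.10),
    "autobiography": (0.71, 0.07),
    "weasel": (0.71, 0.06),
}

for template, (precision, recall) in results.items():
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{template}: F1 = {f1:.2f}")
```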

Have there been any new lessons from the hypothesis?

  • We should compare these results with a simpler model based on keywords.

Have there been any changes to the hypothesis scope or timeline?

  • No

Progress update

  • I'm working on building a set of keywords related to peacock behavior and promotional tone. To do this, I'm using a TF-IDF approach, a well-known method to identify terms (keywords) that characterize a set of documents (a sketch follows this list).
  • This week and next are short for me (I'm taking several days off), so it might take a bit more time to finalize this.
  • I also communicated to my manager that there might be the possibility of building a product based on the fine-tuned model. If we decide to move forward, we would need to coordinate with her and the other teams involved on how to proceed.
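
One way to select such keywords with TF-IDF is sketched below. The two example documents are illustrative placeholders; in practice the vectorizer would be fitted on the peacock-tagged articles from the dataset, and this is not necessarily the exact procedure used.

```python
# Sketch of keyword selection with TF-IDF (illustrative corpus): rank the
# terms that carry the most weight across peacock-tagged articles.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholders: in practice, the texts of the peacock-tagged articles.
peacock_docs = [
    "An award-winning, world-class pioneer renowned for visionary leadership.",
    "The legendary, critically acclaimed band is considered the best of its era.",
]

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(peacock_docs)

# Average TF-IDF weight of each term across the tagged documents, then keep
# the highest-scoring terms as candidate peacock keywords.
mean_weights = np.asarray(tfidf.mean(axis=0)).ravel()
terms = np.array(vectorizer.get_feature_names_out())
top_keywords = terms[mean_weights.argsort()[::-1][:20]]
print(top_keywords)
```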

Any new metrics related to the hypothesis

  • No

Have there been any new lessons from the hypothesis?

  • No

Have there been any changes to the hypothesis scope or timeline?

  • No

Confirm if the hypothesis was supported or contradicted

The hypothesis was contradicted because the LLMs were not able to provide the expected precision and recall. With other AI-based models (i.e. smaller language models) we were close to reaching the target precision & recall, but the numbers were still below the thresholds.

Briefly describe what was accomplished over the course of the hypothesis work.

  • We tested two LLMs (AYA23 and Gemma) on a "peacock detection task", trying a set of prompts and learning strategies (zero-shot, few-shot).
  • Specifically, we tested on 522 examples (261 positive and 261 negative).
  • The best LLM result was AYA23 with a zero-shot approach, obtaining a precision of 0.54 and a recall of 0.24.
  • We saw that although the LLMs capture a signal (they work better than random), their precision & recall were low compared with simpler models such as smaller language models (BERT-like).
  • The baseline model, a fine-tuned BERT model, was tested with the same data and got a precision of 0.72 and a recall of 0.4, which is much better than the LLM approach but still below our target.

Major lessons

  • LLMs are a promising technology that makes it simple to build an AI-based model: with just a simple prompt, the LLM was able to capture a signal of "promotional language" in Wikipedia articles.
  • However, the precision and recall on this type of task are still not good enough to build a reliable product.
  • On the other hand, more established technologies, such as BERT, require more work to train a model but give better results.
  • In the future, when considering AI-based models for policy-violation detection, both approaches should be tested.
  • LLMs offer a lot of flexibility and can work in scenarios without much training data, and potentially one LLM could be used to cover several policies/tasks.
  • On the other hand, smaller models offer higher accuracy and need significantly fewer computational resources, but they require training data and may imply greater maintenance effort because they are task/policy specific.

Next Steps

This week I'm going to meet Peter Pelberg to discuss how we can test LLMs on other policies, and try to combine efforts with SDS1.2.1 (Use of AI in Wikimedia services for readers and contributors).

Aklapper renamed this task from [W.E.1.2.4] Detecting Peacock behavior with LLMs to [WE1.2.4] Detecting Peacock behavior with LLMs. Wed, Oct 2, 5:57 PM