
Evaluate the accuracy of 3 or more large language models on AI use-cases
Closed, Resolved · Public

Assigned To
Authored By
Miriam
Oct 17 2024, 10:08 AM

Description

Setup [2 weeks]

  • Define evaluation protocol with @Aroraakhil and create experiment book for each task
  • Based on sample data, estimate the volume of outgoing requests to Groq. This will help us estimate the budget.
  • Set up API calls
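The API-call setup can be sketched with the official `groq` Python client; the model id and prompt wording below are illustrative placeholders, not the protocol's final form:

```python
# Hedged sketch of one outgoing categorization request to Groq.
# The prompt text and model id are placeholders.
import os

def build_messages(article_title: str, candidates: list[str]) -> list[dict]:
    """Form the chat payload for one categorization request."""
    prompt = (
        f"Select the categories that apply to the article '{article_title}' "
        f"from this list: {', '.join(candidates)}. Answer with a JSON list."
    )
    return [{"role": "user", "content": prompt}]

def categorize(article_title: str, candidates: list[str]) -> str:
    """Send one request to Groq (requires GROQ_API_KEY and `pip install groq`)."""
    from groq import Groq
    client = Groq(api_key=os.environ["GROQ_API_KEY"])
    resp = client.chat.completions.create(
        model="llama3-70b-8192",  # placeholder model id
        messages=build_messages(article_title, candidates),
        temperature=0.0,
    )
    return resp.choices[0].message.content
```

Counting the messages produced per sample gives the request volume needed for the budget estimate.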

Full experiments [8 weeks]

NOTE: Update after setup phase is over.

Event Timeline

Status:

  • I have created the initial version of the experiment book for Article Categorization, available here. This document includes:
    • Definitions of strategies needed for model input formation, including category selection, content selection, and prompt types (along with corresponding prompt templates).
    • Each experiment will be defined as a combination of strategies for category selection, content selection, prompt type, and models.
  • I have tested the Groq platform to determine its limitations concerning the models to be tested and the prompting strategies.
  • I have created and tested the prompt templates in the Groq playground so that they work for both Llama and Mistral models.

Next Steps:

  • Reduce the number of strategies to fit within the project timelines.
  • Test the strategies on the pre-collected dataset sample to estimate the cost and time required to perform the experiments.
  • Begin implementing the experiment setup.
  • Analysed the collected data and communicated recommendations for improvement to @Aitolkyn.
  • Developed and implemented a process to create a dictionary mapping topics to categories using TF-IDF. This involved filtering out rare categories and selecting the top N categories based on non-zero TF-IDF scores.
  • Designed and implemented a draft of the LLMCategorizer class, which is initialized with configuration parameters for content, category, prompt, and model strategies. Initial testing was conducted on an extra-small dataset; initial prompts were improved to maximize Precision.
  • Implemented draft code for metric calculation to assess categorization performance (Precision (prioritized) and Recall).
  • Conducted preliminary estimates for Groq usage. Current limitations include the absence of few-shot learning, which could significantly lower performance estimates, and missing page content due to data unavailability.
    • Key Concern: At this stage, Groq seems to be a bottleneck. Even with all simplifications (no content, no few-shot learning), we are limited to processing approximately 1,100-1,600 samples per day (~1-2% of a full-configuration sample) at 20-30 requests per minute, at an approximate cost of $4-5 per 1,000 samples.
    • The key problem is the large set of potential categories (even after reduction). I think we need a more sophisticated first-level model (better than the current Topic + TF-IDF filter) to select a limited set of candidates with high recall before passing them to the LLM.
    • It is essential to investigate the possibility of increasing Groq's usage limits to overcome the current constraints.
  • All the mentioned code has been added to a separate branch on Gitlab.
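The TF-IDF topic-to-category mapping described above can be sketched as follows. This is a minimal self-contained sketch, not the branch code: each topic is treated as a "document" whose terms are the categories of its articles, rare categories are filtered out, and the top-N categories with non-zero TF-IDF scores are kept per topic. The `min_count` and `top_n` thresholds are assumptions.

```python
# Hedged sketch of the topic -> category dictionary built with TF-IDF.
import math
from collections import Counter

def topic_category_map(topic_to_categories: dict[str, list[str]],
                       min_count: int = 2,
                       top_n: int = 3) -> dict[str, list[str]]:
    n_topics = len(topic_to_categories)
    # document frequency: in how many topics each category appears
    df = Counter(c for cats in topic_to_categories.values() for c in set(cats))
    result = {}
    for topic, cats in topic_to_categories.items():
        tf = Counter(cats)
        scores = {
            cat: (count / len(cats)) * math.log(n_topics / df[cat])
            for cat, count in tf.items()
            if count >= min_count  # drop rare categories
        }
        ranked = sorted(scores, key=scores.get, reverse=True)
        # keep only categories with a non-zero TF-IDF score
        result[topic] = [c for c in ranked if scores[c] > 0][:top_n]
    return result
```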
Experimentation with Category Reduction Strategies for Prompt Generation

Notebook with experiment: Gitlab link

Importance

Reducing the number of candidate categories provided in prompts is crucial because certain topics can have an extensive number of possible categories, averaging 966 categories per topic. This leads to high token consumption, increased costs, and longer generation times, particularly for non-English languages.

Data used
  • Testing Set 1: Subset of 50 articles per language.
  • Testing Set 2: Subset of 50 articles per language, created after 01.01.2024.
Strategies
  • Baseline (Random): Random selection of a subsample of categories related to the article topic.
  • TF-IDF Reduced: Selection of the most important N categories from those related to the article topic.
  • ANN Reduced: Selection of the top N (fixed at 50) most semantically similar categories to the article title or title + main section text from those related to the article topic.
    • Encoders Used:
      • paraphrase-multilingual-MiniLM-L12-v2
      • LaBSE
      • Alibaba-NLP/gte-multilingual-base
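The ANN-reduced strategy boils down to ranking a topic's candidate categories by cosine similarity to the article embedding (title, or title + main section text). A minimal sketch with toy vectors follows; in the experiments the embeddings come from one of the encoders listed above, and an ANN index replaces the exact search for speed.

```python
# Hedged sketch of the ANN-style candidate reduction (exact top-N search;
# an ANN index approximates the argsort step in practice).
import numpy as np

def top_n_categories(article_vec: np.ndarray,
                     category_vecs: np.ndarray,
                     categories: list[str],
                     n: int = 50) -> list[str]:
    a = article_vec / np.linalg.norm(article_vec)
    c = category_vecs / np.linalg.norm(category_vecs, axis=1, keepdims=True)
    sims = c @ a                    # cosine similarity of each category to the article
    order = np.argsort(-sims)[:n]   # most similar first
    return [categories[i] for i in order]
```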
Summary

I don't see a significant difference between results on all articles and on only new articles (created after 01.01.2024).

I recommend proceeding with the ANN + Alibaba-NLP/gte-multilingual-base N=50 candidates selection strategy (using main section text and title). The benefits of using this approach include:

  1. Achieving high recall metrics with a much smaller candidate set, comparable to TF-IDF with N=300.
  2. Relatively fast performance with the ANN index.
  3. The encoder model can handle up to an 8K token context window, allowing for the processing of relatively long texts.

In case of resource limitations, the more lightweight model "paraphrase-multilingual-MiniLM-L12-v2" can be used as an alternative.

Comparison tables:
Testing set 1 (Random subset (50 per country)):
Configuration | Precision | Recall
All categories | 0.00844936 | 1
Random selection N=50 | 0.0132696 | 0.195271
Random selection N=300 | 0.00939894 | 0.68205
TF IDF limited N=300 | 0.011692 | 0.798875
TF IDF limited N=50 | 0.0200818 | 0.300298
ANN (title only) paraphrase-multilingual-MiniLM-L12-v2 N=50 | 0.0247396 | 0.408989
ANN (main section text + title) paraphrase-multilingual-MiniLM-L12-v2 N=50 | 0.0435727 | 0.643995
ANN (title only) LaBSE N=50 | 0.025357 | 0.430726
ANN (main section text + title) LaBSE N=50 | 0.0408537 | 0.613614
ANN (title only) Alibaba-NLP/gte-multilingual-base N=50 | 0.0336207 | 0.526573
ANN (main section text + title) Alibaba-NLP/gte-multilingual-base N=50 | 0.0508317 | 0.748485
Testing set 2 (Random subset (50 per country) + created after 01.01.2024):
Configuration | Precision | Recall
All categories | 0.00779762 | 1
Random selection N=50 | 0.0137391 | 0.217919
Random selection N=300 | 0.00885314 | 0.687462
TF IDF limited N=300 | 0.0113357 | 0.818436
TF IDF limited N=50 | 0.0215304 | 0.338283
ANN (title only) paraphrase-multilingual-MiniLM-L12-v2 N=50 | 0.023907 | 0.404777
ANN (main section text + title) paraphrase-multilingual-MiniLM-L12-v2 N=50 | 0.0403142 | 0.62498
ANN (title only) LaBSE N=50 | 0.0261926 | 0.447849
ANN (main section text + title) LaBSE N=50 | 0.0387954 | 0.598155
ANN (title only) Alibaba-NLP/gte-multilingual-base N=50 | 0.0320494 | 0.526546
ANN (main section text + title) Alibaba-NLP/gte-multilingual-base N=50 | 0.0480133 | 0.735601
Evaluation of Multiple Configurations Using Together AI
Estimated Spend
  • ~$30
Dataset
  • Sample of 50 random articles from each language (1150 samples in total)
Models Tested
  • meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
  • meta-llama/Meta-Llama-3-70B-Instruct-Turbo
  • meta-llama/Meta-Llama-3-70B-Instruct-Lite
Model Selection Logic

The goal was to run the full pipeline and compare "Turbo" vs. "Lite," and "Llama-3.1" vs. "Llama-3." Model selection was also constrained by the models available on Together AI.

Content Configurations
  • Only article name
  • Article name and first paragraph
Category Selection
  • ANN Top-50 categories
Prompt Strategies
  • Zero-shot
  • Zero-shot + structured output + chain-of-thought
  • Few-shot (random fixed examples) + structured output + chain-of-thought
  • Few-shot (similar examples) + structured output + chain-of-thought
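The few-shot (similar examples) strategy can be sketched as below: the k labelled articles most similar to the target are prepended as worked examples before the instruction. The wording and example format here are illustrative placeholders, not the exact templates used in the experiments.

```python
# Illustrative assembly of the "few-shot (similar examples)" prompt.
# similar_examples: (article text, ground-truth categories) pairs retrieved
# by semantic similarity to the target article.
def few_shot_prompt(article: str,
                    candidates: list[str],
                    similar_examples: list[tuple[str, list[str]]]) -> str:
    parts = ["Assign categories from the candidate list. Examples:"]
    for ex_text, ex_cats in similar_examples:
        parts.append(f"Article: {ex_text}\nCategories: {ex_cats}")
    parts.append(f"Candidates: {', '.join(candidates)}")
    parts.append(f"Article: {article}\nCategories:")
    return "\n\n".join(parts)
```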
Configurations Tested
  • 24 configurations in total
Summary
  • Llama-3.1 works better than Llama-3 as expected.
  • No significant difference between "Lite" and "Turbo" versions. I would even say that “Llama 3 Lite” performs comparably to “Llama 3.1 Turbo.”
  • Llama-3 tends to hallucinate more and had issues returning structured output (exp-011, exp-012).
  • Performance varies greatly by language, with more popular languages like English performing much better, as expected.
  • Configuration of few-shot with similar articles provided the best boost in performance across all the models.
Full Results:
experiment_code | experiment_name | precision | recall | empty_rate | is_none_rate | execution_time | mean total tokens
exp-001 | [Meta-Llama-3.1-70B-Instruct-Turbo]-[only article name]-[zero-shot] | 0.48058 | 0.447174 | 0 | 0 | 1.3341 | 714.853
exp-002 | [Meta-Llama-3.1-70B-Instruct-Turbo]-[only article name]-[zero-shot-structured] | 0.539008 | 0.460698 | 0.00521739 | 0.000869565 | 1.58867 | 844.508
exp-003 | [Meta-Llama-3.1-70B-Instruct-Turbo]-[only article name]-[few-shot-default] | 0.542442 | 0.451328 | 0.0130435 | 0.02 | 1.69794 | 1480.35
exp-004 | [Meta-Llama-3.1-70B-Instruct-Turbo]-[only article name]-[few-shot-similarity] | 0.634017 | 0.473804 | 0.013913 | 0.013913 | 1.63557 | 1837.32
exp-005 | [Meta-Llama-3.1-70B-Instruct-Turbo]-[article name and first paragraph]-[zero-shot] | 0.521804 | 0.58969 | 0 | 0 | 1.47246 | 908.49
exp-006 | [Meta-Llama-3.1-70B-Instruct-Turbo]-[article name and first paragraph]-[zero-shot-structured] | 0.570923 | 0.610226 | 0.00434783 | 0.0026087 | 1.85257 | 1042.16
exp-007 | [Meta-Llama-3.1-70B-Instruct-Turbo]-[article name and first paragraph]-[few-shot-default] | 0.569009 | 0.594368 | 0.00347826 | 0.0313043 | 2.21218 | 1687.47
exp-008 | [Meta-Llama-3.1-70B-Instruct-Turbo]-[article name and first paragraph]-[few-shot-similarity] | 0.659469 | 0.630417 | 0.000869565 | 0.0252174 | 1.79597 | 2040.21
exp-009 | [Meta-Llama-3-70B-Instruct-Turbo]-[only article name]-[zero-shot] | 0.430847 | 0.411476 | 0 | 0 | 1.34386 | 692.799
exp-010 | [Meta-Llama-3-70B-Instruct-Turbo]-[only article name]-[zero-shot-structured] | 0.460034 | 0.432091 | 0.00173913 | 0 | 1.60177 | 823.082
exp-011 | [Meta-Llama-3-70B-Instruct-Turbo]-[only article name]-[few-shot-default] | 0.264778 | 0.229246 | 0.000869565 | 0.513043 | 2.31409 | 1492.12
exp-012 | [Meta-Llama-3-70B-Instruct-Turbo]-[only article name]-[few-shot-similarity] | 0.410367 | 0.352549 | 0.0026087 | 0.283478 | 1.92204 | 1821.42
exp-013 | [Meta-Llama-3-70B-Instruct-Turbo]-[article name and first paragraph]-[zero-shot] | 0.511711 | 0.571109 | 0 | 0 | 1.3571 | 883.872
exp-014 | [Meta-Llama-3-70B-Instruct-Turbo]-[article name and first paragraph]-[zero-shot-structured] | 0.532088 | 0.59059 | 0.00173913 | 0.000869565 | 1.74164 | 1016.71
exp-015 | [Meta-Llama-3-70B-Instruct-Turbo]-[article name and first paragraph]-[few-shot-default] | 0.529103 | 0.580973 | 0.000869565 | 0.046087 | 2.13281 | 1656.26
exp-016 | [Meta-Llama-3-70B-Instruct-Turbo]-[article name and first paragraph]-[few-shot-similarity] | 0.597309 | 0.623062 | 0 | 0.0330435 | 1.93087 | 1995.15
exp-017 | [Meta-Llama-3-70B-Instruct-Lite]-[only article name]-[zero-shot] | 0.41294 | 0.32801 | 0 | 0 | 1.84153 | 688.277
exp-018 | [Meta-Llama-3-70B-Instruct-Lite]-[only article name]-[zero-shot-structured] | 0.442036 | 0.343769 | 0 | 0.000869565 | 2.57387 | 815.713
exp-019 | [Meta-Llama-3-70B-Instruct-Lite]-[only article name]-[few-shot-default] | 0.511276 | 0.374996 | 0 | 0.00782609 | 3.05508 | 1447.62
exp-020 | [Meta-Llama-3-70B-Instruct-Lite]-[only article name]-[few-shot-similarity] | 0.556569 | 0.418787 | 0.000869565 | 0.00782609 | 3.5631 | 1797.29
exp-021 | [Meta-Llama-3-70B-Instruct-Lite]-[article name and first paragraph]-[zero-shot] | 0.512755 | 0.509404 | 0 | 0 | 2.25824 | 879.958
exp-022 | [Meta-Llama-3-70B-Instruct-Lite]-[article name and first paragraph]-[zero-shot-structured] | 0.540082 | 0.519412 | 0.00173913 | 0 | 3.05351 | 1010.09
exp-023 | [Meta-Llama-3-70B-Instruct-Lite]-[article name and first paragraph]-[few-shot-default] | 0.548984 | 0.541114 | 0.00173913 | 0.00521739 | 3.82841 | 1648.1
exp-024 | [Meta-Llama-3-70B-Instruct-Lite]-[article name and first paragraph]-[few-shot-similarity] | 0.620822 | 0.587527 | 0 | 0.0121739 | 4.10757 | 1989.41
Precision per language and experiment:

(image: chart of precision per language and experiment)

Notebook with Full Experiment

Metrics Interpretation:

In our experiment, I use list-based metrics to evaluate performance: Precision and Recall.

  • Precision (main metric): This measures the accuracy of the predictions. It is the rate at which the predicted categories appear in the ground truth.

Precision = |predicted categories ∩ ground-truth categories| / |predicted categories|

  • Recall: This metric assesses the completeness of the predictions. It measures the rate at which the ground truth categories are correctly predicted.

Recall = |predicted categories ∩ ground-truth categories| / |ground-truth categories|

Note 1: Precision and Recall do not take into account the order of the list.
Note 2: If no categories are predicted, Precision is 0, which is one of the limitations of the metric (as predicting no category is better than predicting an incorrect one).
Note 3: We calculate Precision and Recall for each sample and take the average as the final metric for a specific configuration.

Example:

True categories: [Cat1, Cat2, Cat3, Cat4]
Predicted categories: [Cat1, Cat2, Cat5]

Precision: 2/3
Recall: 2/4
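A minimal implementation of these list-based metrics, matching the notes and the example above (a sketch, not the exact experiment code):

```python
# List-based Precision and Recall: order-insensitive, per-sample,
# with Precision = 0 when nothing is predicted (Note 2).
def precision_recall(true_cats: list[str], pred_cats: list[str]) -> tuple[float, float]:
    true_set, pred_set = set(true_cats), set(pred_cats)
    hits = len(true_set & pred_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    return precision, recall

def mean_metrics(samples: list[tuple[list[str], list[str]]]) -> tuple[float, float]:
    # average the per-sample metrics to get the configuration-level score (Note 3)
    pairs = [precision_recall(t, p) for t, p in samples]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n
```

For the example above, `precision_recall(["Cat1", "Cat2", "Cat3", "Cat4"], ["Cat1", "Cat2", "Cat5"])` returns `(2/3, 0.5)`.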

Experiments with Llama8B and Mixtral

Experiment setup:

  • I slightly changed the experiment setup to work with Mixtral models: I changed the type of message from “system” to “user.”
  • Reduced the experiment set, removing clearly inferior strategies based on previous observations (exp-001-024). I removed strategies that do not use the content of the page and kept only two prompting strategies: "Zero-shot + structured output + chain-of-thought" and "Few-shot (similar examples) + structured output + chain-of-thought."

Result table:

experiment_code | experiment_name | precision | recall | empty_rate | is_none_rate | n_predicted_mean | execution_time | mean total tokens
exp-025 | [Meta-Llama-3.1-8B-Instruct-Turbo]-[article name and first paragraph]-[zero-shot-structured] | 0.297199 | 0.476369 | 0 | 0.135652 | 5.38696 | 1.32449 | 1102.87
exp-026 | [Meta-Llama-3.1-8B-Instruct-Turbo]-[article name and first paragraph]-[few-shot-similarity] | 0.19281 | 0.204779 | 0 | 0.66087 | 1.27913 | 1.82447 | 2138.84
exp-027 | [Mixtral-8x7B-Instruct-v0.1]-[article name and first paragraph]-[zero-shot-structured] | 0.430321 | 0.526573 | 0 | 0.0417391 | 4.76609 | 1.35212 | 1365.32
exp-028 | [Mixtral-8x7B-Instruct-v0.1]-[article name and first paragraph]-[few-shot-similarity] | 0.460515 | 0.480944 | 0 | 0.174783 | 3.29043 | 1.55475 | 2930.17
exp-029 | [Mistral-7B-Instruct-v0.3]-[article name and first paragraph]-[zero-shot-structured] | 0.366119 | 0.484742 | 0 | 0.0565217 | 5.19304 | 1.81172 | 1389.83
exp-030 | [Mistral-7B-Instruct-v0.3]-[article name and first paragraph]-[few-shot-similarity] | 0.498572 | 0.549103 | 0.000869565 | 0.0321739 | 4.05913 | 1.87967 | 2920.75

Summary:

  1. Both the small Llama8B and the Mixtral/Mistral models work considerably worse than Llama 70B.
  2. Llama8B and Mixtral models have issues following structured output, resulting in a high is_none_rate and low precision and recall.
  3. No big difference between Mixtral-8x7B and Mistral-7B.

Experiments with forcing structured output

Considering the issues with structured output, it was also decided to test the strategy of forcing structured output using the functionality provided by Together AI. It is available only for a limited number of models, but it still allows evaluation of its effectiveness.
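The forced-structure setup can be sketched as below: a JSON schema constrains the response, and a small parser guards against the invalid outputs counted in is_none_rate. The schema fields are illustrative, not the exact schema used in the experiments, and the API call assumes Together AI's JSON-mode `response_format` parameter.

```python
# Hedged sketch of forcing structured output via Together AI's JSON mode.
import json

CATEGORY_SCHEMA = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "categories": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["categories"],
}

def parse_categories(raw: str):
    """Return the predicted category list, or None on a parsing failure."""
    try:
        data = json.loads(raw)
        return [str(c) for c in data["categories"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None

def ask_structured(client, model: str, messages: list[dict]):
    # `client` is a together.Together instance; requires TOGETHER_API_KEY.
    resp = client.chat.completions.create(
        model=model,
        messages=messages,
        response_format={"type": "json_object", "schema": CATEGORY_SCHEMA},
    )
    return parse_categories(resp.choices[0].message.content)
```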

Result table:

experiment_code | experiment_name | precision | recall | empty_rate | is_none_rate | n_predicted_mean | execution_time | mean total tokens
exp-031 | [forced-structure][Meta-Llama-3.1-8B-Instruct-Turbo]-[article name and first paragraph]-[zero-shot-structured] | 0.344439 | 0.529059 | 0 | 0.0121739 | 5.86522 | 3.50642 | 1078.41
exp-032 | [forced-structure][Meta-Llama-3.1-8B-Instruct-Turbo]-[article name and first paragraph]-[few-shot-similarity] | 0.54509 | 0.555511 | 0 | 0.0156522 | 3.6887 | 2.51734 | 2043.55
exp-033 | [forced-structure][Mixtral-8x7B-Instruct-v0.1]-[article name and first paragraph]-[zero-shot-structured] | 0.405656 | 0.440242 | 0 | 0.201739 | 3.91478 | 6.07765 | 1384.24
exp-034 | [forced-structure][Mixtral-8x7B-Instruct-v0.1]-[article name and first paragraph]-[few-shot-similarity] | 0.497705 | 0.470836 | 0.00434783 | 0.198261 | 3.13826 | 4.60285 | 2934.69
exp-035 | [forced-structure][Meta-Llama-3.1-70B-Instruct-Turbo]-[article name and first paragraph]-[zero-shot-structured] | 0.576447 | 0.608135 | 0.00434783 | 0.00347826 | 3.80609 | 2.62486 | 1044.58
exp-036 | [forced-structure][Meta-Llama-3.1-70B-Instruct-Turbo]-[article name and first paragraph]-[few-shot-similarity] | 0.67069 | 0.626732 | 0.0113043 | 0.0121739 | 3.34087 | 2.76754 | 2036.37

Summary:

  1. Forcing structure improved the results for Llama-3.1-8B a lot, but it is still much worse than Llama-3.1-70B.
  2. Forcing structure for Mixtral-8x7B has changed the behavior but not the metrics. The model started to work very poorly for specific non-latin languages (e.g., jawiki, kowiki, zhwiki).
  3. As for Llama-3.1-70B, the result only slightly improved (almost no difference).
Peacock detection (initial experiments)

Task: Binary Classification
Metrics: Accuracy (balanced dataset), Precision, Recall, F1
Data Sample: Initially took a sample of 50 random page_ids from each language (1222 revisions). Text limited to 20K chars (covers >90% of articles).

Models Tested:

  • Meta-Llama-3.1-70B-Instruct-Turbo
  • Meta-Llama-3.1-70B-Instruct-Lite
  • Meta-Llama-3.1-8B-Instruct-Turbo
  • Mistral-7B-Instruct-v0.3
  • Mixtral-8x7B-Instruct-v0.1

Prompts Tested:

  • Simple
  • Extended (added explanation for peacock patterns)

Summary:

  • Overall performance of all models is quite low.
  • Adding additional information about peacock template logic does not improve performance.
  • Performance is similar across all languages.
  • Meta-Llama-3.1-70B-Instruct-Turbo does not differ significantly from Meta-Llama-3.1-70B-Instruct-Lite. All other models perform much worse.

Code: Gitlab link

Results:

accuracy | precision | recall | f1 | is_na_rate | experiment_name | experiment_code
0.56874 | 0.664062 | 0.278232 | 0.392157 | 0 | [Meta-Llama-3.1-70B-Instruct-Turbo]-[simple] | exp-100
0.540098 | 0.696 | 0.14239 | 0.236413 | 0 | [Meta-Llama-3.1-70B-Instruct-Turbo]-[extended] | exp-101
0.569558 | 0.606516 | 0.396072 | 0.479208 | 0.0204583 | [Meta-Llama-3-70B-Instruct-Lite]-[simple] | exp-102
0.508183 | 0.504174 | 0.988543 | 0.667772 | 0 | [Meta-Llama-3.1-8B-Instruct-Turbo]-[simple] | exp-103
0.536007 | 0.531339 | 0.610475 | 0.568165 | 0 | [Mistral-7B-Instruct-v0.3]-[simple] | exp-104
0.522913 | 0.648936 | 0.0998363 | 0.17305 | 0 | [Mixtral-8x7B-Instruct-v0.1]-[simple] | exp-105
0.52455 | 0.538265 | 0.345336 | 0.420738 | 0.040098 | [aya-expanse-8B]-[simple] | exp-106

Accuracy per language:

ruwiki | frwiki | zhwiki | arwiki | enwiki | dewiki | eswiki | jawiki | nlwiki | ptwiki | experiment_name | experiment_code
0.590909 | 0.576923 | 0.552632 | 0.58871 | 0.588235 | 0.529762 | 0.525 | 0.55102 | 0.616071 | 0.6 | [Meta-Llama-3.1-70B-Instruct-Turbo]-[simple] | exp-100
0.545455 | 0.519231 | 0.526316 | 0.540323 | 0.558824 | 0.535714 | 0.5125 | 0.520408 | 0.5625 | 0.584615 | [Meta-Llama-3.1-70B-Instruct-Turbo]-[extended] | exp-101
0.572727 | 0.548077 | 0.508772 | 0.645161 | 0.588235 | 0.553571 | 0.525 | 0.612245 | 0.598214 | 0.569231 | [Meta-Llama-3-70B-Instruct-Lite]-[simple] | exp-102
0.5 | 0.509615 | 0.526316 | 0.5 | 0.509804 | 0.52381 | 0.49375 | 0.5 | 0.5 | 0.515385 | [Meta-Llama-3.1-8B-Instruct-Turbo]-[simple] | exp-103
0.554545 | 0.557692 | 0.526316 | 0.516129 | 0.519608 | 0.547619 | 0.51875 | 0.520408 | 0.580357 | 0.523077 | [Mistral-7B-Instruct-v0.3]-[simple] | exp-104
0.5 | 0.509615 | 0.526316 | 0.629032 | 0.509804 | 0.505952 | 0.50625 | 0.5 | 0.517857 | 0.523077 | [Mixtral-8x7B-Instruct-v0.1]-[simple] | exp-105

NPOV violation detection (initial experiments)

Task: Binary Classification
Metrics: Accuracy (balanced dataset), Precision, Recall, F1
Data Sample: Initially took a sample of 30 random page_ids from each language (1340 revisions). Text limited to 30K chars (covers >90% of articles).

Models Tested:

  • Meta-Llama-3.1-70B-Instruct-Turbo
  • Meta-Llama-3.1-70B-Instruct-Lite
  • Meta-Llama-3.1-8B-Instruct-Turbo
  • Mistral-7B-Instruct-v0.3
  • Mixtral-8x7B-Instruct-v0.1

Prompts Tested:

  • Simple (only one prompt tested)

Summary:

  • Overall performance of all models is quite low (close to random)
  • Performance is similar across all languages.
  • No significant difference between models.

Code: Gitlab link

Results:

accuracy | precision | recall | f1 | is_na_rate | experiment_name | experiment_code
0.539552 | 0.591696 | 0.255224 | 0.356621 | 0 | [Meta-Llama-3.1-70B-Instruct-Turbo]-[simple] | exp-200
0.529104 | 0.582278 | 0.20597 | 0.3043 | 0.190299 | [Meta-Llama-3-70B-Instruct-Lite]-[simple] | exp-201
0.530597 | 0.521739 | 0.734328 | 0.610043 | 0 | [Meta-Llama-3.1-8B-Instruct-Turbo]-[simple] | exp-202
0.517164 | 0.509804 | 0.892537 | 0.648942 | 0.0134328 | [Mistral-7B-Instruct-v0.3]-[simple] | exp-203
0.529104 | 0.523242 | 0.655224 | 0.581842 | 0.0134328 | [Mixtral-8x7B-Instruct-v0.1]-[simple] | exp-204
0.501493 | 0.513158 | 0.058209 | 0.104558 | 0.077612 | [aya-expanse-8B]-[simple] | exp-205

Accuracy per language:

eswiki | nlwiki | zhwiki | dewiki | ruwiki | arwiki | frwiki | jawiki | ukwiki | enwiki | fawiki | kowiki | trwiki | elwiki | itwiki | cswiki | ptwiki | experiment_name | experiment_code
0.609375 | 0.527778 | 0.571429 | 0.5 | 0.5 | 0.555556 | 0.474359 | 0.545455 | 0.531915 | 0.512821 | 0.583333 | 0.510638 | 0.5625 | 0.534091 | 0.533333 | 0.552632 | 0.5625 | [Meta-Llama-3.1-70B-Instruct-Turbo]-[simple] | exp-200
0.546875 | 0.541667 | 0.571429 | 0.516129 | 0.514706 | 0.555556 | 0.538462 | 0.515152 | 0.489362 | 0.525641 | 0.52381 | 0.489362 | 0.53125 | 0.488636 | 0.516667 | 0.605263 | 0.53125 | [Meta-Llama-3-70B-Instruct-Lite]-[simple] | exp-201
0.5 | 0.513889 | 0.528571 | 0.532258 | 0.544118 | 0.579365 | 0.512821 | 0.590909 | 0.478723 | 0.525641 | 0.547619 | 0.478723 | 0.53125 | 0.511364 | 0.516667 | 0.513158 | 0.625 | [Meta-Llama-3.1-8B-Instruct-Turbo]-[simple] | exp-202
0.484375 | 0.541667 | 0.514286 | 0.467742 | 0.514706 | 0.507937 | 0.538462 | 0.530303 | 0.489362 | 0.487179 | 0.52381 | 0.542553 | 0.489583 | 0.5 | 0.566667 | 0.513158 | 0.609375 | [Mistral-7B-Instruct-v0.3]-[simple] | exp-203
0.484375 | 0.513889 | 0.528571 | 0.516129 | 0.558824 | 0.579365 | 0.538462 | 0.515152 | 0.531915 | 0.512821 | 0.511905 | 0.542553 | 0.5625 | 0.431818 | 0.5 | 0.513158 | 0.625 | [Mixtral-8x7B-Instruct-v0.1]-[simple] | exp-204

Experiments with Aya Model

Task: Binary Classification (NPOV, Peacock detection)
Metrics: Accuracy (balanced dataset), Precision, Recall, F1
Data Sample: Same as for exp-100 (-105), exp-200 (-204).

Summary:

  • The Aya model struggled significantly in providing structured output, necessitating prompt adaptation. Despite these adjustments, the rate of invalid results remains high.
  • The local inference endpoint proved to be reliable and relatively fast.
  • The performance of the Aya model is notably lower compared to the LLaMA and Mixtral models under the current setup.
  • Model performance is sensitive to minor prompt changes.

Code: Gitlab link

Results Table:

(additionally updated previous tables for consistency)

accuracy | precision | recall | f1 | is_na_rate | experiment_name | experiment_code | task
0.501493 | 0.513158 | 0.058209 | 0.104558 | 0.077612 | [aya-expanse-8B]-[simple] | exp-205 | npov
0.52455 | 0.538265 | 0.345336 | 0.420738 | 0.040098 | [aya-expanse-8B]-[simple] | exp-106 | peacock

Full data experiments for NPOV and Peacock templates:

Summary:

  • Updated (improved) the prompts based on what was learned from the Aya case. Tested on a sample to validate that they perform slightly better.
  • Performed evaluation on the full testing data for two templates and four models. Meta-Llama-3.1-70B-Instruct-Turbo was skipped for now due to issues with the LLM provider.
  • The code to complete experiments (including the specific prompt) was loaded to repo: for npov, for peacock
  • Metrics calculation, all plots, and full analysis is presented here: for npov, for peacock
  • To reproduce the prediction for a specific model (e.g., meta-llama/Meta-Llama-3-70B-Instruct-Lite in this case), run:
export TOGETHER_AI_KEY="<key>"

python 09_peacock_script.py --model "meta-llama/Meta-Llama-3-70B-Instruct-Lite" --experiment "llama370B" --input "/srv/home/aitolkyn/ai-use-cases/data/peacock_test.csv.gz"

python 10_npov_script.py --model "meta-llama/Meta-Llama-3-70B-Instruct-Lite" --experiment "llama370B" --input "/srv/home/aitolkyn/ai-use-cases/data/npov_test.csv.gz"

Results for NPOV:

model | accuracy | f1 | precision | recall | false_positive_rate | false_negative_rate | is_nan_rate*
Meta-Llama-3.1-8B-Instruct-Turbo | 0.527791 | 0.566181 | 0.485448 | 0.679123 | 0.597907 | 0.320877 | 0.105125
Meta-Llama-3.1-70B-Instruct-Turbo | 0.583511 | 0.549685 | 0.598275 | 0.508395 | 0.341374 | 0.491605 | 2.95613e-05
Meta-Llama-3-70B-Instruct-Lite | 0.556758 | 0.400863 | 0.618343 | 0.296559 | 0.183044 | 0.703441 | 0.0210477
Llama-3-70b-chat-hf | 0.56601 | 0.414283 | 0.636977 | 0.306965 | 0.174944 | 0.693035 | 0.0210477
Mistral-7B-Instruct-v0.3 | 0.525275 | 0.365633 | 0.550887 | 0.273619 | 0.22307 | 0.726381 | 0

Results for Peacock:

model | accuracy | f1 | precision | recall | false_positive_rate | false_negative_rate | is_nan_rate*
Meta-Llama-3.1-8B-Instruct-Turbo | 0.560594 | 0.656071 | 0.538961 | 0.838203 | 0.717016 | 0.161797 | 0
Meta-Llama-3.1-70B-Instruct-Turbo | 0.606441 | 0.550171 | 0.641956 | 0.481349 | 0.268467 | 0.518651 | 0
Meta-Llama-3-70B-Instruct-Lite | 0.593164 | 0.559972 | 0.609717 | 0.517732 | 0.331404 | 0.482268 | 0.0114388
Llama-3-70b-chat-hf | 0.591878 | 0.488013 | 0.654607 | 0.389011 | 0.205255 | 0.610989 | 0.0114388
Mistral-7B-Instruct-v0.3 | 0.574283 | 0.623522 | 0.558881 | 0.705072 | 0.556505 | 0.294928 | 0

*is_nan_rate refers to cases where we could not get a prediction, for example because of a response in an unexpected format (parsing error) or an API error. In those cases, the prediction was filled with a "0" label.

Experiments Finalization for NPOV and Peacock Templates

  • Conducted full-scale experiments using Meta-Llama-3.1-70B-Instruct-Turbo and integrated the results into the general performance comparison table (see previous comment for full comparison).
  • Performed temporal performance analysis to compare the models' effectiveness on data created before and after January 1, 2024.
  • Implemented confidence interval calculations for evaluation metrics and updated the respective notebooks.
  1. I have updated the final report document by incorporating detailed information about the experiments with LLMs. Specifically, I have provided technical details, code links, results, and summaries for the Peacock tone, NPOV violation, and article categorization tasks.
  2. I have been exploring a strategy that could enable the extraction of probability scores using the logprobs of prompt tokens feature of large language models (LLMs), as opposed to the current binary prediction approach.
  3. I started working on running full experiments for the Citation needed template using an approach similar to Peacock tone and NPOV violation detection.

Citation Needed experiments

  • Built the pipeline that allows extraction of the probabilities for each class. It needs additional checking for each model due to tokenization differences.
  • The results obtained are slightly better than random, with an AUC score of ~0.6.
  • The results don't differ much between the models, with Meta-Llama-3-70B-Instruct-Lite being the best.
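The probability-extraction step can be sketched as follows: instead of a hard binary prediction, take the logprob the model assigns to each class-label token (after checking how each tokenizer splits the labels, per the first bullet) and normalize. The "0"/"1" label tokens here are illustrative.

```python
# Hedged sketch of turning label-token logprobs into a class probability,
# which is what makes the AUC/PR-AUC scores above computable.
import math

def class_probability(label_logprobs: dict[str, float], positive: str = "1") -> float:
    """P(positive class) from per-label logprobs via softmax-style normalization."""
    weights = {label: math.exp(lp) for label, lp in label_logprobs.items()}
    return weights[positive] / sum(weights.values())
```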

Final results:

accuracy | precision | recall | f1 | is_na_rate | roc_auc_score | max_f1 | threshold | pr_auc | experiment_code | experiment_name
0.563654 | 0.543869 | 0.789147 | 0.643942 | 0 | 0.610334 | 0.669784 | 0.000199007 | 0.60645 | exp-666 | [Llama-3-70b-chat-hf]
0.574117 | 0.579813 | 0.538431 | 0.558356 | 0 | 0.605468 | 0.666862 | 0.0082676 | 0.599305 | exp-667 | [Meta-Llama-3.1-70B-Instruct-Turbo]
0.56795 | 0.547182 | 0.788033 | 0.645885 | 0 | 0.622919 | 0.668105 | 0.00168998 | 0.623246 | exp-668 | [Meta-Llama-3-70B-Instruct-Lite]
0.567234 | 0.567481 | 0.565404 | 0.566441 | 0 | 0.598332 | 0.668027 | 0.049589 | 0.590752 | exp-669 | [Meta-Llama-3.1-8B-Instruct-Turbo]
0.519295 | 0.511143 | 0.885105 | 0.648044 | 0 | 0.558624 | 0.666755 | 0.000296536 | 0.543625 | exp-670 | [Mistral-7B-Instruct-v0.3]

Code: link

Moving to backlog until Diego is back and can confirm that this work has been wrapped up and should be resolved.

Closing out - this work is complete.