
Semantic Search POC - In article QA
Open · Needs Triage · Public

Authored By: OKarakaya-WMF
Sep 23 2025, 2:06 PM

Description

FY25-26 WE3.1.6: If we produce a prototype for in-article Q&A, delivered as a demo interface, then the Reader teams will be able to qualitatively evaluate how the approach performs across different user journeys and surface gaps or opportunities for further iteration.

  • Start with a small dataset of 10 articles (2 per quality class):
  • Generate questions/answers using at least two LLMs. Answers are only for evaluation and for making sure the questions are relevant.
  • Develop a ranking strategy.
  • Pick at most the top 5 questions/answers per article.
  • Develop a strategy for correctness checks/evaluation and run it on the dataset.
  • Iterate on the prompts based on the small-dataset results. Human annotation could be useful.
  • Enlarge the experiment to a large dataset: a stratified random sample of 500 articles from English Wikipedia:
  • The sampling method should account for:
    1. content length diversity
    2. topic diversity
    3. content age diversity
    4. content quality diversity
  • Generate questions on the larger dataset using the LLMs selected in the previous iteration.
  • Share scores.
  • Prototype interface:
  • Allows people to select an article from a predefined list (shown as a dropdown menu, a search bar with auto-complete, or something else)
  • Fetches the top 3 questions (ranked and filtered) for each article
  • Shows answers from semantic search. We have switched to adding a button to the prototype UI for consistency.
  • Displays questions and results in a table
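The "top 3 questions per article, ranked and filtered" lookup behind the prototype can be sketched as below. This is a minimal sketch with hypothetical column names (`title`, `question`, `answer`, `score`); the real app is a Streamlit UI backed by a CSV kept in the GitLab registry.

```python
import pandas as pd

def top_questions(qa: pd.DataFrame, title: str, n: int = 3) -> pd.DataFrame:
    """Return the n highest-ranked questions for one article."""
    return (
        qa[qa["title"] == title]
        .sort_values("score", ascending=False)  # ranking step
        .head(n)[["question", "answer", "score"]]
        .reset_index(drop=True)
    )

# In the Streamlit app this result would feed st.selectbox / st.table.
```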

Event Timeline

Results for both gpt-oss:20b and aya-expanse:32b are available in the spreadsheet.

I've updated the checks to a rubric-based approach in order to:

  • Get better insights from the generated Q&A
  • Compare models from multiple perspectives.

The rubric has the following structure:

{
    "question_quality": {
        "question_clarity": {
            "type": "integer",
            "description": "Is the question grammatically correct, clear, and unambiguous? (1=unclear, 10=very clear)"
        },
        "question_relevance_to_title": {
            "type": "integer",
            "description": "Does the question directly mention the title? (1=unrelated, 10=directly relevant)"
        },
        "question_relevance_to_content": {
            "type": "integer",
            "description": "Does the question directly relate to the provided Wikipedia content? (1=unrelated, 10=directly relevant)"
        },
        "question_specificity": {
            "type": "integer",
            "description": "The item must contain exactly one clear question; multiple questions are not allowed. (1=vague, 10=precise)"
        },
        "curiosity": {
            "type": "integer",
            "description": "Does the question make the reader curious to learn more about the article? (1=trivial, 10=curious)"
        }
    },
    "answer_quality": {
        "answer_correctness": {
            "type": "integer",
            "description": "Is the answer factually accurate according to the passage? (1=incorrect, 10=fully correct)"
        },
        "answer_alignment": {
            "type": "integer",
            "description": "Does the answer directly and naturally respond to the question? (1=mismatched, 10=aligned)"
        },
        "answer_relevance_to_content": {
            "type": "integer",
            "description": "Is the answer directly mentioned in the provided Wikipedia content? (1=unrelated, 10=directly relevant)"
        }
    },
    "overall_quality": {
        "overall_alignment": {
            "type": "integer",
            "description": "Does the answer directly and naturally respond to the question according to the provided Wikipedia content? (1=mismatched, 10=aligned)"
        },
        "overall_usefulness": {
            "type": "integer",
            "description": "Would this Q&A pair help someone learn or test knowledge about the article? (1=not useful, 10=very useful)"
        }
    },
    "rationale": {
        "type": "string",
        "description": "Brief justification for the scores."
    }
}
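For reference, the nested rubric above can be flattened into (criterion, description) pairs when building the judge prompt. This is a sketch with a hypothetical helper name; the actual prompt construction is not shown in this task.

```python
def rubric_criteria(rubric: dict) -> list[tuple[str, str]]:
    """Collect (criterion_name, description) pairs from the nested rubric."""
    pairs = []
    for key, value in rubric.items():
        if isinstance(value, dict) and "description" in value and "type" in value:
            # Leaf criterion: has a type and a description.
            pairs.append((key, value["description"]))
        elif isinstance(value, dict):
            # Group node (e.g. question_quality): recurse into it.
            pairs.extend(rubric_criteria(value))
    return pairs
```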

For each rubric item, each (title, question, answer, content) tuple gets a score between 1 and 10.
I tuned the rubric prompt on a small dataset (see the semantic_search_poc_qa_eval_rubric_test tab in the spreadsheet) by annotating the expected low scores.

Then I first ran it on a model that I expected to get a low overall score (llama3.2):

  • Overall results:
eval_model_name: gpt-oss:120b-cloud
model_name: aya-expanse:32b: 9.236666666666668
model_name: gpt-oss:20b: 9.2
model_name: llama3.2: 8.602857142857143
  • Detailed results:
model_name: gpt-oss:20b
eval_model_name: gpt-oss:120b-cloud
question_clarity                 9.783333
question_relevance_to_title      9.150000
question_relevance_to_content    9.916667
question_specificity             9.983333
curiosity                        6.333333
answer_correctness               9.533333
answer_alignment                 9.600000
answer_relevance_to_content      9.866667
overall_alignment                9.600000
overall_usefulness               8.233333
all: 9.2


model_name: aya-expanse:32b
eval_model_name: gpt-oss:120b-cloud
question_clarity                 9.866667
question_relevance_to_title      9.283333
question_relevance_to_content    9.800000
question_specificity             9.950000
curiosity                        6.333333
answer_correctness               9.583333
answer_alignment                 9.783333
answer_relevance_to_content      9.700000
overall_alignment                9.633333
overall_usefulness               8.433333
all: 9.236666666666668


model_name: llama3.2
eval_model_name: gpt-oss:120b-cloud
question_clarity                 9.428571
question_relevance_to_title      8.800000
question_relevance_to_content    9.514286
question_specificity             9.857143
curiosity                        6.171429
answer_correctness               8.514286
answer_alignment                 8.885714
answer_relevance_to_content      8.800000
overall_alignment                8.657143
overall_usefulness               7.400000
all: 8.602857142857143
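The per-metric means and the "all" value in the tables above can be reproduced from the raw judge output roughly as follows — a sketch assuming a hypothetical frame layout with one row per judged Q&A pair and one numeric column per rubric criterion; "all" is taken here as the mean of the per-criterion means, which matches the reported numbers.

```python
import pandas as pd

def summarize(scores: pd.DataFrame) -> pd.Series:
    """Mean per criterion, plus the grand mean reported as 'all'."""
    means = scores.mean(numeric_only=True)
    means["all"] = means.mean()  # mean of the per-criterion means
    return means
```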

The scores are deterministic. I've tested this with model_name: aya-expanse:32b, eval_model_name: gpt-oss:20b.

We get similar scores for gpt-oss:20b and aya-expanse:32b, while llama3.2 is clearly lower.
I'll look further into the items with low scores.

Looking at the question-related scores, we generally get low scores in question_relevance_to_title and curiosity.

I'll update the prompts and try again.

image.png (841×834 px, 174 KB)

image.png (837×813 px, 167 KB)

Question quality average scores:
gpt question_quality_average: 9.0
aya question_quality_average: 9.6

I've updated the prompt based on the previous scores.

question_relevance_to_title has increased, although curiosity is even lower now.
I'll continue with creating the larger dataset and generating questions on it. This should give us more stable evaluation results.

image.png (781×871 px, 140 KB)

I've started a Toolforge app.
This is a Streamlit app; we keep the data in the GitLab registry.

Earlier, we installed an ollama instance on ml-lab and generated the questions using the AMD GPU.

image.png (1×786 px, 142 KB)

Based on the documentation

ssh ozge@login.toolforge.org
become semantic-search-qa
toolforge webservice stop
toolforge build start https://gitlab.wikimedia.org/toolforge-repos/semantic-search-qa --ref main
toolforge webservice buildservice start --mount all
toolforge webservice buildservice logs

Files were moved to the GitLab registry from:

scp ozge@ml-lab1002.eqiad.wmnet:/home/ozge/repos/wiki/gitlab/exploratory-notebook/semantic_search_poc/src/data/semantic_search_poc_qa_aya-expanse:32b.csv .

The LLM evaluation could be a bit more systematic. Currently gpt-oss is a 20B-parameter model, which we are comparing with Aya Expanse 32B, and Aya Expanse is an older model too. The prompts need to be versioned to see which is better. I would focus on figuring out a prompt and an LLM parameter range that gives reasonable scores, and keep the LLM choice flexible, since LLMs are a moving target and there are many practical considerations in picking one.

Sharing the results for the larger dataset below.
I used the same model for evaluation and for query generation due to the limits on the cloud models.

image.png (812×789 px, 154 KB)

image.png (785×826 px, 153 KB)

Hello @santhosh,

Thank you for the comments.
We can run the experiments on larger LLMs. I've checked that we can use some larger models (tested with gpt-oss:120b and llama4:maverick) on ml-lab.
I'll check further some public benchmarks, and see if we can re-run the experiments on a different set of LLMs.
I'll revisit the evaluation part to see if we can do it better with minimal human effort.

  • I've checked several benchmarks related to QA generation:

MMLU Helm:
https://crfm.stanford.edu/helm/mmlu/latest/#/leaderboard

livebench:
https://livebench.ai/#/

vellum:
https://www.vellum.ai/llm-leaderboard?utm_source=google&utm_medium=organic

artificialanalysis.ai:
https://artificialanalysis.ai/leaderboards/models?open_weights=open_source

huggingface_lmarena:
https://huggingface.co/spaces/lmarena-ai/lmarena-leaderboard

scale:
https://scale.com/leaderboard/mask

scale:
https://scale.com/leaderboard/humanitys_last_exam

huggingface MMLU:
https://huggingface.co/spaces/StarscreamDeceptions/Multilingual-MMLU-Benchmark-Leaderboard

  • The following open-weight models appear near the top of most of them:

aya:35b https://ollama.com/library/aya
gpt-oss:120b https://ollama.com/library/gpt-oss
deepseek-v3.1:671b: https://ollama.com/library/deepseek-v3.1
qwen3:235b: https://ollama.com/library/qwen3
llama4:maverick: https://ollama.com/library/llama4:maverick

  • I'll pick 3 of them for the sake of time and generate questions.
  • Evaluation takes less time than generating questions, so I think I'll evaluate the models with all of them (3 models in total) and calculate averages.

I've split the models below into two groups:

models for question generation:
aya:35b
gpt-oss:120b
deepseek-v3.1:671b

models for evaluation:
qwen3:235b
llama4:maverick

I'm curious to see whether we will get similar evaluation results from qwen3 and llama4.

I've updated the evaluation rubric as follows to make it more question-focused.

{
    "question_quality": {
        "question_clarity": {
            "type": "integer",
            "description": "Is the question grammatically correct, clear, and unambiguous? (1=unclear, 10=very clear)"
        },
        "question_relevance_to_title": {
            "type": "integer",
            "description": "Does the question directly mention the title? (1=unrelated, 10=directly relevant)"
        },
        "question_relevance_to_content": {
            "type": "integer",
            "description": "Does the question directly relate to the provided Wikipedia content? (1=unrelated, 10=directly relevant)"
        },
        "question_specificity": {
            "type": "integer",
            "description": "The item must contain exactly one clear question; multiple questions are not allowed. (1=vague, 10=precise)"
        },
        "curiosity": {
            "type": "integer",
            "description": "Does the question make the reader curious to learn more about the article? (1=trivial, 10=curious)"
        },
        "answer_relevance": {
            "type": "integer",
            "description": "Does the question directly match the answer, considering the given content? (1=unrelated, 10=directly relevant)"
        },
        "context_relevance": {
            "type": "integer",
            "description": "Can we directly extract the question from the given content? (1=unrelated, 10=directly relevant)"
        },
        "overall_usefulness": {
            "type": "integer",
            "description": "Would this Q&A pair help someone learn or test knowledge about the article? (1=not useful, 10=very useful)"
        },
        "overall_alignment": {
            "type": "integer",
            "description": "Does the answer directly and naturally respond to the question according to the provided Wikipedia content? (1=mismatched, 10=aligned)"
        }
    },
    "rationale": {
        "type": "string",
        "description": "Brief justification for the scores."
    }
}

I've added questions from the two large models (gpt-oss:120b, aya:35b) into the prototype UI.
Overall evaluation is in progress.

Unfortunately, we cannot run the following models on the ml-lab instances due to GPU limitations:
deepseek-v3.1:671b
qwen3:235b
llama4:maverick

I'm sharing final evaluation results for this phase:

We evaluated two models ("aya:35b", "gpt-oss:120b"). We cross-matched the eval models and calculated averages.
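The cross-matching can be sketched as follows: every generator model is scored by every judge model, and scores are averaged over judges so no single judge's bias decides the comparison. The `evaluate()` interface here is hypothetical.

```python
from itertools import product
from statistics import mean

def cross_evaluate(generators, judges, evaluate):
    """evaluate(gen, judge) -> score; return the mean score per generator."""
    results = {gen: [] for gen in generators}
    for gen, judge in product(generators, judges):
        # Each generator is judged by each eval model.
        results[gen].append(evaluate(gen, judge))
    return {gen: mean(scores) for gen, scores in results.items()}
```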

Quantitatively, gpt-oss:120b gets higher average scores: it scores higher on all metrics except question_relevance_to_title and curiosity.
Qualitatively, aya:35b tends to generate more general questions that anybody could ask, while gpt-oss:120b generates more detailed questions that may require prior knowledge of the topic. gpt-oss:120b's questions are also longer than aya:35b's.

We see a correlation between question quality and article quality: question quality increases as article quality increases.
This can help us pick a subset of articles for production.
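The question-quality vs. article-quality check can be sketched as grouping the judged scores by quality class and comparing the means — hypothetical column names (`article_quality`, `question_quality`), not the actual analysis code.

```python
import pandas as pd

def quality_by_class(scores: pd.DataFrame) -> pd.Series:
    """Mean question-quality score per article quality class."""
    return scores.groupby("article_quality")["question_quality"].mean()
```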

As this evaluation is mostly LLM-based (LLM-as-a-judge), we suggest a more extensive human evaluation.

image.png (414×508 px, 56 KB)

image.png (980×644 px, 65 KB)

I'm closing this ticket with the findings above. Please feel free to add comments for any additional information.