Problem:
We want to find out whether the new model predicts the quality of Items better than the old model. To do this, we will check how the old and the new model each perform on the new training data we collected.
Acceptance criteria:
- we have an overview of how many Items the old model assigns to each class (A/B/C/D/E), compared to the human judgement
- we have an overview of how many Items the new model assigns to each class (A/B/C/D/E), compared to the human judgement
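
One possible way to produce both overviews is a cross-tabulation of human class against model class, built once per model. The sketch below is only an illustration: the function names `class_overview` and `agreement_rate` are hypothetical, the labels are made-up placeholders rather than real results, and it assumes the human and model judgements are available as equally long lists of class letters.

```python
from collections import Counter

CLASSES = ["A", "B", "C", "D", "E"]


def class_overview(human_labels, model_labels):
    """Count Items for every (human class, model class) pair."""
    if len(human_labels) != len(model_labels):
        raise ValueError("label lists must have the same length")
    pairs = Counter(zip(human_labels, model_labels))
    # Outer keys are human classes, inner keys are model classes.
    return {h: {m: pairs[(h, m)] for m in CLASSES} for h in CLASSES}


def agreement_rate(human_labels, model_labels):
    """Fraction of Items where the model matches the human judgement."""
    if not human_labels:
        return 0.0
    matches = sum(h == m for h, m in zip(human_labels, model_labels))
    return matches / len(human_labels)


# Made-up placeholder labels (assumption), not real evaluation data.
human = ["A", "B", "C", "A", "E"]
old_model = ["A", "C", "C", "B", "E"]
new_model = ["A", "B", "C", "B", "E"]

for name, predictions in [("old", old_model), ("new", new_model)]:
    table = class_overview(human, predictions)
    print(f"{name} model (rows: human, columns: model)")
    print("  " + " ".join(CLASSES))
    for h in CLASSES:
        print(h + " " + " ".join(str(table[h][m]) for m in CLASSES))
    print(f"agreement with human: {agreement_rate(human, predictions):.2f}")
```

The diagonal of each table holds the Items where the model agrees with the human judgement, so comparing the two tables side by side shows which classes the new model handles better or worse than the old one.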