Currently, our InferenceServices have their resource requests and limits set based on intuition and on the needs observed while each service was being developed. We can now optimize the existing resource configuration further by exploring the Grafana container details dashboard and verifying actual resource utilization in the production clusters.
This will free up CPU and memory in the LiftWing cluster, giving us room to deploy new services or scale the existing ones further.
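As a starting point for the verification, the container dashboards are typically backed by cAdvisor and kube-state-metrics. A PromQL sketch for comparing actual CPU usage against the configured request could look like the queries below (the `namespace` label value is a hypothetical example; the exact labels depend on our Prometheus setup):

```
# Actual CPU usage in cores, per container, at 5m resolution
max by (container) (
  rate(container_cpu_usage_seconds_total{namespace="revertrisk"}[5m])
)

# Configured CPU requests, per container (kube-state-metrics)
kube_pod_container_resource_requests{resource="cpu", namespace="revertrisk"}
```

If the first query stays well below the second over a representative time window (including load peaks), the request can likely be lowered.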
I've created a table with the InferenceServices we could optimize; services that are already optimized or use few resources in general have been omitted. It's ordered by the biggest potential saving, so let's tackle them from top to bottom:
| Service | Current CPU | Current Mem | New CPU | New Mem | Savings | Status |
|---|---|---|---|---|---|---|
| article-descriptions | 16 | 5Gi | 2 | 5Gi | 14 CPU | ✅ |
| llm/aya-llm | 6 | 8Gi | 1 | 8Gi | 5 CPU | TODO |
| llm/embeddings | 6 | 8Gi | 1 | 8Gi | 5 CPU | TODO |
| edit-check | 4 | 8Gi | 1 | 4Gi | 3 CPU, 4Gi | TODO |
| revertrisk-multilingual | 4 | 6Gi | 2 | 6Gi | 2 CPU | TODO |
| revertrisk-multilingual-pre-save | 4 | 6Gi | 2 | 6Gi | 2 CPU | TODO |
| revertrisk-wikidata | 2 | 4Gi | 1 | 4Gi | 1 CPU | TODO |
| reference-risk | 2 | 2Gi | 1 | 2Gi | 1 CPU | TODO |
| article-country | 2 | 2Gi | 1 | 1Gi | 1 CPU, 1Gi | TODO |
| reference-need | 22 | 6Gi/8Gi | 22 | 7Gi | 1Gi | TODO |
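For each row, the change itself is just an update to the service's resource block. A minimal sketch, assuming a plain KServe InferenceService spec and using revertrisk-wikidata's proposed values from the table (in practice the values live in our deployment charts, so the field paths may differ):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: revertrisk-wikidata
spec:
  predictor:
    containers:
      - name: kserve-container
        resources:
          requests:
            cpu: "1"      # was 2, per the table above
            memory: 4Gi   # unchanged
          limits:
            cpu: "1"
            memory: 4Gi
```

After deploying, we should watch the service's latency and throttling metrics for a while before moving on to the next row, so a too-aggressive cut can be rolled back quickly.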