Steps to replicate the issue (include links if applicable):
Datahub functionality comes to a standstill, throwing Error 500 on almost all sections except login.
What happens?:
After logging in to Datahub and trying to navigate to searching/exploring data or roles management, users report an error message that reads: An unknown error occurred. (code 500)
What should have happened instead?:
Users should be able to access any searches or role/admin capabilities they are authorised to.
Software version :
We have seen this on Datahub v0.10.4
The error persisted on Datahub v0.12.1
Other information :
Upon examining the logs of the datahub-gms-main container, we found that the errors were related to opensearch as below:
2024-05-02 07:45:06,707 [Thread-48424] ERROR c.l.m.s.e.query.ESSearchDAO:109 - Search query failed
org.opensearch.OpenSearchStatusException: OpenSearch exception [type=search_phase_execution_exception, reason=all shards failed]
...
2024-05-02 07:45:06,708 [Thread-48424] ERROR c.l.d.g.e.DataHubDataFetcherExceptionHandler:22 - Failed to execute DataFetcher
java.util.concurrent.CompletionException: java.lang.RuntimeException: Failed to list roles
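To find this failure in a longer GMS log, the error entries above can be matched mechanically. The following is a minimal Python sketch and not part of our tooling; the function name `find_shard_failures` and the regular expression are assumptions derived only from the log format shown above.

```python
import re

# Header of a GMS log entry as it appears in the excerpt above, e.g.
# "2024-05-02 07:45:06,707 [Thread-48424] ERROR c.l.m.s.e.query.ESSearchDAO:109 - ..."
# (assumed format; adjust if the logging pattern differs)
LOG_HEADER = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]+)\] "
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+):(?P<line>\d+) - "
)


def find_shard_failures(log_text):
    """Return (timestamp, logger) for each ERROR entry mentioning 'all shards failed'."""
    headers = list(LOG_HEADER.finditer(log_text))
    hits = []
    for i, match in enumerate(headers):
        # An entry runs from its header up to the next header (or end of text),
        # so stack-trace lines that follow the header are included.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(log_text)
        entry = log_text[match.start():end]
        if match.group("level") == "ERROR" and "all shards failed" in entry:
            hits.append((match.group("ts"), match.group("logger")))
    return hits
```

Because entries are split on the timestamp header rather than on newlines, the sketch works whether the exception text sits on the same line as the header or on the following lines.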
The opensearch cluster has been running okay on all occasions, and a restart of the opensearch cluster has no effect on the errors shown on datahub.
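One way to double-check the cluster state at shard level when the error appears is to list shards that are not in the STARTED state. The sketch below is a suggestion only, not something we have run as part of this investigation. It assumes the `_cat/shards` API with `format=json` is reachable; the helper names and the `base_url` argument are hypothetical placeholders.

```python
import json
import urllib.request


def unhealthy_shards(rows):
    """Given rows from `GET _cat/shards?format=json`, return those not in the STARTED state."""
    return [row for row in rows if row.get("state") != "STARTED"]


def fetch_shards(base_url):
    """Fetch shard rows from an OpenSearch node.

    `base_url` is a placeholder for the real endpoint; authentication and TLS
    settings are omitted and would need to be added for a real cluster.
    """
    url = f"{base_url}/_cat/shards?format=json"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)
```

A call such as `unhealthy_shards(fetch_shards(base_url))` returning an empty list while GMS still logs "all shards failed" would suggest that the problem lies in the indices GMS is querying rather than in the cluster as a whole.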
How we have resolved the issue so far
For the past few occurrences, we have resolved the issue by doing a rolling restart of the datahub service in both codfw and eqiad, as below:
stevemunene@deploy1002:~$ cd /srv/deployment-charts/helmfile.d/services/datahub/
$ helmfile -e codfw --state-values-set roll_restart=1 sync
$ helmfile -e eqiad --state-values-set roll_restart=1 sync
With this, the datahub-main-system-update-job rebuilds and cleans up the indices. Sample log below:
2024-05-02 09:11:10,082 [main] INFO c.l.d.u.s.e.s.BuildIndicesPostStep:73 - Validated index dataprocessinstanceindex_v2 with new settings. Settings: {index.blocks.write=false}, Acknowledged: true
2024-05-02 09:11:10,083 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:15 - Completed Step 3/6: BuildIndicesPostStep successfully.
2024-05-02 09:11:10,083 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:15 - Executing Step 4/6: DataHubStartupStep...
2024-05-02 09:11:10,265 [main] INFO c.l.d.u.s.e.steps.DataHubStartupStep:36 - Initiating startup for version: v0.12.1-20
2024-05-02 09:11:10,266 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:15 - Completed Step 4/6: DataHubStartupStep successfully.
2024-05-02 09:11:10,266 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:15 - Executing Step 5/6: CleanUpIndicesStep...
. . . .
2024-05-02 09:11:12,634 [main] INFO c.l.m.s.e.i.ESIndexBuilder:604 - Checking for orphan index pattern dataprocessinstance_dataprocessinstanceruneventaspect_v1* older than 60 DAYS
2024-05-02 09:11:12,668 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:15 - Completed Step 5/6: CleanUpIndicesStep successfully.
2024-05-02 09:11:12,668 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:15 - Skipping Step 6/6: BackfillBrowsePathsV2Step...
2024-05-02 09:11:12,668 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:15 - Success! Completed upgrade with id SystemUpdate successfully.
2024-05-02 09:11:12,669 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:15 - Upgrade SystemUpdate completed with result SUCCEEDED. Exiting...
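To confirm after a rolling restart that the system-update job finished cleanly, the log above can be summarised by step and final result. This is a minimal Python sketch under the assumption that the message wording matches the sample log exactly; `summarize_upgrade` and both patterns are illustrative names, not part of DataHub.

```python
import re

# Patterns taken from the wording of the sample log above (assumed stable).
STEP = re.compile(
    r"(?P<verb>Completed|Skipping) Step (?P<n>\d+)/(?P<total>\d+): (?P<name>\w+)"
)
RESULT = re.compile(r"Upgrade (?P<id>\S+) completed with result (?P<result>[A-Z_]+)")


def summarize_upgrade(log_text):
    """Return the final upgrade result and the steps reported as completed or skipped."""
    steps = [
        (int(m.group("n")), m.group("name"), m.group("verb"))
        for m in STEP.finditer(log_text)
    ]
    result = RESULT.search(log_text)
    return {
        "upgrade_id": result.group("id") if result else None,
        "result": result.group("result") if result else None,
        "steps": steps,
    }
```

A `result` of `None` means no completion line was found, for example because the job is still running or exited before reporting.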
Next steps
We are in the process of moving datahub from wikiKube to the dse-k8s-eqiad cluster (T361185), which we hope will reduce or resolve the occurrence of this, as we shall no longer have two gms copies (codfw and eqiad) accessing the opensearch cluster.