
Datahub occasional 500 errors
Closed, Resolved · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):
Datahub functionality comes to a standstill, throwing Error 500 on almost all sections except login.

What happens?:
Upon logging in to Datahub and trying to navigate to data search/exploration or role management, users report an error message that reads: "An unknown error occurred. (code 500)"

image.png (447×1 px, 46 KB)

image.png (448×1 px, 60 KB)

What should have happened instead?:
Users should be able to access any searches or role/admin capabilities they are authorised to.

Software version:
We have seen this on Datahub v0.10.4
The error persisted on Datahub v0.12.1

Other information:
Upon examining the logs of the datahub-gms-main container, we found that the errors were related to OpenSearch, as below:

2024-05-02 07:45:06,707 [Thread-48424] ERROR c.l.m.s.e.query.ESSearchDAO:109 - Search query failed
org.opensearch.OpenSearchStatusException: OpenSearch exception [type=search_phase_execution_exception, reason=all shards failed]
...
2024-05-02 07:45:06,708 [Thread-48424] ERROR c.l.d.g.e.DataHubDataFetcherExceptionHandler:22 - Failed to execute DataFetcher
java.util.concurrent.CompletionException: java.lang.RuntimeException: Failed to list roles
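
A hedged sketch of how we pull such GMS log lines for inspection; the namespace and deployment names below are assumptions based on the chart layout, not verified values, so adjust them to the actual helm release.

```shell
# Hypothetical namespace/deployment names; the real ones depend on the
# helm release, so treat this as a sketch rather than the exact command.
ns="datahub"
deploy="datahub-gms-main"

# Pull recent GMS logs and keep only the OpenSearch-related errors.
lines=$(kubectl -n "$ns" logs "deployment/$deploy" --since=2h 2>/dev/null \
  | grep -E "ESSearchDAO|OpenSearchStatusException|Failed to list roles") \
  || lines="no GMS logs reachable from this host"
echo "$lines"
```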

On every occasion the OpenSearch cluster itself has been healthy, and restarting the OpenSearch cluster has no effect on the errors shown in Datahub.
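
Our assessment that the cluster itself was healthy is based on the standard OpenSearch health APIs. A minimal sketch, assuming the default port and a placeholder host (the real endpoint depends on the deployment):

```shell
# Placeholder endpoint; the real host/port depend on the deployment.
OS_HOST="${OS_HOST:-http://localhost:9200}"

# _cluster/health reports overall status (green/yellow/red); _cat/shards
# lists per-shard state, where UNASSIGNED shards would explain the
# "all shards failed" search_phase_execution_exception above.
if health=$(curl -sf "${OS_HOST}/_cluster/health?pretty" 2>/dev/null); then
  echo "$health"
  curl -s "${OS_HOST}/_cat/shards?v" | grep -v STARTED || true
else
  health="cluster unreachable at ${OS_HOST}"
  echo "$health"
fi
```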

How we have resolved the issue so far
For the past few occurrences, we have resolved the issue by performing a rolling restart of the cluster as below:

stevemunene@deploy1002:~$ cd /srv/deployment-charts/helmfile.d/services/datahub/
$ helmfile -e codfw --state-values-set roll_restart=1 sync
$ helmfile -e eqiad --state-values-set roll_restart=1 sync

With this, the datahub-main-system-update-job rebuilds and cleans up the indices; a sample log is below:

2024-05-02 09:11:10,082 [main] INFO  c.l.d.u.s.e.s.BuildIndicesPostStep:73 - Validated index dataprocessinstanceindex_v2 with new settings. Settings: {index.blocks.write=false}, Acknowledged: true
2024-05-02 09:11:10,083 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:15 - Completed Step 3/6: BuildIndicesPostStep successfully.
2024-05-02 09:11:10,083 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:15 - Executing Step 4/6: DataHubStartupStep...
2024-05-02 09:11:10,265 [main] INFO  c.l.d.u.s.e.steps.DataHubStartupStep:36 - Initiating startup for version: v0.12.1-20
2024-05-02 09:11:10,266 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:15 - Completed Step 4/6: DataHubStartupStep successfully.
2024-05-02 09:11:10,266 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:15 - Executing Step 5/6: CleanUpIndicesStep...
...
2024-05-02 09:11:12,634 [main] INFO  c.l.m.s.e.i.ESIndexBuilder:604 - Checking for orphan index pattern dataprocessinstance_dataprocessinstanceruneventaspect_v1* older than 60 DAYS
2024-05-02 09:11:12,668 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:15 - Completed Step 5/6: CleanUpIndicesStep successfully.
2024-05-02 09:11:12,668 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:15 - Skipping Step 6/6: BackfillBrowsePathsV2Step...
2024-05-02 09:11:12,668 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:15 - Success! Completed upgrade with id SystemUpdate successfully.
2024-05-02 09:11:12,669 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:15 - Upgrade SystemUpdate completed with result SUCCEEDED. Exiting...

Next steps
We are in the process of moving Datahub from WikiKube to the dse-k8s-eqiad cluster (T361185), which we hope will reduce or resolve the occurrence of this issue, as we will no longer have two GMS copies (codfw and eqiad) accessing the OpenSearch cluster.

Event Timeline

Gehel triaged this task as High priority. May 10 2024, 8:36 AM
Gehel subscribed.

Let's revisit this after the migration to k8s and see how it affects the issue.

Datahub is now being served from dse-k8s, but I think I would start monitoring for stability once we have completed T366338: Delete datahub WikiKube release references.
Until that point, we will still potentially have three GMS services configuring the same set of OpenSearch indices.

BTullis claimed this task.

I think that we can close this issue, as we have not seen another occurrence since the migration of datahub to dse-k8s-eqiad.