Some users' presto queries are no longer working in Superset
Closed, Resolved | Public | BUG REPORT

Assigned To
Authored By
BTullis
Jan 27 2023, 5:23 PM
Referenced Files
F36817549: image.png
Feb 9 2023, 1:26 PM
F36563610: image.png
Jan 31 2023, 4:02 PM
F36563608: image.png
Jan 31 2023, 4:02 PM
F36525592: image.png
Jan 27 2023, 5:23 PM
F36525594: image.png
Jan 27 2023, 5:23 PM

Description

Data Engineering Bug Report or Data Problem Form.

Please fill out the following

What kind of problem are you reporting?

  • Access related problem
  • Service related problem
  • Data related problem
For a service related problem:
  • What is the nature of the issue? Queries to presto from Superset are failing for two users, possibly also others.
  • What are the steps to reproduce the issue? The users affected are matmarex and cmyrick
  • What happens? Unexpected error from the presto database
Users known to be affected
Users known to be unaffected

image.png (1×2 px, 179 KB)

image.png (206×1 px, 9 KB)

  • What should happen instead?

This even happens if the simple query SELECT 1 is entered in the SQLLab console.

For the DE Team to fill out
Which systems does this affect?
  • Superset
Impact Assessment:

Does this problem qualify as an incident?

  • Yes
  • No

Does this violate an SLO?

  • Yes - (for the specific users)
  • No
Value Calculator | Rank
Will this improve the efficiency of a team's workflow? | 3
Does this have an effect on our Core Metrics? | ?
Does this align with our strategic goals? | 3
Is this a blocker for another team? | 3

Event Timeline

BTullis triaged this task as High priority. Jan 27 2023, 5:27 PM
BTullis raised the priority of this task from High to Unbreak Now!. Jan 31 2023, 9:49 AM
BTullis added subscribers: SNowick_WMF, Mayakp.wiki, mpopov.

Raising the priority of this to unbreak now.

Two more users have reported that this incident affects them: @SNowick_WMF and @mpopov

Users believed to be unaffected are: @BTullis (me) and @Mayakp.wiki

I believe that the errors may be related to the user impersonation feature and how Superset ascertains the user's shell account name from their Mediawiki Developer account name. I'll investigate more.

BTullis renamed this task from Some users' presto queries are no longer working in Superset : matmarex and cmyrick to Some users' presto queries are no longer working in Superset. Jan 31 2023, 10:05 AM
BTullis updated the task description.

Here is what is supposed to happen.

  • User tries to access https://superset.wikimedia.org/
  • Apache picks up the request, requires CAS authentication and redirects the user to https://idp.wikimedia.org/login
  • The user enters their cn value as their username, which may contain spaces and utf8 characters.
  • The cn is downcased and CAS attempts to authenticate to a read-only LDAP replica with the credentials.
  • Upon successful authentication of the password, CAS also verifies membership of any one of the ops, wmf, or nda LDAP groups.
  • At this point, authentication and authorization have succeeded, so the user is redirected back to the original https://superset.wikimedia.org/ URL, but with the requisite headers added.
  • One of these headers is HTTP_X_CAS_UID, which is copied to the REMOTE_USER variable here.

Note that we used to use HTTP_X_REMOTE_USER as the source of the user name before CAS was introduced to both the production and staging instances of Superset. That configuration no longer exists on either instance, although there are still fragments and comments relating to it in puppet.

  • Superset logs in the user with the account referred to by REMOTE_USER and creates it if necessary. We know that this is the uid value.
  • When Superset attempts to run a presto query, it attempts to run it as REMOTE_USER and checks that they are a member of the analytics-privatedata-users group.
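
The header-to-REMOTE_USER step described above can be sketched as a small WSGI middleware. This is only an illustration of the idea, not the configuration actually deployed for Superset: the class name RemoteUserFromCasHeader and the demo application are hypothetical, and only the HTTP_X_CAS_UID and REMOTE_USER names come from the description above.

```python
class RemoteUserFromCasHeader:
    """Hypothetical WSGI middleware: copy the CAS uid header into REMOTE_USER."""

    def __init__(self, app, header="HTTP_X_CAS_UID"):
        self.app = app
        self.header = header

    def __call__(self, environ, start_response):
        uid = environ.get(self.header)
        if uid:
            # Downstream code that trusts REMOTE_USER now sees the uid value.
            environ["REMOTE_USER"] = uid
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in for the real application; echoes the user it was given."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [environ.get("REMOTE_USER", "").encode()]


wrapped = RemoteUserFromCasHeader(demo_app)
body = wrapped({"HTTP_X_CAS_UID": "exampleuser"}, lambda status, headers: None)
print(body)
```

If the header is absent, REMOTE_USER is left untouched, so the application sees an unauthenticated request rather than a guessed user name.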

So that explains how it should work, but I haven't yet ascertained why it isn't working for these users.

I've checked with the Infrastructure-Foundations team about possible changes to the CAS-SSO system during the last week, but there is nothing obvious that has changed.

I wonder whether it would be helpful to check what happens if the users log out of CAS and then back in again. Would that perhaps solve the issue?

I've asked users to try logging out of CAS and back in. It seemed to work for @Stevemunene but @matmarex has reported no improvement.

I can see the following entry in the superset logs, 2 seconds after @matmarex logged in.

Jan 31 12:48:09 an-tool1010 superset[10560]: [2023-01-31 12:48:09 +0000] [7] [ERROR] Error handling request
Jan 31 12:48:09 an-tool1010 superset[10560]: Traceback (most recent call last):
Jan 31 12:48:09 an-tool1010 superset[10560]:   File "/srv/deployment/analytics/superset/venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 113, in handle_request
Jan 31 12:48:09 an-tool1010 superset[10560]:     resp.write_file(respiter)
Jan 31 12:48:09 an-tool1010 superset[10560]:   File "/srv/deployment/analytics/superset/venv/lib/python3.9/site-packages/gunicorn/http/wsgi.py", line 385, in write_file
Jan 31 12:48:09 an-tool1010 superset[10560]:     if not self.sendfile(respiter):
Jan 31 12:48:09 an-tool1010 superset[10560]:   File "/srv/deployment/analytics/superset/venv/lib/python3.9/site-packages/gunicorn/http/wsgi.py", line 375, in sendfile
Jan 31 12:48:09 an-tool1010 superset[10560]:     self.sock.sendfile(respiter.filelike, count=nbytes)
Jan 31 12:48:09 an-tool1010 superset[10560]:   File "/srv/deployment/analytics/superset/venv/lib/python3.9/site-packages/gevent/_socket3.py", line 495, in sendfile
Jan 31 12:48:09 an-tool1010 superset[10560]:     return self._sendfile_use_send(file, offset, count)
Jan 31 12:48:09 an-tool1010 superset[10560]:   File "/srv/deployment/analytics/superset/venv/lib/python3.9/site-packages/gevent/_socket3.py", line 425, in _sendfile_use_send
Jan 31 12:48:09 an-tool1010 superset[10560]:     self._check_sendfile_params(file, offset, count)
Jan 31 12:48:09 an-tool1010 superset[10560]:   File "/srv/deployment/analytics/superset/venv/lib/python3.9/site-packages/gevent/_socket3.py", line 470, in _check_sendfile_params
Jan 31 12:48:09 an-tool1010 superset[10560]:     raise ValueError(
Jan 31 12:48:09 an-tool1010 superset[10560]: ValueError: count must be a positive integer (got 0)
<attribute name="authenticationDate">
      <value>2023-01-31T12:48:07.334315Z</value>
</attribute>
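
For reference, the ValueError at the end of that traceback can be reproduced outside Superset. The sketch below uses Python's standard library socket rather than gevent, on the assumption that gevent's _check_sendfile_params mirrors the standard library check of the same name. It only shows that a sendfile call with a count of 0 is rejected; it does not explain why the request in the log ended up with a zero-length file response. The function name is hypothetical.

```python
import io
import socket


def try_sendfile_with_zero_count():
    """Call socket.sendfile with count=0 and return the error message, if any."""
    left, right = socket.socketpair()
    try:
        # An empty in-memory file with count=0, similar to the nbytes=0 case above.
        left.sendfile(io.BytesIO(b""), count=0)
    except ValueError as exc:
        return str(exc)
    finally:
        left.close()
        right.close()
    return None


print(try_sendfile_with_zero_count())
```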

I believe that this incident can now be considered resolved. The main problem seems to be that user permissions were missing, but that in previous versions this didn't matter due to bugs in the code.

We ascertained that the users were all trying to run queries from SQL Lab (https://superset.wikimedia.org/superset/sqllab/) that had previously worked for them in version 1.4.2 and prior to that.

Since the upgrade to version 1.5.3 the SQL Lab tab disappeared for users who have the Alpha role. Compare these two images.

image.png (2×3 px, 262 KB)

image.png (2×3 px, 240 KB)

According to the documentation regarding roles, this is as it should be:

However, it seems that in all versions up to and including 1.4.2, our users who were only in the Alpha role were permitted to run queries in SQL Lab.

If they accessed SQL Lab directly, such as from a bookmark or link, it would work. I'm not exactly sure when the top bar link disappeared for users, but it's only with this update that the permissions on running queries in SQL Lab seem to be correctly applied.

There are a couple of relevant issues/patches in Superset's GitHub repo, such as:

However, now that this permission system is supposedly working as it should, I think we should create a follow-up ticket to review the permissions and assign the sql_lab role wherever it is needed.
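
As a toy model of the behaviour described above (not Superset's actual permission code), a user's effective permissions can be treated as the union of the permissions of their roles. The permission names below are hypothetical placeholders; only the Alpha and sql_lab role names come from this task.

```python
# Toy model only: the permission names are hypothetical, not Superset's real ones.
ROLE_PERMISSIONS = {
    "Alpha": {"view_dashboards", "edit_charts"},
    "sql_lab": {"run_sql_lab_query"},
}


def can_run_sql_lab(user_roles):
    """Return True if any of the user's roles grants the SQL Lab permission."""
    granted = set()
    for role in user_roles:
        granted |= ROLE_PERMISSIONS.get(role, set())
    return "run_sql_lab_query" in granted


print(can_run_sql_lab(["Alpha"]))
print(can_run_sql_lab(["Alpha", "sql_lab"]))
```

Under this model a user with only the Alpha role is refused, while adding the sql_lab role alongside it grants access, which matches the fix applied to individual accounts here.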

BTullis lowered the priority of this task from Unbreak Now! to Medium. Jan 31 2023, 4:03 PM
BTullis moved this task from Incoming (new tickets) to Visualize on the Data-Engineering board.

My SQL Lab on superset has also not been working for the past week or so!

Please try again now @Htriedman - apologies for the inconvenience.

Up and running! thanks for the help

My SQL Lab doesn't work either. I tried to log out of superset but the Logout menu brings me back to the same page, still logged-in. I've logged out of idp.wmo and back in.

@awight - I have added the sql_lab role to your account, so it should work now. Apologies for the inconvenience. Please see T328457: Grant all authenticated users access to SQL Lab in Superset for more information about why it happened.

@jwang - Please could you elaborate on what's not working? I've checked and the account that I believe is yours already has the sql_lab role applied. Is there something else about it that isn't working, or can you not access it at all? Thanks.

image.png (284×1 px, 26 KB)

@BTullis , thanks for the followup. My account works now.