Page MenuHomePhabricator

Quarry suggests invalid database names, and doesn't suggest some valid database names
Open, Needs TriagePublic

Description

While the recent implementation of database name suggestion in T287471 is helpful, it has numerous invalid entries. As mentioned in T287471#7300959, this doesn't do much validation, and so it suggests things like "commonwiki_p" instead of "commonswiki_p". Additionally, it won't suggest e.g. iuwiki_p, even though that's a perfectly valid database name.

With the new database structure, there's no obvious way of finding all of the valid database names. Running SELECT dbname from wiki; on meta_p gives all the wiki databases, but leaves out those such as meta_p itself. In addition to the wikis themselves, there are at least:

  • meta_p
  • centralauth_p
  • information_schema
  • information_schema_p
  • heartbeat_p

though I don't believe this is exhaustive.

Event Timeline

With the new database structure, there's no obvious way of finding all of the valid database names.

The replicas are still organized in sections, so it's still possible to enumerate all available databases. You just have to run SHOW DATABASES for every host in s[1-8].analytics.db.svc.wikimedia.cloud. Probably would make sense to cache that somewhere and update it every so often. Of course, if the section organization changes, this method would break.

The other option would be to query something like all.dblist and maintain a separate list of non-wiki databases (meta_p, heartbeat_p, centralauth_p). That has the disadvantage of having to keep that list up-to-date if it ever changes.

I'm not sure which method would be better in the long run, but all.dblist is the "more supported" method.

information_schema should not be included, as querying it (apparently) causes problems. information_schema_p is also disabled in Quarry, because QueryRevision.is_allowed doesn't distinguish between the two.

heartbeat_p actually shouldn't be included in the list, because with the multi-instance databases it only includes lag for the current slice. So you need to connect to a different database and then query heartbeat_p.heartbeat. So it's only meta_p and centralauth_p that need to be added. So all.dblist (or maybe open.dblist, or all.dblist - private.dblist) seems like the better option

the wiki replicas exclude private + closed wikis so you'd need open - private

All of the invalid database names can be removed from the suggestions if we limit them to only include db names against which successful queries have been run in the past. This doesn't seem to significantly increase the internal query runtime.

An ideal solution probably would be to maintain the list of valid dbnames through a cronjob.

With PR-24 closed, should this task be closed as resolved?