Page MenuHomePhabricator

Platform Metrics are failing to update
Closed, ResolvedPublic

Description

Metrics haven't been updated since August 17th.

Looks maybe like it is failing with
production.ERROR: SQLSTATE[HY000]: General error: 1615 Prepared statement needs to be re-prepared {"exception":"[object] (PDOException(code: HY000): SQLSTATE[HY000]: General error: 1615 Prepared statement needs to be re-prepared at /var/www/html/app/Jobs/PlatformStatsSummaryJob.php:191 see this log

Event Timeline

Didn't see any obvious deploys happening around August 17th. I wonder if this prepared statement just got too long or something? Is there some cache size we need to bump?

Thanks!

You think we should just bump the number again? I'm assuming it is due to how many tables this prepared statement emulation opens in one go? We already doubled it last time, less than a year ago. If in understand right it went from 2000 to 4096 just 10 months ago?

I investigated this some more.

There was a concern from Andrew that for some reason the table_open_cache hadn't actually been increased from 2000 to 4096 as per this PR from T348842 . However, it seems to me on further investigation that we were confused by the discrepancy between the primary and secondary. From the mariadbs (on staging):

MariaDB [(secondary)]> SHOW VARIABLES LIKE 'table_%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| table_definition_cache     | 400   |
| table_open_cache           | 2000  |
| table_open_cache_instances | 8     |
+----------------------------+-------+
MariaDB [primary]> SHOW VARIABLES LIKE 'table_%'
    -> ;
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| table_definition_cache     | 4096  |
| table_open_cache           | 4096  |
| table_open_cache_instances | 8     |
+----------------------------+-------+

As we can see the PR only increased this setting on the primary from this line.

We also checked this on production and saw that this is set to 4096 and then currently open tables (on the primary) is 4096.

MariaDB [(primary)]> SHOW VARIABLES LIKE 'table_%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| table_definition_cache     | 4096  |
| table_open_cache           | 4096  |
| table_open_cache_instances | 8     |
+----------------------------+-------+
3 rows in set (0.001 sec)

MariaDB [(primary)]> SHOW GLOBAL STATUS LIKE 'open_%'
    -> ;
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| Open_files               | 27    |
| Open_streams             | 4     |
| Open_table_definitions   | 4096  |
| Open_tables              | 4096  |
| Opened_files             | 64066 |
| Opened_plugin_libraries  | 0     |
| Opened_table_definitions | 52538 |
| Opened_tables            | 52564 |
| Opened_views             | 0     |
+--------------------------+-------+

We're pretty sure that after August 17th the number of total wikis rose above 1366 which would then mean more than 4096 concurrently opened tables assuming this query opens 3 tables per wiki (even for deleted wikis)

image.png (333×680 px, 46 KB)

We (@dang and @Deniz_WMDE) are now trying to determine what would be a safe number to raise this to. Tom thinks this is a sign of a badly written query and that concurrently opening all these tables is probably not sustainable in the long run but we talked about being unsure how is best to refactor this. Tom also mentioned that he was in favour of getting this type of logic removed from the API and chatted to Rosalie and Andrew about this just yesterday but he didn't see this exact problem as a reason to do it.

I bumped table_open_cache and table_definition_cache to 5120 (2^12+2^10) but I don't think it's a sustainable solution.

https://github.com/wmde/wbaas-deploy/pull/1765

Tarrow claimed this task.