
load-datasets.py: Lock wait timeout exceeded; try restarting transaction
Closed, Resolved · Public · Bug Report

Description

kharlan@deploy1002:~$ kubectl logs -f linkrecommendation-production-load-datasets-1617051600-kzbcb
== Initializing ==
   Ensuring checksum table exists...[OK]
   Ensuring model table exists...[OK]
   Ensuring anchors table exists...[OK]
   Ensuring redirects table exists...[OK]
   Ensuring pageids table exists...[OK]
   Ensuring w2vfiltered table exists...[OK]
   Ensuring model table exists...[OK]
   Ensuring anchors table exists...[OK]
   Ensuring redirects table exists...[OK]
   Ensuring pageids table exists...[OK]
   Ensuring w2vfiltered table exists...[OK]
   Ensuring model table exists...[OK]
   Ensuring anchors table exists...[OK]
   Ensuring redirects table exists...[OK]
   Ensuring pageids table exists...[OK]
   Ensuring w2vfiltered table exists...[OK]
   Ensuring model table exists...[OK]
   Ensuring anchors table exists...[OK]
   Ensuring redirects table exists...[OK]
   Ensuring pageids table exists...[OK]
   Ensuring w2vfiltered table exists...[OK]
   Ensuring model table exists...[OK]
   Ensuring anchors table exists...[OK]
   Ensuring redirects table exists...[OK]
   Ensuring pageids table exists...[OK]
   Ensuring w2vfiltered table exists...[OK]
   Ensuring model table exists...[OK]
   Ensuring anchors table exists...[OK]
   Ensuring redirects table exists...[OK]
   Ensuring pageids table exists...[OK]
   Ensuring w2vfiltered table exists...[OK]
   Ensuring model table exists...[OK]
   Ensuring anchors table exists...[OK]
   Ensuring redirects table exists...[OK]
   Ensuring pageids table exists...[OK]
   Ensuring w2vfiltered table exists...[OK]
   Ensuring model table exists...[OK]
   Ensuring anchors table exists...[OK]
   Ensuring redirects table exists...[OK]
   Ensuring pageids table exists...[OK]
   Ensuring w2vfiltered table exists...[OK]
   Ensuring model table exists...[OK]
   Beginning process to load datasets for cswiki, simplewiki, arwiki, viwiki, bnwiki, enwiki, frwiki, elwiki
== Attempting to download datasets (anchors, redirects, pageids, w2vfiltered, model) for cswiki ==
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/cswiki/lr_cswiki_anchors.sql.gz...[OK]
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/cswiki/lr_cswiki_redirects.sql.gz...[OK]
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/cswiki/lr_cswiki_pageids.sql.gz...[OK]
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/cswiki/lr_cswiki_w2vfiltered.sql.gz...[OK]
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/cswiki/cswiki.linkmodel.json...[OK]
== Importing datasets (anchors, redirects, pageids, w2vfiltered, model) for cswiki ==
   Verifying file and checksum exists for anchors...[OK]
   Verifying checksum for anchors...[OK]
   Verifying file and checksum exists for redirects...[OK]
   Verifying checksum for redirects...[OK]
   Verifying file and checksum exists for pageids...[OK]
   Verifying checksum for pageids...[OK]
   Verifying file and checksum exists for w2vfiltered...[OK]
   Verifying checksum for w2vfiltered...[OK]
   Verifying file and checksum exists for model...[OK]
   Verifying checksum for model...[OK]
   Processing dataset: anchors
Traceback (most recent call last):
  File "load-datasets.py", line 351, in <module>
    main()
  File "load-datasets.py", line 301, in main
    "DELETE FROM {tablename}".format(tablename=tablename)
  File "/opt/lib/python/site-packages/pymysql/cursors.py", line 163, in execute
    result = self._query(query)
  File "/opt/lib/python/site-packages/pymysql/cursors.py", line 321, in _query
    conn.query(q)
  File "/opt/lib/python/site-packages/pymysql/connections.py", line 505, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/opt/lib/python/site-packages/pymysql/connections.py", line 724, in _read_query_result
    result.read()
  File "/opt/lib/python/site-packages/pymysql/connections.py", line 1069, in read
    first_packet = self.connection._read_packet()
  File "/opt/lib/python/site-packages/pymysql/connections.py", line 676, in _read_packet
    packet.raise_for_error()
  File "/opt/lib/python/site-packages/pymysql/protocol.py", line 223, in raise_for_error
    err.raise_mysql_exception(self._data)
  File "/opt/lib/python/site-packages/pymysql/err.py", line 107, in raise_mysql_exception
    raise errorclass(errno, errval)
pymysql.err.OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')
     Deleting all values from lr_cswiki_anchors...

It's possible that https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/673347 (created to fix T275359: pymysql.err.OperationalError: (1412, 'Table definition has changed, please retry transaction')) is related; it's the only recent change to load-datasets.py.

Event Timeline

kostajh triaged this task as High priority. Mar 29 2021, 9:11 PM
kostajh changed the subtype of this task from "Task" to "Bug Report".
kostajh moved this task from Backlog to March 29 - April 2 on the Add-Link board.

This is blocking updates to existing datasets as well as imports of new datasets.

Tagging DBA, if you all have some insights to help us track down the problem, that would be appreciated. Thanks in advance!

jcrespo subscribed.

"Lock wait timeout exceeded; try restarting transaction" means 2 writes are happening at the same time exclusively, blocking the same rows .

As someone who takes care of exporting and importing lots of data for backups, I reiterate the offer I made a few weeks ago to set up a more permanent/reliable service long term. One option could be to import into a passive server and then switch the active one after finishing the imports; this would prevent contention like the one shown here.

Otherwise it is a question of why the importing logic is trying to write to the same set of rows twice, and of avoiding that (it depends on the actual queries done). Sometimes it is a question of a lack of proper indexing to avoid extra locking.

Is this happening deterministically? ensure_table_exists uses a separate connection and closes it when it finishes so I don't see how it could interfere with locking.

The default MySQL lock wait timeout is 50 sec, which seems too long for the blocking to be caused by the web service (which doesn't create explicit locks, but in serializable isolation mode selects do result in locks which interfere with an unconditional delete, AIUI).

The other option would be that two instances of the import job are running in parallel for some reason and one of them is trying to delete the table while the other is already importing data in it.

Is this happening deterministically? ensure_table_exists uses a separate connection and closes it when it finishes so I don't see how it could interfere with locking.

Yes, it's happening deterministically, although it looks like cswiki and simplewiki have been updated successfully since I filed this task.

$ ssh deployment.eqiad.wmnet
$ kube_env linkrecommendation eqiad
$ kubectl get pods
linkrecommendation-production-load-datasets-1616997600-8tkcv   0/1     Completed   0          38h
linkrecommendation-production-load-datasets-1617001200-g9bds   0/1     Completed   0          37h
linkrecommendation-production-load-datasets-1617004800-77j4h   0/1     Completed   0          36h
linkrecommendation-production-load-datasets-1617134400-6cqjz   0/1     Error       0          9m53s
linkrecommendation-production-load-datasets-1617134400-sdrv8   0/1     Error       0          7m47s

The default MySQL lock wait timeout is 50 sec, which seems too long for the blocking to be caused by the web service (which doesn't create explicit locks, but in serializable isolation mode selects do result in locks which interfere with an unconditional delete, AIUI).

The other option would be that two instances of the import job are running in parallel for some reason and one of them is trying to delete the table while the other is already importing data in it.

That appears to be what is going on:

kubectl logs linkrecommendation-production-load-datasets-1617134400-6cqjz
== Initializing ==
   Ensuring checksum table exists...[OK]
  [snip]
   Beginning process to load datasets for cswiki, simplewiki, arwiki, viwiki, bnwiki, enwiki, frwiki, elwiki, dewiki
== Attempting to download datasets (anchors, redirects, pageids, w2vfiltered, model) for cswiki ==
   Checksum in database matches remote checksum, skipping download for anchors
   Checksum in database matches remote checksum, skipping download for redirects
   Checksum in database matches remote checksum, skipping download for pageids
   Checksum in database matches remote checksum, skipping download for w2vfiltered
   Checksum in database matches remote checksum, skipping download for model
   All datasets for cswiki are up-to-date!
== Attempting to download datasets (anchors, redirects, pageids, w2vfiltered, model) for simplewiki ==
   Checksum in database matches remote checksum, skipping download for anchors
   Checksum in database matches remote checksum, skipping download for redirects
   Checksum in database matches remote checksum, skipping download for pageids
   Checksum in database matches remote checksum, skipping download for w2vfiltered
   Checksum in database matches remote checksum, skipping download for model
   All datasets for simplewiki are up-to-date!
== Attempting to download datasets (anchors, redirects, pageids, w2vfiltered, model) for arwiki ==
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/arwiki/lr_arwiki_anchors.sql.gz...[OK]
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/arwiki/lr_arwiki_redirects.sql.gz...[OK]
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/arwiki/lr_arwiki_pageids.sql.gz...[OK]
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/arwiki/lr_arwiki_w2vfiltered.sql.gz...[OK]
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/arwiki/arwiki.linkmodel.json...[OK]
== Importing datasets (anchors, redirects, pageids, w2vfiltered, model) for arwiki ==
   Verifying file and checksum exists for anchors...[OK]
   Verifying checksum for anchors...[OK]
   Verifying file and checksum exists for redirects...[OK]
   Verifying checksum for redirects...[OK]
   Verifying file and checksum exists for pageids...[OK]
   Verifying checksum for pageids...[OK]
   Verifying file and checksum exists for w2vfiltered...[OK]
   Verifying checksum for w2vfiltered...[OK]
   Verifying file and checksum exists for model...[OK]
   Verifying checksum for model...[OK]
   Processing dataset: anchors
Traceback (most recent call last):
  File "load-datasets.py", line 351, in <module>
    main()
  File "load-datasets.py", line 301, in main
    "DELETE FROM {tablename}".format(tablename=tablename)
  File "/opt/lib/python/site-packages/pymysql/cursors.py", line 163, in execute
kubectl logs -f linkrecommendation-production-load-datasets-1617134400-sdrv8
== Initializing ==
   Ensuring checksum table exists...[OK]
   [snip]
   Beginning process to load datasets for cswiki, simplewiki, arwiki, viwiki, bnwiki, enwiki, frwiki, elwiki, dewiki
== Attempting to download datasets (anchors, redirects, pageids, w2vfiltered, model) for cswiki ==
   Checksum in database matches remote checksum, skipping download for anchors
   Checksum in database matches remote checksum, skipping download for redirects
   Checksum in database matches remote checksum, skipping download for pageids
   Checksum in database matches remote checksum, skipping download for w2vfiltered
   Checksum in database matches remote checksum, skipping download for model
   All datasets for cswiki are up-to-date!
== Attempting to download datasets (anchors, redirects, pageids, w2vfiltered, model) for simplewiki ==
   Checksum in database matches remote checksum, skipping download for anchors
   Checksum in database matches remote checksum, skipping download for redirects
   Checksum in database matches remote checksum, skipping download for pageids
   Checksum in database matches remote checksum, skipping download for w2vfiltered
   Checksum in database matches remote checksum, skipping download for model
   All datasets for simplewiki are up-to-date!
== Attempting to download datasets (anchors, redirects, pageids, w2vfiltered, model) for arwiki ==
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/arwiki/lr_arwiki_anchors.sql.gz...[OK]
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/arwiki/lr_arwiki_redirects.sql.gz...[OK]
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/arwiki/lr_arwiki_pageids.sql.gz...[OK]
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/arwiki/lr_arwiki_w2vfiltered.sql.gz...[OK]
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/arwiki/arwiki.linkmodel.json...[OK]
== Importing datasets (anchors, redirects, pageids, w2vfiltered, model) for arwiki ==
   Verifying file and checksum exists for anchors...[OK]
   Verifying checksum for anchors...[OK]
   Verifying file and checksum exists for redirects...[OK]
   Verifying checksum for redirects...[OK]
   Verifying file and checksum exists for pageids...[OK]
   Verifying checksum for pageids...[OK]
   Verifying file and checksum exists for w2vfiltered...[OK]
   Verifying checksum for w2vfiltered...[OK]
   Verifying file and checksum exists for model...[OK]
   Verifying checksum for model...[OK]
   Processing dataset: anchors
Traceback (most recent call last):
  File "load-datasets.py", line 351, in <module>
    main()
  File "load-datasets.py", line 301, in main
    "DELETE FROM {tablename}".format(tablename=tablename)
  File "/opt/lib/python/site-packages/pymysql/cursors.py", line 163, in execute

Change 675900 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] linkrecommendation: Bump version, adjust timeout, disable cron in staging

https://gerrit.wikimedia.org/r/675900

Change 675900 merged by jenkins-bot:

[operations/deployment-charts@master] linkrecommendation: Bump version, adjust timeout, disable cron in staging

https://gerrit.wikimedia.org/r/675900

"Lock wait timeout exceeded; try restarting transaction" means 2 writes are happening at the same time exclusively, blocking the same rows .

As someone who takes care of exporting and importing lots of data for backups, I reiterate the offer I made a few weeks ago to set up a more permanent/reliable service long term. One option could be to import into a passive server and then switch the active one after finishing the imports; this would prevent contention like the one shown here.

@jcrespo thanks for the suggestion. I'm not sure how we'd go about implementing that with our current setup. Our existing pipeline is:

  1. a developer on stat1008 runs a script (run-pipeline.sh in the mwaddlink repo) that generates a model and datasets and exports MySQL dumps for the datasets.
  2. the developer runs publish-datasets.sh in the mwaddlink repo; this copies the model, dataset dumps, and checksums to a directory that is rsynced within 30 minutes to https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/
  3. our production deployment of the linkrecommendation service has a cronjob template that is enabled for the eqiad and codfw clusters (it is disabled for the staging and external traffic releases).
  4. The codfw cronjob talks to m2-master.codfw.wmnet and the eqiad job talks to m2-master.eqiad.wmnet
    1. the codfw cronjob errors out because the database is in read-only mode; that's fine
  5. the cronjob runs a script, load-datasets.py. The script looks at analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink to see which wikis have datasets, then compares the remote checksums with what is stored in the database; if a checksum doesn't exist locally or doesn't match, it downloads the dataset and begins the import. The import does a DELETE (rather than a TRUNCATE) and an INSERT INTO so that everything can be performed in a single transaction (see the sketch below).
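
For illustration, here is a minimal sketch of the per-wiki import loop described in step 5. This is not the actual load-datasets.py: the lr_checksums table name, the helper arguments, and the handling of the model dataset (which the real script stores differently) are hypothetical.

DATASETS = ["anchors", "redirects", "pageids", "w2vfiltered", "model"]

def import_wiki(conn, wiki, remote_checksums, rows_by_dataset):
    # conn is a pymysql connection with the default autocommit=False;
    # remote_checksums and rows_by_dataset stand in for the downloaded dumps.
    with conn.cursor() as cursor:
        for dataset in DATASETS:
            table = "lr_{}_{}".format(wiki, dataset)
            cursor.execute(
                "SELECT checksum FROM lr_checksums WHERE lookup = %s", (table,)
            )
            row = cursor.fetchone()
            if row and row[0] == remote_checksums[dataset]:
                continue  # checksum matches, dataset is already up to date
            # DELETE rather than TRUNCATE so the refresh stays inside one transaction
            cursor.execute("DELETE FROM {}".format(table))
            cursor.executemany(
                "INSERT INTO {} (lookup, value) VALUES (%s, %s)".format(table),
                rows_by_dataset[dataset],
            )
            cursor.execute(
                "REPLACE INTO lr_checksums (lookup, checksum) VALUES (%s, %s)",
                (table, remote_checksums[dataset]),
            )
    # One commit per wiki, after all five datasets, so readers never see a partial state
    conn.commit()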

So with that setup, how would we implement your proposal? Would we have some settings table in the addlink database that keeps track of which database server (side note: a completely different server? or just a different database on m2-master.eqiad.wmnet?) is active and which is passive? load-datasets.py would then know to run the import on the passive database and then be responsible for updating the settings table to make the updated database be the active one?

Otherwise it is a question of why the importing logic is trying to write to the same set of rows twice, and of avoiding that (it depends on the actual queries done). Sometimes it is a question of a lack of proper indexing to avoid extra locking.

There is just a single process (load-datasets.py) doing writes; no other process should be attempting writes (reads, yes, but not writes) during the execution of load-datasets.py. The code that is failing is a simple DELETE FROM query, but it's also erroring on a table with 1,284,362 rows. Would adding a primary key to the tables or doing DELETE FROM in batches of 10,000 help?
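
For reference, a batched delete along those lines could look roughly like this. It is a hypothetical sketch, not existing code, and assumes the tables are plain InnoDB tables that a DELETE ... LIMIT can chew through in chunks:

def delete_in_batches(conn, table, batch_size=10000):
    # Delete rows in chunks so no single statement holds locks on the whole
    # table; each chunk is committed separately.
    with conn.cursor() as cursor:
        while True:
            deleted = cursor.execute(
                "DELETE FROM {} LIMIT %s".format(table), (batch_size,)
            )
            conn.commit()
            if deleted < batch_size:
                break

Note that committing per batch gives up the single-transaction guarantee the current script relies on, so it would need some other way to keep readers from seeing a half-empty table.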

Let's focus on the immediate issues first, later on better architecture.

Would adding a primary key to the tables or doing DELETE FROM in batches of 10,000 help?

All tables related to mediawiki development must have primary keys by policy: https://www.mediawiki.org/wiki/MediaWiki_database_policy#Schema_changes
This is because otherwise they will create data inconsistency problems and a maintenance nightmare. They also have benefits for the application logic.

Bad indexing can cause contention. It could also be related to schema changes, which can get blocked by reads due to metadata locking. It is difficult to say without knowing the queries you run; please list all queries you run, in order, including the start and finish of your transactions. Adding a logger.debug() to your query execution function may help with that.

the codfw cronjob errors out because the database is in read-only mode

You should only have 1 point of entry IF using m2: either write on eqiad or on codfw, depending on which is the active datacenter for the service. Otherwise, you would get double the writes if they didn't fail, as m2 data is replicated to the other datacenter. Again, it is ok to set up something different outside of m2, and that is why I suggested an active/passive setup (per datacenter, even) so you don't have any downtime when importing. But that is a different conversation.

The main issue with this setup is that you seem to be connecting to a remote datacenter without using TLS. This is not allowed. Please disable cross-dc communication for now and we can later rethink the architecture.

Change 676012 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[research/mwaddlink@main] [DNM] Debug output from queries

https://gerrit.wikimedia.org/r/676012

Change 676012 abandoned by Kosta Harlan:

[research/mwaddlink@main] [DNM] Debug output from queries

Reason:

https://gerrit.wikimedia.org/r/676012

Let's focus on the immediate issues first, later on better architecture.

Would adding a primary key to the tables or doing DELETE FROM in batches of 10,000 help?

All tables related to mediawiki development must have primary keys by policy: https://www.mediawiki.org/wiki/MediaWiki_database_policy#Schema_changes
This is because otherwise they will create data inconsistency problems and a maintenance nightmare. They also have benefits for the application logic.

OK. Just to clarify, this is not about MediaWiki tables, but tables owned by the link recommendation service. That policy seems specific to tables created in MediaWiki core + extensions.

Bad indexing can cause contention. It could also be related to schema changes, which can get blocked by reads due to metadata locking. It is difficult to say without knowing the queries you run; please list all queries you run, in order, including the start and finish of your transactions. Adding a logger.debug() to your query execution function may help with that.

P15094 is from running this locally. The associated patch that generated the output is here: https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/676012 Note that I halted execution, but the import (both for updates and loading new datasets) runs without issue locally.

the codfw cronjob errors out because the database is in read-only mode

You should only have 1 point of entry IF using m2: either write on eqiad or on codfw, depending on which is the active datacenter for the service. Otherwise, you would get double the writes if they didn't fail, as m2 data is replicated to the other datacenter. Again, it is ok to set up something different outside of m2, and that is why I suggested an active/passive setup (per datacenter, even) so you don't have any downtime when importing. But that is a different conversation.

To clarify, there is just one point of entry, and it varies on which datacenter is active, yes.

The main issue with this setup is that you seem to be connecting to a remote datacenter without using TLS. This is not allowed.

Where do you see this?

Please disable cross-dc communication for now and we can later rethink the architecture.

I'm sorry, I don't understand what you'd like us to do here. What cross DC communication are you referring to?

To clarify, there is just one point of entry, and it varies on which datacenter is active, yes.

So then I am misunderstanding, because I understood that codfw was getting write errors (writes were being attempted there). In the current configuration, writes should only be attempted at the active datacenter (eqiad, at the time).

I assumed that writes from the analytics hosts happen across DCs. If writes only happen locally within the same datacenter, that solves the TLS issue, but not the double write attempts I mentioned before.

P15094

This is useful, but the cursor stuff is not relevant for the server side of things (that is just how the client organizes data). I would like to know if you run commit at some point or use autocommit mode. I am unsure with just the information given, but it looks as if you are potentially leaving a transaction open (including the locks generated on the data, which would include gaps due to the lack of indexing) and then trying another write (it depends on whether you are using autocommit mode or not). This is just an initial guess; I need more info.

IF you happen to have a different transaction isolation level/binlog format, it would explain why it works locally and not on our production dbs.

To clarify, there is just one point of entry, and it varies on which datacenter is active, yes.

So then I am misunderstanding, because I understood that codfw was getting write errors (writes were being attempted there). In the current configuration, writes should only be attempted at the active datacenter (eqiad, at the time).

I assumed that writes from the analytics hosts happen across DCs. If writes only happen locally within the same datacenter, that solves the TLS issue, but not the double write attempts I mentioned before.

To clarify, there are no writes to m2 happening from the analytics hosts. On stat1008, we process some data and import it into the staging database on stat1008. Then we run a mysqldump command to get the data into gzip files, which are copied to analytics.wikimedia.org, from where we download them over HTTP in the kubernetes cluster. Then the mwaddlink app, running in kubernetes, performs the writes to the m2 database.

P15094

This is useful, but the cursor stuff is not relevant for the server side of things (that is just how the client organizes data). I would like to know if you run commit at some point or use autocommit mode. I am unsure with just the information given, but it looks as if you are potentially leaving a transaction open (including the locks generated on the data, which would include gaps due to the lack of indexing) and then trying another write (it depends on whether you are using autocommit mode or not). This is just an initial guess; I need more info.

We are using pymysql's default of autocommit=False (https://pymysql.readthedocs.io/en/latest/modules/connections.html), and then calling commit() only after all datasets for a wiki have been processed (several million DELETE FROM and INSERT INTO statements): https://gerrit.wikimedia.org/r/plugins/gitiles/research/mwaddlink/+/refs/heads/main/load-datasets.py#345

All tables related to mediawiki development must have primary keys by policy: https://www.mediawiki.org/wiki/MediaWiki_database_policy#Schema_changes
This is because otherwise they will create data inconsistency problems and a maintenance nightmare. They also have benefits for the application logic.

FWIW all the tables are basically cache. Data is read from the stats server, processed, and then inserted as key-value data into the tables read by the link recommendation service. On the next import, all data is discarded and replaced by new data extracted from the stats server. All reads are simple key lookups. So consistency or complexity of schema changes is not much of a concern.

FWIW all the tables are basically cache. Data is read from the stats server, processed, and then inserted as key-value data into the tables read by the link recommendation service. On the next import, all data is discarded and replaced by new data extracted from the stats server. All reads are simple key lookups. So consistency or complexity of schema changes is not much of a concern.

Fixing inconsistencies caused by nondeterministic writes on replication is a concern for us, though (it can create a loss of redundancy for unrelated services that also use m2) :-). Having primary keys also helps with extra locking, which is not the problem here, but could help in other cases and is semi-relevant.

I've reproduced the problem:

connection 1>CREATE TABLE IF NOT EXISTS lr_cswiki_anchors (  `lookup` TEXT NOT NULL,  `value` LONGBLOB NOT NULL,  INDEX `lookup_index` (`lookup`(767))) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
MariaDB [test]> START TRANSACTION;
Query OK, 0 rows affected (0.000 sec)

connection 1> INSERT INTO `lr_cswiki_anchors` VALUES ('!','\x80\x03}q\\0(X\x0c\\0\\0\\0Vyk\xc5\x99i\xc4\x8dn\xc3\xadkq\x01K\x0fX\\n\\0\\0\\0Faktori\xc3\xa1lq\x02K\x03u.');
Query OK, 1 row affected (0.000 sec)

connection 1> INSERT INTO `lr_cswiki_anchors` VALUES ('!!!','\x80\x03}q\\0X\x03\\0\\0\\0!!!q\x01K\x06s.');\n"
Query OK, 1 row affected (0.000 sec)

connection 1> INSERT INTO `lr_cswiki_anchors` VALUES ('sdfsdfdsf', 'sdfsdfsdf');
Query OK, 1 row affected (0.000 sec)

connection 1> COMMIT;
Query OK, 0 rows affected (0.009 sec)

connection 1> START TRANSACTION;
Query OK, 0 rows affected (0.000 sec)

connection 1> DELETE FROM lr_cswiki_anchors;
Query OK, 3 rows affected (0.000 sec)

connection2> INSERT INTO `lr_cswiki_anchors` VALUES ('sdfsdfdsf2', 'sdfsdfsdf2');
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

You are not properly closing your transactions after every write, leaving the transaction open and blocking subsequent writes. The details of why it is happening are something you will have to research; most likely, closing a connection without running commit first leaves things without garbage collection being done. Run commit to close a transaction and make sure you unlock the records.

The debugging (ignoring cursor and adding commits) will help you keep track of your open transactions. Consider using explicit opening and closing of transactions to make your life easier, but that is not required.

I can provide server-side debugging if necessary (I am not sure you have SHOW ENGINE INNODB STATUS; privileges to print information about open transactions and their locks, but you will have it locally).

WARNING: CREATE/ALTER/etc. have implicit COMMITs run before and after execution, and that may be misleading you about what is happening (they are not transactional on MySQL).
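
As an illustration of that advice, explicit transaction boundaries with pymysql could look roughly like the sketch below. This is only an illustration of the suggestion, not the fix that was eventually deployed; do_import is a hypothetical callback standing in for the per-wiki DELETE/INSERT logic.

def import_with_explicit_transaction(conn, do_import):
    # conn is a pymysql connection; begin() opens an explicit transaction.
    conn.begin()
    try:
        do_import(conn)   # the DELETE FROM + INSERT INTO statements for one wiki
        conn.commit()     # releases the row/gap locks taken by the writes
    except Exception:
        conn.rollback()   # a failed run should not leave locks behind
        raise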

You are not properly closing your transactions after every write, leaving the transaction open and blocking subsequent writes. The details of why it is happening are something you will have to research; most likely, closing a connection without running commit first leaves things without garbage collection being done. Run commit to close a transaction and make sure you unlock the records.

Thanks for the debugging. I'll take a closer look. What should be happening for 5 tables is:

  • START TRANSACTION
  • DELETE FROM
  • INSERT INTO

And COMMIT should only get called when all 5 tables have been updated. We don't want to commit and leave the datasets in an inconsistent state if something fails halfway through.

That is doable, I think. I am thinking maybe some of the disconnections without commits may try to rollback, rather than commit the changes, and that leads to locks. It needs more debugging.

Change 676042 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[research/mwaddlink@main] Add primary key to tables

https://gerrit.wikimedia.org/r/676042

Change 676042 merged by jenkins-bot:

[research/mwaddlink@main] Add primary key to tables

https://gerrit.wikimedia.org/r/676042

@jcrespo could you please grant ALTER privileges to the adminlinkrecommendation user on m2, so that we can alter the tables to add the primary key?

@jcrespo could you please grant ALTER privileges to the adminlinkrecommendation user on m2, so that we can alter the tables to add the primary key?

Request with DBA tag made in T279053: Grant ALTER privileges to adminlinkrecommendation user on m2
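
For context, the kind of DDL this privilege would allow is roughly the following. This is a hypothetical example; the actual column name and definition in the merged "Add primary key to tables" patch may differ.

def add_primary_key(conn, table):
    # ALTER TABLE is DDL and triggers an implicit commit in MySQL/MariaDB,
    # so it cannot be part of the import transaction.
    with conn.cursor() as cursor:
        cursor.execute(
            "ALTER TABLE {} ADD COLUMN id INT UNSIGNED NOT NULL "
            "AUTO_INCREMENT PRIMARY KEY FIRST".format(table)
        )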

Change 676330 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[research/mwaddlink@main] Avoid parallel connections

https://gerrit.wikimedia.org/r/676330

Change 676330 merged by jenkins-bot:

[research/mwaddlink@main] Avoid parallel connections

https://gerrit.wikimedia.org/r/676330

Change 676340 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[research/mwaddlink@main] Begin a new transaction for each dataset import

https://gerrit.wikimedia.org/r/676340

Change 676340 merged by jenkins-bot:

[research/mwaddlink@main] Begin a new transaction for each dataset import

https://gerrit.wikimedia.org/r/676340

Summary of what we've committed and deployed so far:

  • Begin a new transaction (BEGIN) when iterating over each wiki; COMMIT is called after all 5 datasets have had their DELETE FROM and INSERT INTO queries run
  • Avoid keeping a connection open while creating/altering tables
  • Add a primary key to all tables

Pending patches:

  • Use a single MySQL connection for the table creation/alteration work at the beginning of load-datasets.py

I'm not sure any of these will actually resolve the problem, but we'll see once our user has ALTER privileges (T279053).

Change 676459 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[research/mwaddlink@main] load-datasets: Use a single MySQL connection object

https://gerrit.wikimedia.org/r/676459

Change 676459 merged by jenkins-bot:

[research/mwaddlink@main] load-datasets: Use a single MySQL connection object

https://gerrit.wikimedia.org/r/676459

Change 677059 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[research/mwaddlink@main] Make import schema b/c hack work with dumps with ids in them

https://gerrit.wikimedia.org/r/677059

Change 677059 abandoned by Gergő Tisza:

[research/mwaddlink@main] Make import schema b/c hack work with dumps with ids in them

Reason:

https://gerrit.wikimedia.org/r/677059

Change 677493 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] linkrecommendation: Enable profiling on staging

https://gerrit.wikimedia.org/r/677493

Change 677493 merged by jenkins-bot:

[operations/deployment-charts@master] linkrecommendation: Enable profiling on staging

https://gerrit.wikimedia.org/r/677493

kostajh changed the task status from Open to Stalled. Apr 8 2021, 8:17 PM
kostajh moved this task from In Progress to QA on the Growth-Team (Sprint 0 (Growth Team)) board.

This is stalled pending the completion of T279427: Republish datasets with primary key ID column included and T279053: Grant ALTER privileges to adminlinkrecommendation user on m2. Hopefully, though, once those two tasks are completed, all that is left to do in this task is a review of the logs of the kubernetes container doing the dataset imports to see that the datasets have been loaded and are up-to-date for all wikis.

That can be done with:

  • ssh to deployment server
  • kube_env linkrecommendation eqiad
  • kubectl get pods
  • kubectl logs -f linkrecommendation-production-load-datasets-1617912000-ptdtz

kostajh changed the task status from Stalled to Open. Apr 13 2021, 10:56 AM

Let's see what happens on the next import attempt.

Made it past the first hurdle:

== Importing datasets (anchors, redirects, pageids, w2vfiltered, model) for cswiki ==
   Verifying file and checksum exists for anchors...[OK]
   Verifying checksum for anchors...[OK]
   Verifying file and checksum exists for redirects...[OK]
   Verifying checksum for redirects...[OK]
   Verifying file and checksum exists for pageids...[OK]
   Verifying checksum for pageids...[OK]
   Verifying file and checksum exists for w2vfiltered...[OK]
   Verifying checksum for w2vfiltered...[OK]
   Verifying file and checksum exists for model...[OK]
   Verifying checksum for model...[OK]
   Processing dataset: anchors
     Deleting all values from lr_cswiki_anchors...[OK]
     Inserting content into lr_cswiki_anchors...[OK]
       670076 rows inserted
     Updating stored checksum...[OK]
   Processing dataset: redirects
     Deleting all values from lr_cswiki_redirects...[OK]
     Inserting content into lr_cswiki_redirects...[OK]
       301707 rows inserted
     Updating stored checksum...[OK]
   Processing dataset: pageids
     Deleting all values from lr_cswiki_pageids...[OK]
     Inserting content into lr_cswiki_pageids...[OK]
       475324 rows inserted
     Updating stored checksum...[OK]
   Processing dataset: w2vfiltered
     Deleting all values from lr_cswiki_w2vfiltered...[OK]
     Inserting content into lr_cswiki_w2vfiltered...[OK]
       431839 rows inserted
     Updating stored checksum...[OK]
   Processing dataset: model
     Inserting link model...[OK]
     Updating stored checksum...[OK]
   Committing...[OK]
   Finished importing for cswiki!
== Attempting to download datasets (anchors, redirects, pageids, w2vfiltered, model) for simplewiki ==
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/simplewiki/lr_simplewiki_anchors.sql.gz...[OK]
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/simplewiki/lr_simplewiki_redirects.sql.gz...[OK]
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/simplewiki/lr_simplewiki_pageids.sql.gz...[OK]
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/simplewiki/lr_simplewiki_w2vfiltered.sql.gz...[OK]
   Downloading dataset http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/simplewiki/simplewiki.linkmodel.json...[OK]
== Importing datasets (anchors, redirects, pageids, w2vfiltered, model, anchors, redirects, pageids, w2vfiltered, model) for simplewiki ==
   Verifying file and checksum exists for anchors...[OK]
   Verifying checksum for anchors...[OK]
   Verifying file and checksum exists for redirects...[OK]
   Verifying checksum for redirects...[OK]
   Verifying file and checksum exists for pageids...[OK]
   Verifying checksum for pageids...[OK]
   Verifying file and checksum exists for w2vfiltered...[OK]
   Verifying checksum for w2vfiltered...[OK]
   Verifying file and checksum exists for model...[OK]
   Verifying checksum for model...[OK]
   Verifying file and checksum exists for anchors...[OK]
   Verifying checksum for anchors...[OK]
   Verifying file and checksum exists for redirects...[OK]
   Verifying checksum for redirects...[OK]
   Verifying file and checksum exists for pageids...[OK]
   Verifying checksum for pageids...[OK]
   Verifying file and checksum exists for w2vfiltered...[OK]
   Verifying checksum for w2vfiltered...[OK]
   Verifying file and checksum exists for model...[OK]
   Verifying checksum for model...[OK]
   Processing dataset: anchors
     Deleting all values from lr_simplewiki_anchors...[OK]
     Inserting content into lr_simplewiki_anchors...[OK]
       147884 rows inserted
     Updating stored checksum...[OK]
   Processing dataset: redirects
     Deleting all values from lr_simplewiki_redirects...[OK]
     Inserting content into lr_simplewiki_redirects...[OK]
       69424 rows inserted
     Updating stored checksum...[OK]
   Processing dataset: pageids
     Deleting all values from lr_simplewiki_pageids...[OK]
     Inserting content into lr_simplewiki_pageids...[OK]
       184315 rows inserted
     Updating stored checksum...[OK]
   Processing dataset: w2vfiltered
     Deleting all values from lr_simplewiki_w2vfiltered...[OK]
     Inserting content into lr_simplewiki_w2vfiltered...[OK]
       146481 rows inserted
     Updating stored checksum...[OK]
   Processing dataset: model
     Inserting link model...[OK]
     Updating stored checksum...[OK]
   Processing dataset: anchors
     Deleting all values from lr_simplewiki_anchors...[OK]

It's just very slow; I suppose that's to be expected. It took about 30 minutes for the lr_cswiki_anchors table to refresh, and we have 9 more wikis * 4 tables each to update, so hopefully the datasets will be done updating by sometime tomorrow.

All datasets are done importing.

kostajh added a subscriber: Etonkovidova.

(@Etonkovidova there was nothing here to QA really so I thought I'd close it.)