As it stands, Cyberbot is incredibly efficient with the API, IA API, memory, and speed, but it is heavily dependent on its local DB. While the read operations aren't as expensive, and are done in one go, write operations are executed in a single query per row. Cyberbot has already received some updates to better detect when a write operation is needed but it is still pretty stressful. Cyberbot's write operation should be more streamlined to reduce the stress without dropping functionality, efficiency, and performance.
|Open||None||T119601 Certain tools users create multiple long running queries that take all memory and/or CPU from labsdb hosts, slowing it down and potentially crashing (tracking)|
|Resolved||None||T131937 Throttling Cyberbot tool user as it is consuming most of the CPU|
|Open||Cyberpower678||T120433 Migrate dead external links to archives|
|Resolved||Cyberpower678||T133584 Reduce stress caused by IABot on the DB|
It should, at the very least reduce the stress.
One thing I changed, is the detection of when entries need to be UPDATEd. That should reduce the number of rows needing updates. The second major thing is that instead of writing to 1 row for every query, I'm INSERTing, UPDATEing, and DELETEing all the rows in 3 separate queries.
Well it depends. When Cyberbot analyzes an article it does a SELECT to batch fetch anticipated, and does subsequent SELECTs for data it missed.
For data that's not in the DB yet, it creates an INSERT.
For data that's been altered, it does an UPDATE.
For data that was fetched but not used, it does a DELETE.
Formerly, every INSERT, UPDATE, and DELETE command used 1 query for every row updated/deleted/inserted.
The number or rows being manipulated is dependent on the article it analyzes. Every row is a link Cyberbot has analyzed in an article, as such larger articles can have up to 200+ rows in a single article, while smaller articles may have significantly less.
The solution I have implemented as a first step is to lump all INSERTS into one query, all UPDATES into one query, and all DELETES into one query, which means there aren't more than 3 write queries per article. Cyberbot won't execute and INSERT/UPDATE/DELETE command if there is no respective row to INSERT/UPDATE/DELETE. Having researched this it should drop load on the CPU. You should have already noticed a small decline in Cyberbot's usage.
Batching some of the operations (but not too many) could give some help on the master although I don't think on the slave(s) given the ROW based replication.
Making uncontrolled batches (depending on the page contents, not the bot configuration) might also incur into other limitations like max_allowed_packet or others.
Are you confident those changes will not overload tools labs DBs again?
I cannot see any improvement since May 1st, but at current concurrency IABot was not creating issues, hence is hard to see an impact given that those are shared resources.
We could perform a test but if it will overload again you'll have to reduce the workers and/or we'll have to put back the throttle.
- IABot had an high impact on Tools Labs DBs ~1 month ago after increasing it's workers, see T131937
- @Cyberpower678 has a new version with some improvements on the way the DB is used, although the amount of data written is more or less the same (correct me if I misunderstood)
- Given the requirements it might need dedicated resources or evaluate alternative options/approaches
Could you have a look?
Code changes mentioned by @Cyberpower678 seem to be https://github.com/cyberpower678/Cyberbot_II/commit/3186fe4a40e9ae35748c178a3609d78dc3950a02
The change looks to have functionally changed from using many individual INSERT, UPDATE and DELETE queries to one batched version of each query. The ultimate execution is still done in one mysqli_multi_query() call that runs all of the calls as one batch of work. This may lighten the processing overhead slightly on the master, but as @Volans notes with ROW based replication the slave load will not be significantly minimized as the changes are propagated. Any real savings on the DB side would likely be from the removal of unnecessary UPDATE calls for unchanged links.
If we were writing something like this as a maintenance script to execute on the WMF production cluster it would implement an internal batch limiting step to ensure that at most N rows were changed before checking to ensure that replication has caught up (i.e. wfGetLBFactory()->waitForReplication()).
@Cyberpower678 sorry I missed this notification, but you found me anyway :)
We did the test running with full parallelism and we had then to put it back to it's original concurrency because was not sustainable from the DB point of view:
- Tool Labs DB went from a loadavg of ~4 up to ~20 to stabilize itself around ~14, with I/O waits ~50%, see https://grafana.wikimedia.org/dashboard/db/server-board?from=1462459966791&to=1462465679917&var-server=labsdb1005&var-network=eth0
- the Innodb_pages_written and Innodb_pages_read increased by a factor of 3 (not spike, stable at 3x). The writes being the ones that impacted most the DB performances
In my opinion there are three possible ways here:
- find a completely different approach to achieve the same goal, that does not require the same amount of data to be read/written
- have some dedicated hardware where to move it's DB with enough memory to fit it's indexes
- optimize it and make it work with reduced concurrency so that it doesn't affect too much Tool Labs DB (i.e. few workers that cycle the sharded letters one at a time sequentially)
As I said in T131937#2191339 there are some optimizations on the index part, but the problem today seemed to be mostly data written that has to update also all the indexes and then data read with indexes probably not fitting into memory.
Here's some possible index optimizations I've already suggested:
- there is a redundant index on url, but it will save space only on disk, if not used is not in memory
- there is a unique key on urlvarchar(767) but is also the first part of the primary key, why it's not directly the primary key?
- the primary key is very long and has a varchar(767). Consider having some hash of it in the index instead. The primary key is repeated for each secondary index in MySQL.
P.S. @Cyberpower678 you should have received a notification from phab for a paste, if not feel free to ping me and I'll give you the link.
Well it would seem I have discovered that some INSERTS are happening despite the data being there, and some UPDATES are happening when there is no need.
Clearly something wonky is happening here. Some rows are not being fetched when they should. I'm not sure if it's a problem with the DB or Cyberbot.
As for updates, definitely Cyberbot.
The DB will eventually need to be moved to dedicated hardware, there's no getting around that, and it should be done sooner rather than later. The DB is only going to get bigger, and hammered on more.
Regardless of all that, I will work on other ways to make the current method better.
I've done some tweaks and fixes. The overall query count should go down by a lot now. I have completely removed duplicate inserts, and have fixed a few bugs that were flagging rows needing to be unnecessarily updated. I'm running a test now, and monitoring graphana to see how much of an impact the bot is making.
With the new DB structure agreed with @Cyberpower678 the size on disk of the tables+indexes went from 37GB to just under 10GB and the most common query has a clean explain (just Using where).
@Cyberpower678 re-started the bot with 28 processes around 17:50 UTC today and so far I don't see any big impact on the DB. This also thanks to a bugfix that reduced a lot the number of insert/update operations that the bot was doing before. Assuming of course the the bot is working as expected :-)