
Reduce stress caused by IABot on the DB
Closed, Resolved · Public


As it stands, Cyberbot is incredibly efficient with the API, IA API, memory, and speed, but it is heavily dependent on its local DB. While the read operations aren't as expensive, and are done in one go, write operations are executed in a single query per row. Cyberbot has already received some updates to better detect when a write operation is needed but it is still pretty stressful. Cyberbot's write operation should be more streamlined to reduce the stress without dropping functionality, efficiency, and performance.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Apr 25 2016, 7:40 PM

@Volans I deployed an update just now that should help. Let me know if the stress has started to go down. I would like to bring the workers back up to 28 again and see if the stress is reduced.

Volans added a subscriber: jcrespo.May 1 2016, 10:39 PM

@Cyberpower678: will this change affect the usage of the Tool Labs DB enough to avoid what happened in T131937?
What was the change?

Adding @jcrespo and DBA for reference.

> @Cyberpower678: will this change affect the usage of the Tool Labs DB enough to avoid what happened in T131937?
> What was the change?
>
> Adding @jcrespo and DBA for reference.

It should, or at the very least, reduce the stress.

One thing I changed is the detection of when entries need to be UPDATEd; that should reduce the number of rows needing updates. The second major change is that, instead of issuing one query per row, I'm now INSERTing, UPDATEing, and DELETEing all the rows in 3 separate batched queries.

Volans added a comment.May 3 2016, 9:24 AM

How many rows are "all the rows"?

> How many rows are "all the rows"?

Well, it depends. When Cyberbot analyzes an article, it does a SELECT to batch-fetch the anticipated data, and does subsequent SELECTs for data it missed.

For data that's not in the DB yet, it creates an INSERT.
For data that's been altered, it does an UPDATE.
For data that was fetched but not used, it does a DELETE.

Formerly, every INSERT, UPDATE, and DELETE command used 1 query for every row updated/deleted/inserted.

The number of rows being manipulated depends on the article being analyzed. Every row is a link Cyberbot has analyzed in an article, so larger articles can have 200+ rows, while smaller articles may have significantly fewer.

The solution I have implemented as a first step is to lump all INSERTs into one query, all UPDATEs into one query, and all DELETEs into one query, which means there are no more than 3 write queries per article. Cyberbot won't execute an INSERT/UPDATE/DELETE command if there is no respective row to INSERT/UPDATE/DELETE. Based on my research, this should reduce the load on the CPU. You should have already noticed a small decline in Cyberbot's usage.
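The batching described above can be sketched roughly as follows. This is an illustration only, not Cyberbot's actual code: the table `externallinks` and its columns are invented, and real code must escape or parameterize values rather than interpolate them.

```python
# Sketch: collapse per-row writes into (at most) three batched statements
# per article. Hypothetical table externallinks(url, status).

def build_batched_queries(to_insert, to_update, to_delete):
    """Return up to three SQL statements covering all of an article's rows.

    to_insert / to_update: lists of (url, status) pairs.
    to_delete: list of urls.
    A statement is only emitted when there are rows for it.
    """
    queries = []
    if to_insert:
        # One multi-row INSERT instead of one INSERT per row.
        values = ", ".join(
            "('%s', '%s')" % (url, status) for url, status in to_insert
        )
        queries.append(
            "INSERT INTO externallinks (url, status) VALUES %s" % values
        )
    if to_update:
        # One UPDATE with a CASE expression instead of one UPDATE per row.
        cases = " ".join(
            "WHEN '%s' THEN '%s'" % (url, status) for url, status in to_update
        )
        urls = ", ".join("'%s'" % url for url, _ in to_update)
        queries.append(
            "UPDATE externallinks SET status = CASE url %s END "
            "WHERE url IN (%s)" % (cases, urls)
        )
    if to_delete:
        # One DELETE over a url list instead of one DELETE per row.
        urls = ", ".join("'%s'" % u for u in to_delete)
        queries.append("DELETE FROM externallinks WHERE url IN (%s)" % urls)
    return queries
```

With this shape, an article that touches 200 links still produces at most three write statements, and an article with nothing to write produces none.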

Volans added a subscriber: bd808.May 3 2016, 11:27 PM

Batching some of the operations (but not too many) could help on the master, although I don't think it will help on the slave(s), given the ROW-based replication.
Making uncontrolled batches (whose size depends on the page contents, not the bot configuration) might also run into other limitations, like max_allowed_packet.
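The concern about uncontrolled batch sizes can be addressed by capping how many rows go into a single statement, so that one very large article never produces a query bigger than server limits such as max_allowed_packet. A minimal sketch (the cap of 100 is an arbitrary placeholder, not a recommended value):

```python
def chunk_rows(rows, max_rows=100):
    """Split an unbounded row list into bounded batches so a single
    statement's size is controlled by the bot's configuration rather
    than by the page contents. Illustrative only."""
    for i in range(0, len(rows), max_rows):
        yield rows[i:i + max_rows]
```

Each chunk would then be turned into its own batched INSERT/UPDATE/DELETE statement.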

Are you confident those changes will not overload the Tool Labs DBs again?
I cannot see any improvement since May 1st, but at the current concurrency IABot was not creating issues, so it is hard to see an impact, given that these are shared resources.

We could perform a test, but if it overloads the DB again you'll have to reduce the workers and/or we'll have to put the throttle back.

I'd like to involve also Community-Tech-Tool-Labs (CCing @bd808), here's a quick summary:

  • IABot had a high impact on the Tools Labs DBs ~1 month ago after increasing its workers; see T131937
  • @Cyberpower678 has a new version with some improvements in the way the DB is used, although the amount of data written is more or less the same (correct me if I misunderstood)
  • Given the requirements, it might need dedicated resources, or we should evaluate alternative options/approaches

Could you have a look?

bd808 added a comment.May 3 2016, 11:52 PM

Code changes mentioned by @Cyberpower678 seem to be

The change looks to have moved from using many individual INSERT, UPDATE, and DELETE queries to one batched version of each query. The ultimate execution is still done in one mysqli_multi_query() call that runs all of the statements as one batch of work. This may lighten the processing overhead slightly on the master, but as @Volans notes, with ROW-based replication the slave load will not be significantly reduced as the changes are propagated. Any real savings on the DB side would likely come from the removal of unnecessary UPDATE calls for unchanged links.

If we were writing something like this as a maintenance script to execute on the WMF production cluster it would implement an internal batch limiting step to ensure that at most N rows were changed before checking to ensure that replication has caught up (i.e. wfGetLBFactory()->waitForReplication()).
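The batch-limiting pattern bd808 describes can be sketched generically. This is a hypothetical illustration: `write_batch` and `wait_for_replication` are caller-supplied callables standing in for the real operations (in MediaWiki maintenance scripts the latter is `wfGetLBFactory()->waitForReplication()`).

```python
def write_in_batches(rows, write_batch, wait_for_replication, batch_size=500):
    """Apply writes in bounded batches, pausing after each batch until
    the replicas have caught up, so a large job never lets replication
    lag grow without bound. Illustrative sketch only."""
    for start in range(0, len(rows), batch_size):
        # Write at most batch_size rows...
        write_batch(rows[start:start + batch_size])
        # ...then block until replication has caught up before continuing.
        wait_for_replication()
```

The key design point is that the pause happens between batches, not per row, so throughput stays reasonable while lag stays bounded.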

I would like to perform a test to see how much of a difference this change has on the DB. I will be responsive in the event I need to reduce the workers again.

@Volans can I move forward with the testing?

Volans added a comment.May 5 2016, 7:43 AM

@Cyberpower678 I'm available today on EU time; if you ping me on IRC (volans) I can keep an eye on the Tool Labs DBs while you perform the test.

Cyberpower678 added a comment.EditedMay 5 2016, 3:18 PM

@Volans I don't see you online.

Volans added a comment.May 5 2016, 5:15 PM

@Cyberpower678 sorry I missed this notification, but you found me anyway :)

Quick summary

We did the test running with full parallelism and then had to put it back to its original concurrency, because it was not sustainable from the DB point of view.

In my opinion there are three possible ways here:

  1. find a completely different approach that achieves the same goal without requiring the same amount of data to be read/written
  2. get some dedicated hardware to move its DB to, with enough memory to fit its indexes
  3. optimize it and make it work with reduced concurrency, so that it doesn't affect the Tool Labs DB too much (i.e. a few workers that cycle through the sharded letters sequentially, one at a time)

As I said in T131937#2191339, there are some possible optimizations on the index side, but the problem today seemed to be mostly that written data also has to update all the indexes, and that reads use indexes that probably don't fit into memory.
Here are some index optimizations I've already suggested:

  • there is a redundant index on url; removing it will only save space on disk, since an unused index is not kept in memory
  • there is a unique key on url varchar(767), but it is also the first part of the primary key; why isn't it the primary key directly?
  • the primary key is very long and includes a varchar(767). Consider indexing some hash of it instead: in MySQL, the primary key is repeated in each secondary index.
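The last suggestion can be illustrated with a hash-derived surrogate key. This is a sketch of the idea, not the actual schema change that was made: SHA-1 is one common choice, and storing it as `UNHEX(SHA1(url))` in a `BINARY(20)` column keeps the key at a fixed 20 bytes instead of up to 767, which shrinks every secondary index, since InnoDB appends the primary key to each one.

```python
import hashlib

def url_hash(url):
    """Fixed-width surrogate key for a long URL column: the hex SHA-1
    digest of the URL. The same URL always hashes to the same value,
    so it can serve as a compact lookup key. Illustrative only."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()
```

The trade-off is that range scans and prefix lookups on the URL are no longer possible through the hashed key; it only supports exact-match lookups.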

P.S. @Cyberpower678 you should have received a notification from phab for a paste, if not feel free to ping me and I'll give you the link.

Well, it would seem I have discovered that some INSERTs are happening despite the data already being there, and some UPDATEs are happening when there is no need.

Clearly something wonky is happening here. Some rows are not being fetched when they should. I'm not sure if it's a problem with the DB or Cyberbot.

As for updates, definitely Cyberbot.

The DB will eventually need to be moved to dedicated hardware; there's no getting around that, and it should be done sooner rather than later. The DB is only going to get bigger, and hammered on more.

Regardless of all that, I will work on other ways to make the current method better.

DannyH moved this task from Estimated to To Be Estimated/Discussed on the Community-Tech board.

I've done some tweaks and fixes. The overall query count should go down by a lot now. I have completely removed duplicate inserts, and have fixed a few bugs that were flagging rows as needing updates unnecessarily. I'm running a test now and monitoring Grafana to see how much of an impact the bot is making.

With the new DB structure, it seems to not be increasing the load at all with 28 workers. Will leave running for a bit to see if it causes problems in the long run.

With the new DB structure agreed on with @Cyberpower678, the size on disk of the tables+indexes went from 37GB to just under 10GB, and the most common query has a clean EXPLAIN (just "Using where").

@Cyberpower678 restarted the bot with 28 processes around 17:50 UTC today and so far I don't see any big impact on the DB. This is also thanks to a bugfix that greatly reduced the number of insert/update operations the bot was doing before. Assuming, of course, that the bot is working as expected :-)

Cyberpower678 closed this task as Resolved.May 24 2016, 8:55 PM

Since DB usage is reasonably stable, and no noticeable side effects of the changes have surfaced, I'm closing this as resolved.

Luke081515 moved this task from Backlog to Done on the Community-Tech-Tool-Labs board.