
Improve performance of the Edit List
Closed, Declined · Public

Description

Per discussion at T194707#4449342 and elsewhere, the Edit List feature is very inefficient and does not scale.

First and foremost, it should be using the job queue like all the other queries do. This would involve a UI overhaul to do everything over AJAX: you'd load the page, see a "Loading..." message, and that would be replaced with the data once the queries finish.

Currently we get everything with one giant query. This consists of a subquery to get revisions for each individual wiki, UNIONed together, then sorted by timestamp. We could break them out into separate queries, but then you can't sort by timestamp, and LIMIT and OFFSET for pagination aren't easily achievable.
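To make the shape of the current approach concrete, here is an illustrative sketch (in Python, with hypothetical table and column names — not the actual EventRepository code) of how one subquery per wiki gets UNIONed into a single query and sorted globally:

```python
# Hypothetical sketch of the "one giant query" pattern: a subquery per
# wiki, UNIONed together, with a single global ORDER BY for pagination.
# Table/column names are made up for illustration.

def build_union_query(wiki_dbs, limit=50, offset=0):
    """Build one big query: one subquery per wiki, UNION ALLed, sorted globally."""
    subqueries = [
        f"(SELECT rev_id, rev_timestamp, '{db}' AS wiki FROM {db}.revision)"
        for db in wiki_dbs
    ]
    return (
        "\nUNION ALL\n".join(subqueries)
        + f"\nORDER BY rev_timestamp DESC LIMIT {limit} OFFSET {offset}"
    )

sql = build_union_query(["enwiki_p", "frwiki_p", "dewiki_p"])
```

With 250+ wikis the generated statement contains 250+ subqueries, which is why the single global sort becomes expensive.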

My thoughts are to enforce viewing revisions on a per-wiki basis. So at the top there will be a dropdown of each wiki, and you must select only one.

Event Timeline

> So you'd load the page, and there'd be text "Loading...", that would be replaced with the data once the queries finish.

What do you think about giving them the "Loading..." on the event page and redirecting once we've finished generating the data? I've seen a few users seem confused when they click "View All Data" and nothing happens. Not everyone pays attention to the browser's loading indicator.

> My thoughts are to enforce viewing revisions on a per-wiki basis. So at the top there will be a dropdown of each wiki, and you must select only one.

This sounds like a good idea to me. But we should think about what the behavior should be when the user wants to download the CSV or wikitable, and how to convey that behavior effectively to the user.

For context, could someone share a link to what the revision browser is?

@aezell Yes, a link would have been helpful, ha! Try this: https://tools.wmflabs.org/grantmetrics/programs/2018_Black_Lunch_Table/BLT_hosts_Art_and_Feminism_2018_@_Triangle/revisions That is for a smaller event on only one wiki, so it will load quickly.

The revision browser is accessible from the event page by clicking on the "View all data" button at the top-right.

> What do you think about giving them the "Loading..." on the event page and redirecting once we've finished generating the data? I've seen a few users seem confused when they click "View All Data" and nothing happens. Not everyone pays attention to the browser's loading indicator.

I was suggesting we'd bring them to the revision browser page, and that page says "Loading..." and fetches the data via AJAX. It should be clear what's going on, and it might even "feel" faster, even though it won't be.

@MusikAnimal Thanks for the link!

I was just looking here: https://github.com/wikimedia/grantmetrics/blob/master/src/AppBundle/Repository/EventRepository.php#L266

That seems to potentially build a really large SQL query. Have we considered running each of those subqueries separately and then storing the results in Redis[1] (which I see is already being used for caching)? Then we could combine all the results with a single Redis command that pulls back all the elements in a list at a specific key. We could maybe even use Redis to sort the elements on insertion, depending on what data type we use and how we create the keys. If we made sure to delete the data at the key afterwards, it's very cheap for Redis to handle that kind of transient data.

The idea being that we do a bunch of faster queries and just shelve the data rather than expecting the DB to do a huge list of sub-selects. Then, we combine it when we pull it back.

The best way to figure out what's happening with the current queries would be to dump some example queries and run them with EXPLAIN to see the cost. I could be wrong in thinking that the DB cost is what makes the page slow for some data sets.

[1] We could keep the results in memory, but you might run into issues if there's a lot of data to chew through.

Yeah, it IS a really large query, haha. That was done so we can sort by timestamp and do pagination with LIMIT and OFFSET.

Would we be able to sort by arbitrary fields through Redis (specifically timestamp), without pulling all that data back into memory? I also worry about storing that much data in Redis to begin with. The number of revisions could be in the tens of thousands, if not hundreds of thousands.

> The best way to figure out what's happening with the current queries would be to dump some example queries and run them with EXPLAIN to see the cost. I could be wrong in thinking that the DB cost is what makes the page slow for some data sets.

Unfortunately we can't run EXPLAIN on Toolforge queries without opening another connection to the same server and running EXPLAIN on the process ID. I built a tool to do this for you, https://tools.wmflabs.org/sql-optimizer, but it doesn't support queries with UNIONs.

We can examine a single subquery to get an idea of what's going on. This example actually runs fairly fast, around 5 seconds, but UNION it with another 250+ wikis (we have to query each wiki that the participants have collectively edited), and it gets quite slow. If you have any ideas on how to improve this subquery in particular, do tell! :)

I think there's a way to do this in Redis with Sorted Sets. That data type stores strings, each associated with a numeric score. The string could be a JSON representation or PHP serialization of a result row from the DB, and each value's score could be the revision's timestamp as epoch time. The whole data structure would then be ordered by this score. Effectively, in Redis you would end up with all the data you are currently getting, in an ordered data structure.

Then, you could use commands like [[ https://redis.io/commands/zrange | ZRANGE ]] or [[ https://redis.io/commands/zrangebyscore | ZRANGEBYSCORE ]] to grab the page of data that you want.

As far as getting the data in, you could add the items one at a time with [[ https://redis.io/commands/zadd | ZADD ]], but there are more interesting options like [[ https://redis.io/commands/zinterstore | ZINTERSTORE ]] and [[ https://redis.io/commands/zunionstore | ZUNIONSTORE ]], because they might make it pretty easy to drop the results of each subquery into the larger data set.

I'd have to play with it some to see if this idea would actually work, but it seems possible.

Eventually, we'd want to remove all this data, and removing large data objects in Redis is a challenge. I know that we are using Redis as a caching layer now, so maybe this deletion is somewhat rare, but individual members of a sorted set can't auto-expire the way simple key/values can (though the whole key can still be given a TTL).

> What do you think about giving them the "Loading..." on the event page and redirecting once we've finished generating the data? I've seen a few users seem confused when they click "View All Data" and nothing happens. Not everyone pays attention to the browser's loading indicator.

> I was suggesting we'd bring them to the revision browser page, and that page says "Loading..." and fetches the data via AJAX. It should be clear what's going on, and it might even "feel" faster, even though it won't be.

I'd lean toward whichever is easier to do: if showing "Loading..." on the event page and then redirecting is easier, let's do that. If using AJAX on the revision browser page is easier, let's do that. I agree it'll feel a bit faster with AJAX, but that's not a compelling enough reason if it's a lot more work.

@MusikAnimal I think we should break this ticket down into at least two:

  • Using job queue for revision browser
  • Changing the revision browser to work on a per-wiki basis

Thoughts?

MusikAnimal renamed this task from Improve performance of the revision browser to Improve performance of the Edit List. Jan 16 2019, 3:22 AM
MusikAnimal updated the task description. (Show Details)
MusikAnimal added a project: Event Metrics.

This specifically is about Event Metrics, so I've re-added the tag. I am going to close as declined, though, since we've unanimously decided we're not worrying about performance just yet, and things are better now anyway since we started using page IDs.

> First and foremost, it should be using the job queue like all the other queries are. This would involve a UI overhaul of doing everything over AJAX. So you'd load the page, and there'd be text "Loading...", that would be replaced with the data once the queries finish.

This is still relevant but is covered by some of the newer tasks.