Page MenuHomePhabricator

HELP! Database is getting Slow: A test which took less than 10 min, now it takes 3 hours. I cannot do my experiments.
Closed, ResolvedPublic

Description

Hello,

I am opening this thread as a suggestion from the IRC channel Wikimedia Labs, where I exposed the problem I encountered.

I told them:
"I was just running an script I made in march-april which runs into several wp languages databases, separately and at the same time using "union all".
I am surprised because I notice it is much slower than before. What have happened? And, could u speed it up somehow by changing some user parameters? Otherwise it is impossible to run the tests..."

Actually, it took three hours for a test it used to run with less than 10 min. I used gnwiki (only 700 articles), which is my small test wiki. I see it would be impossible to test middle or big wikis...

My code does run several wikis for a list of articles and checks its existance in other articles and how editors edit them. I study multilingual behavior.

List of queries, for example:

'SELECT DISTINCT rev_user_text, count(*) FROM '+lang+'_p.revision WHERE rev_page = (SELECT page_id FROM '+lang+'_p.page WHERE page_title = %s AND page_namespace=0 AND page_is_redirect=0) GROUP BY rev_user_text'

'SELECT rev_user_text, COUNT(*) FROM revision INNER JOIN page ON rev_page=page_id WHERE page_namespace=0 AND page_is_redirect=0 GROUP BY rev_user_text ORDER BY 2 DESC'

Checking number of langlinks...
'SELECT ll_lang, COUNT(*) FROM page INNER JOIN langlinks ON ll_from=page_id WHERE page_id IN (SELECT page_id FROM u3532__.'+originarylang+table+') AND page_is_redirect=0 AND page_namespace=0 GROUP by 1 ORDER BY 2 DESC'

Checking the names of an article in several languages through langlinks...
'SELECT ll_lang,ll_title FROM langlinks WHERE ll_from = %s'

Checking "edit_count" in multiple language editions for a user with the same name.

    query = query + 'SELECT "'+language+'",user_editcount FROM '+language+'_p.user WHERE user_name LIKE %s '
    if count < len(languagelist): query = query + 'UNION ALL '
query = query + 'ORDER BY user_editcount DESC'

This last one, when does UNION ALL for 10 wiki databases is where it struggles most.
What could you do to improve the system or at least go back to the configuration which was working in march-april?

Thank you very much.

Marc Miquel

Event Timeline

marcmiquel raised the priority of this task from to Needs Triage.
marcmiquel updated the task description. (Show Details)
marcmiquel added a subscriber: marcmiquel.
mysql:wikiadmin@db1052 [enwiki]> EXPLAIN SELECT DISTINCT rev_user_text, count(*) FROM revision WHERE rev_page = (SELECT page_id FROM page WHERE page_title = 'Paris' AND page_namespace=0 AND page_is_redirect=0) GROUP BY rev_user_text;
+------+-------------+----------+-------+--------------------------------------------+------------+---------+-------------+-------+----------------------------------------------+
| id   | select_type | table    | type  | possible_keys                              | key        | key_len | ref         | rows  | Extra                                        |
+------+-------------+----------+-------+--------------------------------------------+------------+---------+-------------+-------+----------------------------------------------+
|    1 | PRIMARY     | revision | ref   | PRIMARY,page_timestamp,page_user_timestamp | PRIMARY    | 4       | const       | 35912 | Using where; Using temporary; Using filesort |
|    2 | SUBQUERY    | page     | const | name_title,page_redirect_namespace_len     | name_title | 261     | const,const |     1 |                                              |
+------+-------------+----------+-------+--------------------------------------------+------------+---------+-------------+-------+----------------------------------------------+
2 rows in set (0.05 sec)

mysql:wikiadmin@db1052 [enwiki]>

Using temporary and using filesort would generally be a good sign of a poorly indexed query, that is likely to be slow

mysql:wikiadmin@db1052 [enwiki]> explain SELECT rev_user_text, COUNT(*) FROM revision INNER JOIN page ON rev_page=page_id WHERE page_namespace=0 AND page_is_redirect=0 GROUP BY rev_user_text ORDER BY 2 DESC;
+------+-------------+----------+------+------------------------------------------------+-----------------------------+---------+---------------------+---------+----------------------------------------------+
| id   | select_type | table    | type | possible_keys                                  | key                         | key_len | ref                 | rows    | Extra                                        |
+------+-------------+----------+------+------------------------------------------------+-----------------------------+---------+---------------------+---------+----------------------------------------------+
|    1 | SIMPLE      | page     | ref  | PRIMARY,name_title,page_redirect_namespace_len | page_redirect_namespace_len | 5       | const,const         | 9791464 | Using index; Using temporary; Using filesort |
|    1 | SIMPLE      | revision | ref  | PRIMARY,page_timestamp,page_user_timestamp     | PRIMARY                     | 4       | enwiki.page.page_id |      13 |                                              |
+------+-------------+----------+------+------------------------------------------------+-----------------------------+---------+---------------------+---------+----------------------------------------------+
2 rows in set (0.01 sec)

mysql:wikiadmin@db1052 [enwiki]>

Have you tried using revision_userindex instead of revision?

Even though the enwiki_p revision table is enormous, there's a handy index on rev_user_text so that GROUP BYs are actually less horrible than you'd think. That said, on larger wikis, it's still a lot of data. Queries against enwiki_p involving revision/revision_userindex should take about 70 minutes or so, in my experience. If you're seeing worse results than that, there's probably a problem.

Looking at P997, it seems like this is a matter of counting entries in the revision table versus relying on the count in user.user_editcount?

Thanks for checking MZMcBride!

MZMcBride the most costful operation is the UNION ALL query. However, I
posted the other query (in which I check the number of edits per one
article per editor) because I am using it, but not as many as the UNION
ALL.

The main idea of my code is to obtain data from editors across languages.
That's why I am ranking number of edits per editor, to see how active they
are in different wikis.

This gets extremely complex because I need to check many wikis at the same
time. My concern is that the code was running smooth taking few minutes
(less than twenty) and now it takes 3h. Something might have changed...it's
the same code.

Marc

El dj., 16 jul. 2015 a les 22:00, MZMcBride (<
no-reply@phabricator.wikimedia.org>) va escriure:

MZMcBride added a comment.

Looking at https://phabricator.wikimedia.org/P997, it seems like this is
a matter of counting entries in the revision table versus relying on the
count in user.user_editcount?

TASK DETAIL

https://phabricator.wikimedia.org/T105964

EMAIL PREFERENCES

https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: jcrespo, MZMcBride
Cc: MZMcBride, Steinsplitter, Reedy, marcmiquel, Aklapper, Krenair,
Malyacko, P.Copp

jcrespo removed a project: DBA.

Nope, closed. Thank you

În Mie, 2 aug. 2017, 16:27 Marostegui, <no-reply@phabricator.wikimedia.org>
a scris:

Marostegui added a comment.

@marcmiquel https://phabricator.wikimedia.org/p/marcmiquel/ is this
still an issue?

*TASK DETAIL*
https://phabricator.wikimedia.org/T105964

*EMAIL PREFERENCES*
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

*To: *Marostegui
*Cc: *Marostegui, MZMcBride, Steinsplitter, Reedy, marcmiquel, Aklapper,
Tbscho, Minhnv-2809, Luke081515, Gryllida, scfc, Jay8g, Krenair, chasemp