Page MenuHomePhabricator

Burst of connections on ruwikinews (s3)
Closed, ResolvedPublicSecurity

Description

For the last few days, we've seen a strange pattern in s3, where it has spikes of connections which causes slowdowns on MW:

For all those spikes, the queries I have seen being stuck are always the following and always on ruwikinews:
{P12497}

Right now, we just had another spike and I saw the connections real-time:

root@cumin1001:~# mysql.py -hdb2127 -e "show processlist" | grep Dyna
387981061	wikiuser	10.192.0.169:34066	ruwikinews	Query	52	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
387990014	wikiuser	10.192.16.18:53876	ruwikinews	Query	49	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
387991675	wikiuser	10.192.0.74:50774	ruwikinews	Query	54	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388000813	wikiuser	10.192.16.18:58468	ruwikinews	Query	46	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388004230	wikiuser	10.192.16.16:43216	ruwikinews	Query	39	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388006075	wikiuser	10.192.0.74:60304	ruwikinews	Query	36	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388007145	wikiuser	10.192.0.75:44164	ruwikinews	Query	35	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388007438	wikiuser	10.192.0.75:45230	ruwikinews	Query	53	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388008933	wikiuser	10.192.48.102:47602	ruwikinews	Query	50	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388008966	wikiuser	10.192.0.73:36132	ruwikinews	Query	37	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388010044	wikiuser	10.192.48.101:55732	ruwikinews	Query	47	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388010108	wikiuser	10.192.16.67:53434	ruwikinews	Query	47	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388010667	wikiuser	10.192.16.17:44212	ruwikinews	Query	30	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388010788	wikiuser	10.192.48.100:54416	ruwikinews	Query	45	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388011198	wikiuser	10.192.0.69:40746	ruwikinews	Query	37	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388011607	wikiuser	10.192.16.5:51930	ruwikinews	Query	44	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388011740	wikiuser	10.192.16.68:52432	ruwikinews	Query	44	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388011917	wikiuser	10.192.16.48:52130	ruwikinews	Query	43	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388012732	wikiuser	10.192.16.68:55476	ruwikinews	Query	29	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388012801	wikiuser	10.192.16.68:55676	ruwikinews	Query	41	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388012834	wikiuser	10.192.48.100:33156	ruwikinews	Query	41	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388013095	wikiuser	10.192.0.72:58694	ruwikinews	Query	33	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388014183	wikiuser	10.192.16.6:33602	ruwikinews	Query	33	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388014454	wikiuser	10.192.48.100:37726	ruwikinews	Query	37	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388014580	wikiuser	10.192.16.67:35172	ruwikinews	Query	37	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388014619	wikiuser	10.192.0.75:42876	ruwikinews	Query	37	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388014802	wikiuser	10.192.0.75:43520	ruwikinews	Query	36	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388014956	wikiuser	10.192.16.17:55610	ruwikinews	Query	37	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388015176	wikiuser	10.192.0.75:44540	ruwikinews	Query	36	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388015359	wikiuser	10.192.48.104:51134	ruwikinews	Query	36	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388016047	wikiuser	10.192.16.6:39942	ruwikinews	Query	35	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388016227	wikiuser	10.192.16.6:40482	ruwikinews	Query	34	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388017162	wikiuser	10.192.16.67:41766	ruwikinews	Query	33	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388017205	wikiuser	10.192.16.17:33972	ruwikinews	Query	32	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388017488	wikiuser	10.192.48.104:57378	ruwikinews	Query	32	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388017862	wikiuser	10.192.48.100:47356	ruwikinews	Query	23	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388018135	wikiuser	10.192.16.16:55116	ruwikinews	Query	30	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388018146	wikiuser	10.192.16.6:46968	ruwikinews	Query	31	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388018195	wikiuser	10.192.16.67:44400	ruwikinews	Query	31	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388018641	wikiuser	10.192.0.75:55136	ruwikinews	Query	30	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388019971	wikiuser	10.192.48.103:50462	ruwikinews	Query	27	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388020206	wikiuser	10.192.16.68:47204	ruwikinews	Query	27	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388020340	wikiuser	10.192.48.101:53686	ruwikinews	Query	27	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388020955	wikiuser	10.192.48.102:48446	ruwikinews	Query	25	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388021208	wikiuser	10.192.16.16:35144	ruwikinews	Query	25	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388021891	wikiuser	10.192.16.18:52918	ruwikinews	Query	24	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388021980	wikiuser	10.192.0.73:47628	ruwikinews	Query	24	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388022370	wikiuser	10.192.16.18:54344	ruwikinews	Query	23	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388022399	wikiuser	10.192.16.18:54412	ruwikinews	Query	8	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388023232	wikiuser	10.192.0.72:35724	ruwikinews	Query	21	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388023317	wikiuser	10.192.0.75:43656	ruwikinews	Query	21	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388023337	wikiuser	10.192.0.72:36008	ruwikinews	Query	21	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388023570	wikiuser	10.192.16.17:54354	ruwikinews	Query	20	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388024099	wikiuser	10.192.16.5:40490	ruwikinews	Query	20	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388024530	wikiuser	10.192.16.68:57922	ruwikinews	Query	4	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388024867	wikiuser	10.192.48.101:39290	ruwikinews	Query	17	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388025994	wikiuser	10.192.0.74:37914	ruwikinews	Query	15	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388026484	wikiuser	10.192.0.75:55814	ruwikinews	Query	14	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388027705	wikiuser	10.192.0.76:42602	ruwikinews	Query	12	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388027778	wikiuser	10.192.48.103:44516	ruwikinews	Query	11	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388028004	wikiuser	10.192.0.69:32790	ruwikinews	Query	11	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388028938	wikiuser	10.192.48.103:47500	ruwikinews	Query	9	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388029186	wikiuser	10.192.16.5:60388	ruwikinews	Query	8	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388029331	wikiuser	10.192.0.73:39988	ruwikinews	Query	8	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388029665	wikiuser	10.192.16.17:44570	ruwikinews	Query	8	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388029870	wikiuser	10.192.0.76:49180	ruwikinews	Query	7	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388029943	wikiuser	10.192.48.103:50534	ruwikinews	Query	7	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388030081	wikiuser	10.192.48.102:48454	ruwikinews	Query	7	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388030087	wikiuser	10.192.16.17:45890	ruwikinews	Query	7	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388032627	wikiuser	10.192.48.103:58554	ruwikinews	Query	1	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000
388032969	wikiuser	10.192.0.176:40182	ruwikinews	Query	0	Copying to tmp table	SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp	0.000

That query isn't super cheap, but it is also not extremely bad.

root@db2127.codfw.wmnet[ruwikinews]>  SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp  FROM `page` LEFT JOIN `flaggedpages` ON ((page_id = fp_page_id)) INNER JOIN `categorylinks` `c1` ON ((page_id = c1.cl_from) AND (c1.cl_to='Опубликовано')) INNER JOIN `categorylinks` `c2` ON ((page_id = c2.cl_from) AND (c2.cl_to='Материалы_PanARMENIAN.Net')) LEFT OUTER JOIN `categorylinks` `c3` ON ((page_id = c3.cl_from) AND (c3.cl_to='Не_публиковать')) LEFT OUTER JOIN `categorylinks` `c4` ON ((page_id = c4.cl_from) AND (c4.cl_to='Ожидаемые_события_по_датам')) LEFT OUTER JOIN `categorylinks` `c5` ON ((page_id = c5.cl_from) AND (c5.cl_to='Архивные_новости')) LEFT OUTER JOIN `categorylinks` `c6` ON ((page_id = c6.cl_from) AND (c6.cl_to='Викиновости_коротко'))   WHERE page_namespace = 0 AND (fp_stable IS NOT NULL) AND page_is_redirect = 0 AND c3.cl_to IS NULL AND c4.cl_to IS NULL AND c5.cl_to IS NULL AND c6.cl_to IS NULL  ORDER BY page_id DESC LIMIT 18;
+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| page_namespace | page_title                                                                                                                                                                             | cl_timestamp        |
+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
|              0 | В_США_10-летний_мальчик_собрал_более_$300_тысяч_на_бронежилеты_для_военных_собак                                                                                                       | 2020-09-05 14:05:53 |
|              0 | В_Минске_задержали_студентов,_вышедших_поддержать_задержанных_ранее_студентов                                                                                                          | 2020-09-05 14:05:24 |
|              0 | В_Армении_ещё_3_человека_умерли_от_алкогольного_отравления                                                                                                                             | 2020-09-05 11:15:34 |
|              0 | В_Арцахе_выявлено_2_новых_случая_коронавируса                                                                                                                                          | 2020-09-05 11:15:09 |
|              0 | Минобороны_РА:_Для_Армении_неприемлемы_попытки_вовлечения_новых_государств_в_карабахский_конфликт                                                                                      | 2020-09-05 10:35:09 |
|              0 | Россия_ввела_санкции_против_Порошенко_и_солиста_«Океана_Эльзы»                                                                                                                         | 2020-09-05 09:45:12 |
|              0 | В_Армении_могут_внедрить_плату_за_утилизацию_автомобилей                                                                                                                               | 2020-09-05 07:45:08 |
|              0 | В_Иране_более_200_человек_пострадали_в_результате_взрыва_баллона_с_хлором                                                                                                              | 2020-09-05 07:15:15 |
|              0 | В_Армении_осталось_менее_4000_активных_больных_коронавирусом                                                                                                                           | 2020-09-05 07:15:06 |
|              0 | Нарушителям_карантинного_режима_в_Армении_будут_грозить_штрафы_и_арест                                                                                                                 | 2020-09-04 15:15:05 |
|              0 | Военная_техника_США_прибыла_в_Грузию_для_участия_в_учениях_Noble_Partner_2020                                                                                                          | 2020-09-04 14:45:06 |
|              0 | В_США_7_полицейских_уволены_после_смерти_афроамериканца                                                                                                                                | 2020-09-04 14:35:09 |
|              0 | ВОЗ_не_ожидает_массовой_вакцинации_от_Covid-19_до_середины_2021_года                                                                                                                   | 2020-09-04 14:05:06 |
|              0 | NYT:_Трамп_может_вывести_США_из_НАТО                                                                                                                                                   | 2020-09-04 13:45:08 |
|              0 | Более_100_евродепутатов_требуют_провести_международное_расследование_отравления_Навального                                                                                             | 2020-09-04 12:55:08 |
|              0 | Российская_вакцина_от_Covid-19_поступила_в_медцентры                                                                                                                                   | 2020-09-04 12:15:07 |
|              0 | В_Грузии_за_сутки_выявлено_28_случаев_коронавируса                                                                                                                                     | 2020-09-04 12:05:10 |
|              0 | Число_случаев_коронавируса_в_Арцахе_превысило_300                                                                                                                                      | 2020-09-04 10:45:19 |
+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
18 rows in set (2.282 sec)

root@db2127.codfw.wmnet[ruwikinews]> explain SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp  FROM `page` LEFT JOIN `flaggedpages` ON ((page_id = fp_page_id)) INNER JOIN `categorylinks` `c1` ON ((page_id = c1.cl_from) AND (c1.cl_to='Опубликовано')) INNER JOIN `categorylinks` `c2` ON ((page_id = c2.cl_from) AND (c2.cl_to='Материалы_PanARMENIAN.Net')) LEFT OUTER JOIN `categorylinks` `c3` ON ((page_id = c3.cl_from) AND (c3.cl_to='Не_публиковать')) LEFT OUTER JOIN `categorylinks` `c4` ON ((page_id = c4.cl_from) AND (c4.cl_to='Ожидаемые_события_по_датам')) LEFT OUTER JOIN `categorylinks` `c5` ON ((page_id = c5.cl_from) AND (c5.cl_to='Архивные_новости')) LEFT OUTER JOIN `categorylinks` `c6` ON ((page_id = c6.cl_from) AND (c6.cl_to='Викиновости_коротко'))   WHERE page_namespace = 0 AND (fp_stable IS NOT NULL) AND page_is_redirect = 0 AND c3.cl_to IS NULL AND c4.cl_to IS NULL AND c5.cl_to IS NULL AND c6.cl_to IS NULL  ORDER BY page_id DESC LIMIT 18;
+------+-------------+--------------+--------+------------------------------------------------+--------------+---------+-----------------------------+--------+-----------------------------------------------------------+
| id   | select_type | table        | type   | possible_keys                                  | key          | key_len | ref                         | rows   | Extra                                                     |
+------+-------------+--------------+--------+------------------------------------------------+--------------+---------+-----------------------------+--------+-----------------------------------------------------------+
|    1 | SIMPLE      | c2           | ref    | PRIMARY,cl_timestamp,cl_sortkey                | cl_timestamp | 257     | const                       | 153660 | Using where; Using index; Using temporary; Using filesort |
|    1 | SIMPLE      | c1           | eq_ref | PRIMARY,cl_timestamp,cl_sortkey                | PRIMARY      | 261     | ruwikinews.c2.cl_from,const |      1 | Using where                                               |
|    1 | SIMPLE      | page         | eq_ref | PRIMARY,name_title,page_redirect_namespace_len | PRIMARY      | 4       | ruwikinews.c2.cl_from       |      1 | Using where                                               |
|    1 | SIMPLE      | c3           | eq_ref | PRIMARY,cl_timestamp,cl_sortkey                | PRIMARY      | 261     | ruwikinews.c2.cl_from,const |      1 | Using where; Using index; Not exists                      |
|    1 | SIMPLE      | c4           | eq_ref | PRIMARY,cl_timestamp,cl_sortkey                | PRIMARY      | 261     | ruwikinews.c2.cl_from,const |      1 | Using where; Using index; Not exists                      |
|    1 | SIMPLE      | c5           | eq_ref | PRIMARY,cl_timestamp,cl_sortkey                | PRIMARY      | 261     | ruwikinews.c2.cl_from,const |      1 | Using where; Using index; Not exists                      |
|    1 | SIMPLE      | c6           | eq_ref | PRIMARY,cl_timestamp,cl_sortkey                | PRIMARY      | 261     | ruwikinews.c2.cl_from,const |      1 | Using where; Using index; Not exists                      |
|    1 | SIMPLE      | flaggedpages | eq_ref | PRIMARY                                        | PRIMARY      | 4       | ruwikinews.c2.cl_from       |      1 | Using where                                               |
+------+-------------+--------------+--------+------------------------------------------------+--------------+---------+-----------------------------+--------+-----------------------------------------------------------+
8 rows in set (0.035 sec)

It is impossible say whether that is the cause of the slowdowns or the consequence (something else overloading the host and this query getting stuck).
However it looks like the spikes are following some sort of pattern

The query killer kicks in and start killing queries, and those are the ones always being killed.
This started happening on Sunday apparently and whilst yesterday it only hit early in the morning, this probably deserves some investigation as it is triggering MW alerts:

@Krinkle created a document on Sunday about this issue on Sunday when we first responded to it: https://docs.google.com/document/u/1/d/1fzvcma-05RUMuKzFvsOINEbygSSdq96FXz2HSO3TN10/edit#

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 8 2020, 5:42 AM

I have created this as a security task for now as we are not sure if this traffic is legit or not

Marostegui triaged this task as High priority.Sep 8 2020, 8:41 AM

This just happened again.

Same queries got stuck on the same wiki (ruwikinews)

For the record, this is what logstash shows:

message	   	[a03597f5-4456-48db-bb43-970d34ec7dcc] /wiki/%D0%92_%D0%9D%D0%B0%D0%B3%D0%BE%D1%80%D0%BD%D0%BE%D0%BC_%D0%9A%D0%B0%D1%80%D0%B0%D0%B1%D0%B0%D1%85%D0%B5_%D0%B7%D0%B0%D0%B2%D0%B5%D1%80%D1%88%D0%B8%D0%BB%D0%B8%D1%81%D1%8C_%D0%BA%D0%BE%D0%BC%D0%B0%D0%BD%D0%B4%D0%BD%D0%BE-%D1%88%D1%82%D0%B0%D0%B1%D0%BD%D1%8B%D0%B5_%D1%83%D1%87%D0%B5%D0%BD%D0%B8%D1%8F_%D0%90%D1%80%D0%BC%D0%B8%D0%B8_%D0%BE%D0%B1%D0%BE%D1%80%D0%BE%D0%BD%D1%8B_%D0%9D%D0%9A%D0%A0   WMFTimeoutException from line 41 of /srv/mediawiki/wmf-config/set-time-limit.php: the execution time limit of 60 seconds was exceeded
Joe added a comment.Sep 8 2020, 10:00 AM

The only thing vaguely related is a small spike in requests to ruwikinews from google bot, but nothing that would justify this.

Joe added a comment.Sep 8 2020, 10:03 AM

Also I notice now all the queries are coming from jobrunners, so we should look at jobs.

Another spike at the moment.

Urbanecm added a comment.EditedSep 8 2020, 11:45 AM

Those are apparently from dynamic page list, what about simply disabling that extension? It causes performance issues in the past, maybe ruwikinews grow up for that to be unbearable? See T14423 and T124841.

Another option is that _some_ DPL queries from ruwikinews are more expensive than the rest.

Also I notice now all the queries are coming from jobrunners, so we should look at jobs.

I assume that is because page cache invalidation happens in jobs, IIRC.

Those are apparently from dynamic page list, what about simply disabling that extension? It causes performance issues in the past, maybe ruwikinews grow up for that to be unbearable? See T14423 and T124841.

+1 we can try that and see if it ceases.

Another option is that _some_ DPL queries from ruwikinews are more expensive than the rest.

Also I notice now all the queries are coming from jobrunners, so we should look at jobs.

I assume that is because page cache invalidation happens in jobs, IIRC.

Huge performance issues caused by DPL is the reason we don't enable it in more wikis (while several wikis wanted to, here's an example T102821). I think we should just disable DPL until T124841: Address performance needs for Wikimedia from DynamicPageList extension so that it can be deployed to further wikis is addressed.

Another spike just hit s3.
Who could own the disablement of that extension? So at least we can know whether it is the culprit or not

https://www.mediawiki.org/wiki/Developers/Maintainers says editing team. Adding its PM and a couple of engineers from the team.

https://www.mediawiki.org/wiki/Developers/Maintainers says editing team. Adding its PM and a couple of engineers from the team.

Thank you for the ping, @Ladsgroup. Would it be accurate for me to understand the action item for our team as the following?

Determine whether there are any reasons not to disable DynamicPageList extension in order to see if it's the cause of the spike in queries y'all are observing. If we identify no such reasons, disable the extension.

@ppelberg thanks for coming back so fast! :)
Yeah, essentially we are seeing spikes of connections causing overload on our databases. They all seem to be hitting ruwikinews and they look related to that extension.
We are not 100% sure if this extension is the cause of the issues, or just something that is more visible because of something else causing the overload on the hosts.

We'd like to disable that extension and keep monitoring the hosts to see if that stops the problems we are seeing.

While we might be marked as maintainers, we're hardly more knowledgeable than anyone else, we "maintain" like 40 extensions and we've never worked on most of them.

That said: as far as I know, it can be safely disabled, that won't cause any loss of data or any other lasting damage. The dynamic lists will obviously stop rendering on pages that use them though, so someone should let the ruwikinews community know that we're doing this and why.

Also: I looked at the code briefly and there's an interesting config variable $wgDLPMaxCacheTime, set by default to 86400 (1 day). This forces all pages using the <DynamicPageList> tag to be re-parsed every day (which very neatly matches your daily spikes). You might want to raise this value instead of disabling the extension (it might only take effect on the next time the pages are parsed, I'm not sure how parser cache works here).

Also, @Ladsgroup @Marostegui am I understanding correctly that you're planning to write and deploy the actual config patch yourself? or do you need us to do it?

I'm always more than happy to do it.

While we might be marked as maintainers, we're hardly more knowledgeable than anyone else, we "maintain" like 40 extensions and we've never worked on most of them.

That said: as far as I know, it can be safely disabled, that won't cause any loss of data or any other lasting damage. The dynamic lists will obviously stop rendering on pages that use them though, so someone should let the ruwikinews community know that we're doing this and why.

Also: I looked at the code briefly and there's an interesting config variable $wgDLPMaxCacheTime, set by default to 86400 (1 day). This forces all pages using the <DynamicPageList> tag to be re-parsed every day (which very neatly matches your daily spikes). You might want to raise this value instead of disabling the extension (it might only take effect on the next time the pages are parsed, I'm not sure how parser cache works here).

This is a very good point, maybe we can raise it from 1 day to 7 days and keep monitoring. If we stop seeing the spikes, then we've got our offender. How does that sound?

Also, @Ladsgroup @Marostegui am I understanding correctly that you're planning to write and deploy the actual config patch yourself? or do you need us to do it?

I am definitely not comfortable doing that, in fact, I wouldn't know where to disable it :-) - so if possible I would like you or someone from your team to prepare and deploy the patch.
We, DBAs, normally only deploy mw-config and scap deployments for db-eqiad.php and db-codfw.php and that's becoming less and less usual now that we've moved most of our DBs out of those files.

So this is deployed now, I haven't put in gerrit yet but needs communication to wikis that have it enabled. I guess Editing team can help with that? Worth noting that this mitigates the issue but doesn't solve it. We really need to either completely undeploy this extension or do lots of improvements on its queries.

Thank you Amir!
I will keep monitoring db2127's graphs to see if we see this happening.
Last spike just happened, at around 17:50 UTC, so if this was the cause, we shouldn't see it happening again. If we do, and they are the same queries, then I guess we need to go for the full disablement approach

Gilles moved this task from Inbox to Radar on the Performance-Team board.Sep 8 2020, 6:46 PM
Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.

So this is deployed now, I haven't put in gerrit yet but needs communication to wikis that have it enabled. I guess Editing team can help with that? Worth noting that this mitigates the issue but doesn't solve it. We really need to either completely undeploy this extension or do lots of improvements on its queries.

This patch got pushed to gerrit by a script that is meant to update wikiversions: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/625950

ppelberg added a comment.EditedSep 8 2020, 6:53 PM

So this is deployed now, I haven't put in gerrit yet but needs communication to wikis that have it enabled. I guess Editing team can help with that? Worth noting that this mitigates the issue but doesn't solve it. We really need to either completely undeploy this extension or do lots of improvements on its queries.

@Ladsgroup yep, we can.

Can you confirm the "Consequences" of 625950?

Consequences
What follows are the changes in behavior editors at ru.wikinews [i] can expect as a result of
625950:

  • On pages that use the DynamicPageList extension, editors will notice that related pages will be updated every 7 days instead of daily as had been the behavior before this change.

i. https://ru.wikinews.org/wiki/%D0%97%D0%B0%D0%B3%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0

@ppelberg It means the parsed lists will stay in cache for longer time, and that they will get updated less frequently. The notice can be something like:

Due to performance issues [that caused recent issues with the servers] with DynamicPageList extension, the list of pages is cached for 7 days rather than 1. That means lists generated via DynamicPageList extension will be updated less frequently. It is possible the extension will be removed from all wikis/ruwikinews, if increase of caching does not help the issue.

or something similar :-).

@ppelberg It means the parsed lists will stay in cache for longer time, and that they will get updated less frequently. The notice can be something like:

Due to performance issues [that caused recent issues with the servers] with DynamicPageList extension, the list of pages is cached for 7 days rather than 1. That means lists generated via DynamicPageList extension will be updated less frequently. It is possible the extension will be removed from all wikis/ruwikinews, if increase of caching does not help the issue.

or something similar :-).

Got it – thank you, @Urbanecm!

I had a closer look at the problematic query.

What it does is: return 18 most recently created main namespace pages that belong to both categories "Опубликовано" and "Материалы_PanARMENIAN.Net". This is used to generate the list of related articles shown when reading an article belonging to this category, e.g. https://ru.wikinews.org/wiki/Макрон_наградил_Астрид_Паносян_Национальным_орденом (and ~100,000 other pages).

It does this by first listing all the pages in those categories, then sorting them by creation date and taking the latest 18. The problem is that the two categories are huge (~99% and ~70% of all main namespace pages), so the first step is much slower than normal.

I experimented a bit and it would actually be faster to do it the opposite way: scan the most recently created pages, until it finds 18 of them that belong to those categories. This version using STRAIGHT_JOIN to do that was faster when I tested on analytics replicas, and looks nicer in the EXPLAIN output:

mysql:research@s3-analytics-replica.eqiad.wmnet [ruwikinews]> explain SELECT /* DynamicPageListHooks::renderDynamicPageList  */  page_namespace,page_title,c1.cl_timestamp  FROM `page` LEFT JOIN `flaggedpages` ON ((page_id = fp_page_id)) STRAIGHT_JOIN `categorylinks` `c1` ON ((page_id = c1.cl_from) AND (c1.cl_to='Опубликовано')) STRAIGHT_JOIN `categorylinks` `c2` ON ((page_id = c2.cl_from) AND (c2.cl_to='Материалы_PanARMENIAN.Net')) LEFT OUTER JOIN `categorylinks` `c3` ON ((page_id = c3.cl_from) AND (c3.cl_to='Не_публиковать')) LEFT OUTER JOIN `categorylinks` `c4` ON ((page_id = c4.cl_from) AND (c4.cl_to='Ожидаемые_события_по_датам')) LEFT OUTER JOIN `categorylinks` `c5` ON ((page_id = c5.cl_from) AND (c5.cl_to='Архивные_новости')) LEFT OUTER JOIN `categorylinks` `c6` ON ((page_id = c6.cl_from) AND (c6.cl_to='Викиновости_коротко'))   WHERE page_namespace = 0 AND (fp_stable IS NOT NULL) AND page_is_redirect = 0 AND c3.cl_to IS NULL AND c4.cl_to IS NULL AND c5.cl_to IS NULL AND c6.cl_to IS NULL  ORDER BY page_id DESC LIMIT 18;
+------+-------------+--------------+--------+------------------------------------------------+---------+---------+-------------------------------+------+--------------------------------------+
| id   | select_type | table        | type   | possible_keys                                  | key     | key_len | ref                           | rows | Extra                                |
+------+-------------+--------------+--------+------------------------------------------------+---------+---------+-------------------------------+------+--------------------------------------+
|    1 | SIMPLE      | page         | index  | PRIMARY,name_title,page_redirect_namespace_len | PRIMARY | 4       | NULL                          | 450  | Using where                          |
|    1 | SIMPLE      | c3           | eq_ref | PRIMARY,cl_timestamp,cl_sortkey                | PRIMARY | 261     | ruwikinews.page.page_id,const | 1    | Using where; Using index; Not exists |
|    1 | SIMPLE      | c4           | eq_ref | PRIMARY,cl_timestamp,cl_sortkey                | PRIMARY | 261     | ruwikinews.page.page_id,const | 1    | Using where; Using index; Not exists |
|    1 | SIMPLE      | flaggedpages | eq_ref | PRIMARY                                        | PRIMARY | 4       | ruwikinews.page.page_id       | 1    | Using where                          |
|    1 | SIMPLE      | c1           | eq_ref | PRIMARY,cl_timestamp,cl_sortkey                | PRIMARY | 261     | ruwikinews.page.page_id,const | 1    | Using where                          |
|    1 | SIMPLE      | c2           | eq_ref | PRIMARY,cl_timestamp,cl_sortkey                | PRIMARY | 261     | ruwikinews.page.page_id,const | 1    | Using where; Using index             |
|    1 | SIMPLE      | c5           | eq_ref | PRIMARY,cl_timestamp,cl_sortkey                | PRIMARY | 261     | ruwikinews.page.page_id,const | 1    | Using where; Using index; Not exists |
|    1 | SIMPLE      | c6           | eq_ref | PRIMARY,cl_timestamp,cl_sortkey                | PRIMARY | 261     | ruwikinews.page.page_id,const | 1    | Using where; Using index; Not exists |
+------+-------------+--------------+--------+------------------------------------------------+---------+---------+-------------------------------+------+--------------------------------------+

@Marostegui Can you convince MySQL to magically use this kind of plan for this query? Or, do you think it's a good idea to generate queries with STRAIGHT_JOIN when doing these kinds of queries on very large categories? (I imagine this would be much slower for normal-sized categories, so it'd have to be conditional.)

So this has kept happening overnight, the last spike was a couple of hours ago (same queries and same wiki). The change we made yesterday to stop parsing every day and parsing everyweek...does that take effect straightaway or the pages that are flagged to be reparsed will keep being reparsed and we'll see the effect in one week?

@Marostegui Can you convince MySQL to magically use this kind of plan for this query? Or, do you think it's a good idea to generate queries with STRAIGHT_JOIN when doing these kinds of queries on very large categories? (I imagine this would be much slower for normal-sized categories, so it'd have to be conditional.)

Thanks a lot for the investigation @matmarex for the research!

We can try to give a hint to the optimizer using some USE INDEX or FORCE INDEX or even IGNORE INDEX for the existing query. I would rather try that first than going for the STRAIGHT JOIN if we can avoid it.
If we find that the STRAIGHT JOIN is our best chance here, we'd need to test some example queries on the big wikis, as I am a bit afraid on how crazy the optimizer can go there with that sort of JOIN.

@matmarex if you can change a bit the existing query to insert some USE INDEX for something that would make more sense, we can try that manually on the existing hosts and check what the optimizer would do and see if that results on a better plan.
While the query you've made with STRAIGHT JOIN seems certainly much better than the existing...let's see if a USE INDEX is enough as that would require less changes and less potential issues with the optimizer.

Thank you again!

PS: We still don't know if this is the culprit or the issue or just the result of something else slowing down all the big queries :-(

If increasing cache didn't help, I'm in favor of removing the extension from ruwikinews, especially as this seems to cause some real issues for other users. I can do that later today unless someone has a better idea on what to do.

Yeah, it definitely didn't help, I am seeing the queries just now (another spike).
@Urbanecm +1 from my side to disable it and see if that stop the issues.

Thank you - I am going to keep monitoring s3 during the day. I will keep you all posted

so someone should let the ruwikinews community know that we're doing this and why.

@ppelberg: If I interpret T262240#6443986 correctly, could you (or someone) please provide a link where this has been communicated? Because T262391.

If I interpret T262240#6443986 correctly, could you (or someone) please provide a link where this has been communicated? Because T262391.

Ah, found it: https://ru.wikinews.org/wiki/Викиновости:Форум/Общий#DynamicPageList . Sorry!

This is still very early to confirm, but the effect of disabling that extension on ruwikinews had this effect:

And this is a 2 weeks graph, that shows that we are back at previous values

Base added a subscriber: Base.Sep 9 2020, 1:41 PM

@Marostegui Per T262391#6446944, _all_ content pages on ruwikinews apparently had DPL injected. That means that every time DPL decided to re-parse its pages, it needed to parse _all_ content pages (that means about ~150k of pages). T262391#6446944 have that on Aug 29, the number of all pages in ruwikinews increased from 30k to the current number, 150k. That means the number of work DPL needed to do was about 5 times higher. By looking at the numbers, does that sound like a possible explanation of the issue? If it sounds plausible, we may ask ruwikinews to remove DPL from all pages - one of solutions is to have it only in recent pages, or to include a link to a page that has it. What do you think?

@Marostegui Per T262391#6446944, _all_ content pages on ruwikinews apparently had DPL injected. That means that every time DPL decided to re-parse its pages, it needed to parse _all_ content pages (that means about ~150k of pages). T262391#6446944 have that on Aug 29, the number of all pages in ruwikinews increased from 30k to the current number, 150k. That means the number of work DPL needed to do was about 5 times higher. By looking at the numbers, does that sound like a possible explanation of the issue? If it sounds plausible, we may ask ruwikinews to remove DPL from all pages - one of solutions is to have it only in recent pages, or to include a link to a page that has it. What do you think?

I am not fully aware of how DPL works behind the scenes, but those and that huge increase is something that can definitely cause issues, especially if the queries are doing kinda full scans.
On this query we can see the number of scanned rows, which sort of matches those numbers:

|    1 | SIMPLE      | c2           | ref    | PRIMARY,cl_timestamp,cl_sortkey                | cl_timestamp | 257     | const                       | 153660 | Using where; Using index; Using temporary; Using filesort |

So far, since the disablement of the extension I haven't seen any spike and the graphs posted at T262240#6446166 are still the same, no more increases:

Let's keep monitoring for at least 24h more hours to be fully sure.

So this has kept happening overnight, the last spike was a couple of hours ago (same queries and same wiki). The change we made yesterday to stop parsing every day and parsing everyweek...does that take effect straightaway or the pages that are flagged to be reparsed will keep being reparsed and we'll see the effect in one week?

I think the cache time is associated with each page's parsed content, so the increase to $wgDLPMaxCacheTime would have effectively taken effect after the next time those pages are reparsed (so… today?). I guess we won't know, since the extension was just disabled entirely.

@Marostegui Per T262391#6446944, _all_ content pages on ruwikinews apparently had DPL injected. That means that every time DPL decided to re-parse its pages, it needed to parse _all_ content pages (that means about ~150k of pages). T262391#6446944 have that on Aug 29, the number of all pages in ruwikinews increased from 30k to the current number, 150k. That means the number of work DPL needed to do was about 5 times higher. By looking at the numbers, does that sound like a possible explanation of the issue? If it sounds plausible, we may ask ruwikinews to remove DPL from all pages - one of solutions is to have it only in recent pages, or to include a link to a page that has it. What do you think?

As far as I know, it's not exactly like that, DPL doesn't decide to parse "its" pages. The parse just happens separately for each page, as the effect of the parser cache for that page expiring.

However, since apparently we had 120,000 pages imported by a bot at nearly the same time, their caches probably all expire at nearly the same time, and they cause this huge spike?

@Marostegui So here's a silly question… Do all of the problematic queries involve the category "Материалы_PanARMENIAN.Net" (this is the category containing the big bot import)? And is it all literally the same query, or just a few, or are there many different slow queries?

I wonder if we could simply add some caching to avoid running the same query over and over. You said at the beginning that it is "not extremely bad" and I wonder if just caching the result for a couple of hours (instead of running it for every parse) would be enough to avoid major problems.

Bawolff added a comment.EditedSep 10 2020, 4:00 AM

So summary of what happened

  • Newsbot imports a massive number (~100,000) of pages that have {{тема|Материалы PanARMENIAN.Net}} ( link) on them and are in [[category:Материалы PanARMENIAN.Net]] in very short time frame
    • This template adds <dynamicpagelist>category = Опубликовано category = Материалы PanARMENIAN.Net notcategory = Не публиковать notcategory = Ожидаемые события по датам notcategory = Архивные новости notcategory=Викиновости коротко count = 18 stablepages = only suppresserrors = true namespace = Main addfirstcategorydate = true ordermethod = created </dynamicpagelist> To these pages. Key point this corresponds to the query above, where Опубликовано is the c1 inner category join, c2 is the Материалы PanARMENIAN.Net inner category join, the LIMIT is 18 (count=18) and the ORDER BY is page_id (ordermethod=created)
  • So we now have 100k pages that have to run this kind of inefficient query
  • page caching doesn't particularly help, as all these pages are new and have to be rendered at least once
  • PoolCounter doesn't help because the query is on many pages (Perhaps one mitigation could be to use the pool counter to limit the max number of things that can make a DPL query at one time, to prevent query pileup)
  • The mitigation of changing DPL cache time probably doesn't help, as its unlikely these parses are being triggered by view after parse cache expiry, its probably all initial insert and jobqueue template changes

So potential things we can do:

  • Possibly changing dpl to order by c1.cl_from instead of page_id might help mildly with no behaviour change but not a lot of performance difference. (If you want to get fancy, check for how big each category is, and sort by the c1.cl_from of the smallest one). {Edit: Actually not sure if this would help anything. Probably have no affect. It would have to be filesorted either way, as the index is (cl_from, cl_to) and not the reverse]
  • Possibly sending DPL queries to a specific subgroup of replicas might reduce risk a bit
  • Possibly we could use poolcounter to limit the max number of cocurrent DPL executions.
  • We could tell they can have DPL back as long as they don't put an infobox with Материалы PanARMENIAN.Net dpl on newly created bot articles, or require they import slowly (There's lots of risk here figuring out exact requirements. Maybe something like, if you have a DPL on a lot of pages, it must have at least one category in it with < 20,000 pages or something)
  • We could independently cache in memcached the results of a DPL based on the hash of its input for some short period of time (Even 5 minutes). In this particular case, the issue was the same query was calculated 100,000 times. Even a short lived cache of that query probably would have saved the db

Thank you everyone for helping out here!.

For what is worth, since the extension was disabled yesterday, we've not had any other spike and traffic remains within normal values:


Also no CPU saturation anymore

Anecdotically, it is interesting to see that after the extension was disable, the hosts' temperature sensors are reporting less temperature:

@Marostegui So here's a silly question… Do all of the problematic queries involve the category "Материалы_PanARMENIAN.Net" (this is the category containing the big bot import)? And is it all literally the same query, or just a few, or are there many different slow queries?

Unfortunately I don't know. The query killer doesn't store the whole query, but just the first portion of it.
I have checked also Tendril, but it also doesn't store the categories, but just the "????" instead, so hard to know. Also scrolling thru my terminal's buffer didn't help. The only way to know is to run show full processlist _during_ one of the spikes, which are hard to predict, and they are no longer happening after the disablement.

  • Possibly sending DPL queries to a specific subgroup of replicas might reduce risk a bit

This specific bit won't help much on this case.
s3 has 900 wikis, but they are not super big, so we don't have dedicated vslow hosts like we do for massive wikis like enwiki or wikidatawiki. Given how expensive database servers are, we try to maximize their usage, and we found that having a dedicate vslow host in s3 didn't really make sense, as it would seat idle most of the time, so it does serve a big portion of normal traffic too.

I think the cache time is associated with each page's parsed content, so the increase to $wgDLPMaxCacheTime would have effectively taken effect after the next time those pages are reparsed (so… today?). I guess we won't know, since the extension was just disabled entirely.

If it helps to investigate issues and come up with a long-term solution, maybe we can re-enable it for a few hours I would guess? (I could also try to capture the entire queries to answer your earlier question).

I made a patch which i believe would reduce risk of this particular type of cache stampede (not sure if that's right word): https://gerrit.wikimedia.org/r/c/mediawiki/extensions/intersection/+/626268/

Bawolff added a comment.EditedSep 10 2020, 7:06 AM

If newsbot is done with its import, it means future refresh to those pages might only be through the job queue. I think job queue has fancy code to monitor db health and adjust speed, although that might only be for replica lag and as far as i understand, a replica being overloaded with queries doesn't actually contribute to replica lag.

Update: no more spikes related to this issue have happened since the disablement of the extension at T262240#6445620. So it is probably safe to assume that the root cause of those were indeed DPL on ruwikinews.

(Note: If someone happens to check db2127's graphs and see some spikes, those are unrelated to this issue, they had to do with some maintenance going on on s3).

Thanks @Marostegui. I guess we should tell ruwikinews this extension will be disabled indefinitely, until the performance is improved? @ppelberg's team can perhaps help to do it?

Bawollf's patch (thanks!) could help, as DPL runs the same query multiple times (each time it is included somewhere, and the page is re-parsed). @Marostegui Once it's merged, do you think it would be possible to test-enable the extension again, as such issues are very hard to debug outside of production?

Thanks @Marostegui. I guess we should tell ruwikinews this extension will be disabled indefinitely, until the performance is improved? @ppelberg's team can perhaps help to do it?

Yeah, I believe so, as this is impacting the rest of wikis on s3, which are more than 900.

Bawollf's patch (thanks!) could help, as DPL runs the same query multiple times (each time it is included somewhere, and the page is re-parsed). @Marostegui Once it's merged, do you think it would be possible to test-enable the extension again, as such issues are very hard to debug outside of production?

I am happy to try it again, but obviously not today, as during the weekend the availability is very reduced. But during next week (or whenever the patch gets merged) I am fine if we want to reenable it and monitor it.

So this has kept happening overnight, the last spike was a couple of hours ago (same queries and same wiki). The change we made yesterday to stop parsing every day and parsing everyweek...does that take effect straightaway or the pages that are flagged to be reparsed will keep being reparsed and we'll see the effect in one week?

@Marostegui Can you convince MySQL to magically use this kind of plan for this query? Or, do you think it's a good idea to generate queries with STRAIGHT_JOIN when doing these kinds of queries on very large categories? (I imagine this would be much slower for normal-sized categories, so it'd have to be conditional.)

Thanks a lot for the investigation @matmarex for the research!

We can try to give a hint to the optimizer using some USE INDEX or FORCE INDEX or even IGNORE INDEX for the existing query. I would rather try that first than going for the STRAIGHT JOIN if we can avoid it.
If we find that the STRAIGHT JOIN is our best chance here, we'd need to test some example queries on the big wikis, as I am a bit afraid on how crazy the optimizer can go there with that sort of JOIN.

@matmarex if you can change a bit the existing query to insert some USE INDEX for something that would make more sense, we can try that manually on the existing hosts and check what the optimizer would do and see if that results on a better plan.
While the query you've made with STRAIGHT JOIN seems certainly much better than the existing...let's see if a USE INDEX is enough as that would require less changes and less potential issues with the optimizer.

Thank you again!

PS: We still don't know if this is the culprit or the issue or just the result of something else slowing down all the big queries :-(

Reading through the backscroll of this bug a bit more calmly - we probably don't want to do this generally. Its more efficient in this case because the two categories are very big and on-par with the size of the page table (About 1/70th the size of the page table). However in the more common case of a much smaller categories, a straight_join would be significantly less performant.

Also: I looked at the code briefly and there's an interesting config variable $wgDLPMaxCacheTime, set by default to 86400 (1 day). This forces all pages using the <DynamicPageList> tag to be re-parsed every day (which very neatly matches your daily spikes).

Just to clarify, is there a daily spike? The graphs on this task show regular spikes, but they look much more often then daily. I suppose the sample size was not long enough to figure out if there was a day long pattern with much certainty. If there were a daily spike pattern suggestive of parser cache expiring - it would suggest that the performance problems are much deeper than just User:NewsBots's mass editing.

So, my proposal for re-enabling on ruwikinews, would be to do it with the following settings: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/626919

This depends on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/intersection/+/626796 & https://gerrit.wikimedia.org/r/c/mediawiki/extensions/intersection/+/626268 which are waiting on code review. If someone could look at them come monday, I would super appreciate that.

sbassett moved this task from Incoming to Watching on the Security-Team board.Sep 14 2020, 3:25 PM

FYI: DPL was now re-enabled at ruwikinews, because ruwikinews reported "we have removed DPL from over 100,000 pages". I discussed that with @Marostegui, and he said he will monitort s3's load. Let see how it goes. If it will be bad, we'll disable it again, and try it again with your patches (which should be reviewed&merged anyway :-)).

So far, after the re-enablement, I am not seeing the values coming back to previous values when DPL was enabled, which is good news:

So far so good.
In the last 24h there's only one DPL query reported, nothing major or out of the ordinary on the graphs.

Things keep looking fine on the db-land.
Probably we should get @Bawolff patch reviewed, merged, monitor for a couple of days and call this resolved if things keep being quiet?

Urbanecm moved this task from Backlog to Working on on the User-Urbanecm board.Sep 17 2020, 3:55 PM

Any update on here? Should we follow the pending steps at T263220?
What's pending for now, merging @Bawolff patch?

Marostegui closed this task as Resolved.Sep 29 2020, 7:50 AM
Marostegui assigned this task to Bawolff.

I am going to close this. The scope of this task is mitigated, let's follow up on T263220: Limit concurrency of DPL queries
Thanks everyone who's helped on getting this initial impact mitigated.

Looks like this is happening again, see: T266775

In the interest of transparency and knowledge sharing, can this task be public? AFAICT, there is not private info on here, the traffic wasn't an intentional DOS, and the underlying issue is now fixed.

Sounds good to me

Bawolff changed the visibility from "Custom Policy" to "Public (No Login Required)".Nov 17 2020, 5:47 AM
Bawolff changed the edit policy from "Custom Policy" to "All Users".