I reproduced this locally, and my fix above seems to be sufficient. I'm pretty sure the intention of Aaron's change was to replace QUERY_IGNORE_DBO_TRX | QUERY_CHANGE_NONE with QUERY_CHANGE_LOCKS purely for brevity, with the same semantics applied.
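That is, a substitution of this shape (the call site here is illustrative, not the actual change):

```
// Before Aaron's change:
$db->query( $sql, __METHOD__, self::QUERY_IGNORE_DBO_TRX | self::QUERY_CHANGE_NONE );
// After, intended as shorthand with the same semantics:
$db->query( $sql, __METHOD__, self::QUERY_CHANGE_LOCKS );
```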
People are begging for a release to fix this issue at https://github.com/php-memcached-dev/php-memcached/issues/496
I was able to reproduce the bug and confirm that the commit I linked to fixes it.
Tue, Jan 25
I think this task only refers to the specific case of update.php, so this is resolved. T289926 is for the audit.
For debug output, is_resource() is often appropriate, e.g.
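Something along these lines (the helper name is invented, for illustration):

```
// Format a value for a debug line, handling the resource case explicitly.
function formatForDebug( $value ) {
    if ( is_resource( $value ) ) {
        return 'resource(' . get_resource_type( $value ) . ')';
    }
    return var_export( $value, true );
}
```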
Mon, Jan 24
There is this commit in the php-memcached repo which may be related: https://github.com/php-memcached-dev/php-memcached/commit/51c9baf49f96c5f35be8257549f426ef1860f0ef
That's an odd error message, is that reproducible?
This was fixed by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/755846
Sun, Jan 23
Fri, Jan 21
Thu, Jan 20
Was any benchmarking done to support the introduction of this feature? There was no linked bug on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/580607. I'm having trouble finding a test case for which it is faster than json_encode().
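The kind of comparison I have in mind is a plain micro-benchmark (the payload file and iteration count here are arbitrary):

```
$value = json_decode( file_get_contents( 'payload.json' ), true ); // any realistic input
$n = 100000;
$t = hrtime( true );
for ( $i = 0; $i < $n; $i++ ) {
    json_encode( $value );
}
printf( "json_encode(): %.0f ns/op\n", ( hrtime( true ) - $t ) / $n );
// ...then the same loop with the new encoder, on the same $value.
```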
Wed, Jan 19
If it's a bug, then the bug is in the ICU sort order not in MediaWiki's first letter identification.
There is openssl_digest() which presumably has hardware acceleration and can do SHA-256 in 2.1 seconds per gigabyte. But its input is a single string, not a stream, so it can't quite work as a drop-in replacement. At least it gives you some idea of what an optimisation patch for PHP could do.
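A rough way to take that kind of measurement (memory limit raised because the whole input must be a single in-memory string for openssl_digest(), which is exactly the limitation mentioned):

```
ini_set( 'memory_limit', '3G' );
$gig = str_repeat( str_repeat( 'x', 1024 * 1024 ), 1024 ); // 1 GiB
$t = microtime( true );
openssl_digest( $gig, 'sha256' );
printf( "openssl_digest: %.2f s/GB\n", microtime( true ) - $t );
$t = microtime( true );
hash( 'sha256', $gig ); // the current builtin, for comparison
printf( "hash():         %.2f s/GB\n", microtime( true ) - $t );
```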
Tue, Jan 18
52 seconds in Shellbox\Client::computeHmac over 3 calls; I guess that's all the signatures for the remote Shellbox calls.
Mon, Jan 17
Thu, Jan 13
It would have been complicated to set up CommonsMetadata locally in order to reproduce the bug, so I modelled the bug with a patch to FormatMetadata:
In s2 and s3 I used:
- In s3, the affected wikis were elwiktionary, eowiktionary, hiwiki, incubatorwiki, itwikiquote, ruwiktionary, simplewiki, specieswiki and trwiktionary. The total restored row count is 226,335.
- In s2, only itwiki is affected. I undeleted 11565 rows.
- Using the statement-based binlog, I confirmed that there were no affected deletes on commonswiki.iwlinks.
- I confirmed that there were no affected deletes on commonswiki.templatelinks. Templatelinks is rarely affected because the bug is triggered when multiple target namespaces appear in a delete query.
- On wikidatawiki there was only one bad query, and it deleted 1.7M rows linking to [[P156]]. I manually trimmed the unrelated rows from the undelete SQL and started the import.
I used the statement-based binlog to determine that for commonswiki pagelinks, the first affected query was at 20:20:40 and the last was at 20:33:23. So I'll narrow the range for undeletion accordingly.
The plan is to undelete pagelinks rows on commonswiki that were deleted between 20:19:00 and 20:40:00, using the pasted Perl script and sql.php with the batch size patch. The sooner the better, since the undeleted link data becomes more stale as time goes by. I will go ahead as soon as someone reviews the concept and voices approval.
I wrote this Perl script: P18710. It makes a 500MB SQL file from the commonswiki pagelinks deletions.
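The idea, sketched in PHP terms (file names and the pattern are simplified; the real queries also contain IN lists that need expanding):

```
$in = fopen( 'binlog-statements.sql', 'r' );   // mysqlbinlog output (name invented)
$out = fopen( 'undelete-pagelinks.sql', 'w' );
while ( ( $line = fgets( $in ) ) !== false ) {
    if ( strpos( $line, 'DELETE FROM `pagelinks`' ) !== 0 ) {
        continue;
    }
    // Recover each deleted (pl_from, pl_namespace, pl_title) tuple and invert it.
    preg_match_all(
        "/pl_from = (\d+) AND \(pl_namespace = (\d+) AND pl_title = ('[^']*')\)/",
        $line, $m, PREG_SET_ORDER
    );
    foreach ( $m as $t ) {
        fwrite( $out, "INSERT IGNORE INTO pagelinks (pl_from,pl_namespace,pl_title) " .
            "VALUES ({$t[1]},{$t[2]},{$t[3]});\n" );
    }
}
```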
Wed, Jan 12
I mean the insert traffic will be there already, since the old version will slowly repair deleted links. So we should throttle the jobs now if they are a problem.
There will be additional insert traffic as refreshlinks jobs reinsert the deleted links.
DELETE FROM `templatelinks` WHERE (tl_from = 9691118 AND (tl_namespace = 10 AND tl_title IN ('LangSwitch','Purge') ) OR (tl_namespace = 828 AND tl_title = 'LangSwitch'))
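Note the operator precedence in that WHERE clause: AND binds tighter than OR, so the final (tl_namespace = 828 AND tl_title = 'LangSwitch') disjunct is not constrained by tl_from, and the statement deletes matching rows for every page. That's why a second target namespace is needed to trigger the bug.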
I explored the problem using the patch above. With 38 extensions enabled, my wiki generated a preload file listing 2833 files. I implemented both a maintenance script and a dynamic approach, on PHP 8.0.
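The dynamic approach amounts to compiling each file on the generated list from a preload script, roughly like this (paths invented):

```
// preload.php, referenced from php.ini via opcache.preload=/path/to/preload.php
foreach ( file( __DIR__ . '/preload-list.txt', FILE_IGNORE_NEW_LINES ) as $path ) {
    if ( $path !== '' ) {
        opcache_compile_file( $path );
    }
}
```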
Tue, Jan 11
I don't think it's a parser test issue. Those comments are leaking into the page indicators and need to be stripped out.
Never mind, it doesn't work that way. All data from the preload script is discarded. Only functions and classes are preserved.
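To illustrate: with opcache.preload pointing at a script like this, only the definitions survive into later requests:

```
// preload.php
function preloadedHelper() {       // preserved: functions and classes persist
    return 42;
}
$warmData = [ 'expensive' => 1 ];  // discarded: variables do not survive preloading
```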
Mon, Jan 10
I have an idea:
Fri, Jan 7
Just make it 100 everywhere.
The revert is merged in master. Any remaining work is part of T161976.
Is the procedure the one documented at https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments ?
Thu, Jan 6
It can be reverted in master; the feature request was not urgent.
I think estimateRowCount() can be used to decide whether to set the STRAIGHT_JOIN option. I tried a few EXPLAIN SELECT queries and the estimates were only off by a factor of 2 or so, which should be good enough.
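Something like this (the tables, conditions and threshold are placeholders):

```
$estimate = $dbr->estimateRowCount( 'page', '*', $conds, __METHOD__ );
$options = [];
if ( $estimate < 1000 ) {
    // The filtered side is small, so leading with it should be safe; pin the order.
    $options[] = 'STRAIGHT_JOIN';
}
$res = $dbr->select(
    [ 'page', 'revision' ], '*', $conds, __METHOD__, $options,
    [ 'revision' => [ 'JOIN', 'rev_page=page_id' ] ]
);
```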
It's just a query planning error. If you force the join order then it's fast.
Wed, Jan 5
I was not able to reproduce that error.
Tue, Jan 4
I filed T298573 for the kernel tuning issue.
Dec 16 2021
Yeah, it turns out segfaulting once every couple of hours keeps a lid on memory usage. I did a linear regression of the data from 2021-12-16 04:00 to 21:50: mw1414 is leaking 122 MB/hour and mw1415 is leaking 600 MB/hour. Most likely there is a second, smaller memory leak, but the task as described is resolved. I would suggest rolling out the patch. This is my last day before vacation; I can do another core dump analysis in the new year.
Dec 15 2021
I collected 37 segfault backtraces with systemd-coredump. All indicated that the problem is the lack of Dmitry's followup commit 5f47ce9127ec9123ba2359bb05cf180c8a4177b6. So I added that to my mysqlnd-leak-7.2-backport branch as 5f5ec42235cc15eacd86abe55ebb33bfb677a2ce. I'll do some more local testing but hopefully it will work now.
There are 20-30 segfaults per hour, so there's probably a bug with my backport. I'll see if I can figure it out. For now, don't upgrade the rest of the cluster.
The fix was a7305eb539596e175bd6c3ae9a20953358c5d677, included in PHP 7.3, which made mysqlnd use the request-local allocator for result data even for persistent connections.
Reduced test case:
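A minimal sketch of its shape, assuming a loop over a persistent mysqli connection (host, credentials and query invented):

```
for ( $i = 0; $i < 1000; $i++ ) {
    $db = new mysqli( 'p:127.0.0.1', 'root', '', 'test' );
    $db->query( 'SELECT repeat("x", 65536)' )->fetch_all();
    $db->close();
    // Watch the process RSS grow; the leak is in malloc(), so
    // memory_get_usage() won't show it.
    preg_match( '/VmRSS:\s+(\d+)/', file_get_contents( '/proc/self/status' ), $m );
    echo "$i: {$m[1]} kB\n";
}
```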
Dec 14 2021
I think the things being leaked are zend_string objects allocated by mysqlnd_wireprotocol.c line 1380. In the core dump they have a reference count of 1 but no reference to them exists:
As discussed at T296098, mw1414 is serving no requests but has high memory usage. Analysis of /proc/<pid>/maps shows that most of the memory usage is in the sbrk() segment which is only used by malloc(), not emalloc(), opcache or APCu. Dumping the segment from a random process and looking at random parts of it showed field name strings:
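For reference, a sketch of that kind of dump (root/ptrace access assumed; the sbrk() segment is the one marked [heap] in /proc/<pid>/maps):

```
$pid = (int)$argv[1];
foreach ( file( "/proc/$pid/maps" ) as $line ) {
    if ( strpos( $line, '[heap]' ) === false ) {
        continue;
    }
    [ $start, $end ] = array_map( 'hexdec', explode( '-', strtok( $line, ' ' ) ) );
    $mem = fopen( "/proc/$pid/mem", 'rb' );
    fseek( $mem, $start );
    echo fread( $mem, min( 65536, $end - $start ) ); // first 64 KiB of the heap
}
```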
I'm moving my work on the root cause to T297667: mysqli/mysqlnd memory leak
I filed T297667 for the PHP bug which I'm working on.
The only thing unique to this report as compared to T296098 and T296063 is the failure mode, i.e. mmap() failure, which as I've said should be fixed by tuning the kernel. Is tuning the kernel the thing that you want unbroken now? Again, it has probably been broken for years.
I don't think it's necessary to deprecate $wgWhitelistRead in order to prevent unauthorized users from executing arbitrary actions. We can just have actions opt in to being allowed in whitelist read mode, the same as we do for API modules with isReadMode():
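A sketch of the opt-in (the method name here is invented, by analogy with ApiBase::isReadMode()):

```
abstract class Action {
    /**
     * Whether this action is safe to run for a user who only has access
     * via $wgWhitelistRead. Read-only actions (view, raw, ...) would
     * override this to return true; everything else stays denied by default.
     */
    public function allowedInWhitelistReadMode() {
        return false;
    }
}
```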
Dec 13 2021
I dumped some random parts of the heap of a php-fpm7.2 process on mw1414. It looks like DB query results. Probably mysqli is leaking query results, hence when the query rate increases, the leak rate increases.
OK, I didn't realise mw1414 was depooled with high memory usage; that is useful. I looked at /proc/<pid>/maps. The heap segments, typically at 0x563908311000, account for 36GB of memory usage if you add them all together, out of 41GB total "used" memory. So I think it's probably coming from malloc(), not APC/APCu or emalloc().
So the wtp* servers were indeed out of memory, as reported at T296098. There were other consequences, for example:
It tried to get a core dump:
Dec 12 2021
It sounds like nobody has a theory as to how an increased query rate could lead to increased memory. It would have been nice if someone could have got a core dump of an affected process.
The message probably indicates that mmap() returned NULL when PHP tried to allocate memory. I don't know why that would happen; you would expect oom-killer to be invoked if the system is out of memory. The message probably occurs randomly on a stressed system, rather than on a process that is using a lot of memory.
Dec 2 2021
PageImages loads the text of the page from the database during LinksUpdate. Presumably this bug would be fixed if it stopped doing that, as I proposed in T296895.
Nov 25 2021
Per my comment at T269459#7522285, Dodo has poor performance compared to the PHP DOM extension. I don't think it's fixable without changing some of the design goals. The main problem is increased memory usage, especially an increased number of countable objects, leading to substantial GC overhead. I don't think it is feasible to store the Parsoid DOM using pure PHP linked lists. My recommendation is for Parsoid to go back to using the PHP DOM extension directly, and to instead consider a W3C-compliant layer wrapping DOM data structures for the benefit of extensions, if that is desired.
Nov 24 2021
I am not working on this right now.
Any weird data corruption issue which occurs when handling a timeout should be closed as a duplicate of T293568. It's unclear to me whether this bug qualifies since the original error was not inside the exception handler. But T254210#7164506 would certainly qualify.
Nov 23 2021
Nov 22 2021
I benchmarked Dodo integrated with Parsoid while parsing a realistic test case ([[Australia]] with local templates and no images), on stock PHP 8.0.12 (no DOM patch).
Nov 17 2021
I updated the checklist. The next step is to do the migration in production. Wikidata reads revision rows from foreign databases, which implies that the configuration for all wikis should be kept in sync. So:
Nov 12 2021
So it turns out that Dmitry and Nikic were working on the same problem simultaneously, and a fix was just committed. I confirmed that https://github.com/php/php-src/commit/fa0b84a06b03a1c2a2bcadd647232a8a4a90aa05 can be cleanly applied to PHP 7.4, and it fixes the reduced test case. After applying the fix, I was able to send 600 timeout exceptions to Parsoid without it segfaulting or aborting, when compiled in debug mode with AddressSanitizer. Before the fix, it segfaulted after handling 5 timeouts.
Nov 11 2021
Nov 10 2021
I tried moving the opline assignment in ZEND_VM_JMP() to after the interrupt check, but in e.g. ZEND_JMPNZ, the opline has already been incremented by the time ZEND_VM_JMP() is called. It would be necessary to change how conditional jumps work somewhat.
Nov 9 2021
Minimal test case: