Editor since 2005; WMF developer since 2013. I work on Parsoid and OCG, and dabble with VE, real-time collaboration, and OOjs.
On github: https://github.com/cscott
See https://en.wikipedia.org/wiki/User:cscott for more.
I'm not in favor of T331906 at all. This task (or something like it) is a much better solution.
I also greatly prefer T122934 (or something similar) to this hacky workaround.
Got most of this done in 1.41; get/setFlag will have to wait until 1.42, and hard-deprecation of ::addJsConfigVars() is still awaiting review.
This notice is harmless, since OutputHooks hasn't been used in production since at least 1.38 (when deprecation notices were added to it). However, @Nikerabbit points out on Slack that I really should have split this patch in two: keep writing an empty array for $jsonData['OutputHooks'] until "the next train", in a separate patch from the one which stopped reading $jsonData['OutputHooks'], so that we wouldn't get any of these notices if we needed to roll back. This was an oversight on my part, mea culpa.
Adding T346829 as a subtask, since the HtmlHolder interface needs to be able to serialize/deserialize itself in order to be stored in ParserCache. Technically we could probably get away without this by using explicit serialization/deserialization code in ParserOutput, but (as described in the HtmlHolder proposal) ideally we would like to be able to customize the on-disk representation for fast access and use, independent of the details of the HTML string/DOM model formats defined by the HtmlHolder abstraction.
I'm not sure exactly what you mean here. Could you explain further? The parent task seems to be much more concrete.
Also worth noting: since LanguageConverter maintains persistent parse state, it should really not be a singleton the way that LanguageConverterFactory::getLanguageConverter() makes it; instead it should be instantiated and kept live only for the duration of a particular parse, then discarded. A good first step would be adding a $forceNew (or $forceFresh) option to LanguageConverterFactory::getLanguageConverter() and insisting that the parser (and/or anything else which can call convert or convertTo, and thus add new rules to LanguageConverter state) start with a fresh language converter.
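Something like this, just to sketch the shape (the $forceNew parameter and the instantiateConverter() helper here are hypothetical, and the real factory has more wiring):

class LanguageConverterFactory {
	/** @var ILanguageConverter[] the per-language singleton cache */
	private $cache = [];

	public function getLanguageConverter( Language $language, bool $forceNew = false ): ILanguageConverter {
		if ( $forceNew ) {
			// Bypass the singleton cache: the caller owns this instance
			// and should discard it when its parse completes.
			return $this->instantiateConverter( $language );
		}
		$code = $language->getCode();
		$this->cache[$code] ??= $this->instantiateConverter( $language );
		return $this->cache[$code];
	}
}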
Yeah, the way LanguageConverter::getPreferredVariant() works breaks all sorts of abstractions, and I'm frankly surprised it works at all.
Sounds right to me. If possible you should use Parser::msg() to ensure that the page context language is set correctly. (This will be part of the Parser base class so you can use it on ParsoidParser as well sooner or later.) See T202481.
I believe the initial feedback on https://github.com/cscott/json-codec was that creating a new codec class for every class to be deserialized (as in https://github.com/cscott/json-codec/blob/f7bbad011d084dc4ede30bf6d1a47b853b8c64dd/src/JsonCodecableTrait.php#L48) might not meet the performance goals, but I think that can easily be patched.
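The usual fix would be to memoize the per-class codec objects, something like this (a generic memoization pattern, not the library's actual code; makeCodecFor() is a stand-in):

/** @var array codec instances, keyed by class name */
private static $codecs = [];

private static function codecFor( string $className ) {
	// Construct at most one codec per class, instead of a fresh one
	// for every value serialized or deserialized.
	return self::$codecs[$className] ??= self::makeCodecFor( $className );
}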
In T13555#8325916, @cscott wrote: Proposed markup is, more or less:
<section>
  <div class="heading-stuff">
    <h2>Heading</h2>
    <div>[edit] (etc)</div>
  </div>
  <div class="content-stuff">
    Lorem ipsum...
  </div>
  <section>
    ... nested section...
  </section>
</section>
It occurs to me that the legacy ID tags (with the alternate encoding) could also be added in read views, they are just stripped by VE as far as I know.
See also T13555: .mw-editsection links should not be part of the <h#> element which proposes a new organization for section edit links.
See also T151952 and linked tasks, including T90914. I suspect what we want is a proper "page layout specification" that you can "pour" the article text, sidebars, media, etc. into. But it's hard to find the right balance: allowing some customization while maintaining a uniform overall appearance for the site, in a specification compact enough to allow reasonable reskinning for print/mobile/etc. (i.e., without having to recode a bazillion different page layouts).
^ above are some initial implementations of an API in OutputPage for discussion.
That seems correct.
cananian:~/Wikimedia/Extensions/DiscussionTools$ git log origin/REL1_39
commit 2d87dd9f54a9a0974a7b2750a90e13aa49ccbc6f (origin/REL1_39)
Author: Translation updater bot <l10n-bot@translatewiki.net>
Date:   Tue Sep 12 07:51:44 2023 +0200
Since this is an older version of MediaWiki (1.39.x) I would first check that the version of DiscussionTools properly corresponds to the version of MediaWiki you are using. Version skew between Parsoid (which should be bundled with core) and the DiscussionTools extension would be a good explanation of errors of this sort.
There was an entry in 1.40, which links to on-wiki documentation of the changes:
* (T314318) $wgParserEnableLegacyMediaDOM – This setting has been changed, so the alternative modern HTML structure for media is now the default. You can disable it for now by overriding this back to `true` in LocalSettings, but this configuration will be removed in future versions of MediaWiki. For more details, see the documentation at: https://www.mediawiki.org/wiki/Parsoid/Parser_Unification/Media_structure/FAQ
This was also mentioned back in 1.37:
* $wgParserEnableLegacyMediaDOM - This setting defaults to true, and enables the legacy media HTML structure in the output from the Parser. The alternative modern HTML structure for media is described at https://www.mediawiki.org/wiki/Parsing/Media_structure In a future release of MediaWiki this option will default to false, so it's a good idea to test this setting on your wiki early and report any issues.
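Concretely, opting back out for now is a one-line change in LocalSettings.php (per the 1.40 entry above; this escape hatch will be removed in a future release):

// Temporarily restore the legacy media HTML structure:
$wgParserEnableLegacyMediaDOM = true;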
The bug does appear in the mobile-html output:
https://en.wikipedia.org/api/rest_v1/page/mobile-html/Abbey_Road#Track_listing
https://www.mediawiki.org/wiki/Parsoid/fr is an example of translation inside SyntaxHighlight working as expected.
Interestingly, https://www.mediawiki.org/wiki/Parsoid/fr?useparsoid=1 works properly too; worth investigating.
Ballpark figures: previously we found that 35% of the traffic hit the 1st level varnish cache, and that the misses from the 1st level cache caused a slowdown of ~20%. If a cache with a lifetime of ~1 week and size of ~8.2GB results in a 2nd level hit rate of 78%, then we expect that 14% ((1-.35)*(1-.78)) of the traffic still misses. If that 14% causes a 20% slowdown (and all the cache hits cause 0% slowdown), waving hands a bit, then we'd expect something like 3% overall slowdown after deploying this cache. Could be some surprises in p10 vs p90 etc if it turns out that our misses turn out to cluster on the tail of the latency distribution, but there's no particular reason to think that would be the case.
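A quick sanity check of that arithmetic (same numbers as above, assuming cache hits cost nothing):

$l1Hit = 0.35;        // traffic served by the 1st-level varnish cache
$l2Hit = 0.78;        // assumed 2nd-level hit rate (~1 week lifetime, ~8.2GB)
$missPenalty = 0.20;  // slowdown observed on a full cache miss

$missRate = ( 1 - $l1Hit ) * ( 1 - $l2Hit );  // 0.65 * 0.22 ≈ 0.143
$overall = $missRate * $missPenalty;          // ≈ 0.029, i.e. ~3% overall slowdown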
Probably caused by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Scribunto/+/946633, as @Jdlrobson notes. Worth pointing out that this unmasked an underlying bug in Scribunto, where some of the errors on the page were being lost because ScribuntoErrors was being overwritten. I'll look into this a bit to make sure I fully understand what's going on before reverting.
In T345319#9140877, @matmarex wrote:
There's another bug related to PREG_BACKTRACK_LIMIT_ERROR -- T341320. The fix for that (an incorrectly too-low backtrack limit) was deployed relatively recently; worth double-checking that this isn't the same cause.
This seems like "working as designed" to me. The whole point of the angle-brackets-extension-tag syntax in wikitext is that *everything inside is ignored* until you get to the end tag. That allows you to encapsulate arbitrary content inside an extension tag, which is pretty fundamental. The {{#tag}} syntax is explicitly intended to be the opposite: to allow exposing the contents to the wikitext parser (so that <noinclude> etc work). So I'd lean strongly toward "won't fix" here.
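To illustrate with <ref> (the same holds for any extension tag):

<ref>foo <noinclude>bar</noinclude></ref>       -- the <noinclude> is opaque literal content, passed straight to the extension
{{#tag:ref|foo <noinclude>bar</noinclude>}}     -- the <noinclude> is handled by the wikitext parser before the extension sees it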
Presumably the same hooks we use for pregeneration could be preserved, just hollowed out so that they "only" did cache invalidation?
Note that DT really cares a lot that the post-DT output is cached; see discussion in https://www.mediawiki.org/w/index.php?title=Parsoid%2FOutputTransform&diff=6086524&oldid=6051987 -- the experimental hook will probably not be cached, and that will have to be resolved before read views deployment.
The full proposal is at https://www.mediawiki.org/wiki/Parsoid/OutputTransform, "Add a hook in FlavorDispatcher to allow inserting additional passes into the chain for any flavor; this will replace the ParserOutputAfterTidy hook."
Briefly summarizing a discussion on CTT's tech forum, there are some specific changes which could move Parsoid and legacy output closer together:
T344032: CI failing on wmf/1.41.0-wmf.20 due to Parsoid\Config\SiteConfigTest is another instance of this issue, with a little wrinkle. In addition to composer install taking the wrong versions in the window between our tagging a new Parsoid release and merging the corresponding patch in mediawiki-vendor, in T344032 as I understand it the wmf.21 train was delayed, and so patches were backported to wmf.20 even after Parsoid had tagged and released for wmf.21. As a result, the composer install step in the wmf.20 CI ended up using the wmf.21 version of Parsoid instead of the proper version from mediawiki-vendor.
In T344032#9086445, @Umherirrender wrote: Parsoid doesn't seem to use semver right now, so pinning the version would help for master and wmf branches when bringing out breaking changes, but it makes the release process complicated.
Ok, let me try to see what's going on. Parsoid tagged -a21 and released it to mediawiki-vendor on Monday (https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/946615), well before the wmf.20 branch I believe. There's a "well known" bug in CI that means certain jobs don't run from mediawiki-vendor: T287419: `mediawiki-core-php72-phan-docker` job runs `composer install` instead of using packages from mediawiki/vendor. As a result, we broke CI for some period between when we tagged -a21 and when we actually merged it into mediawiki-vendor. Merging into mediawiki-vendor was "slow" (took a couple of hours) because there were some leftover dependencies in core on the old version, which the gate-and-submit for mediawiki-vendor correctly discovered. Once the mediawiki-vendor job merged, there weren't any further CI issues, AFAIK.
Sorry for the phab spam, the above patches belonged with T343849.
IMO this is a feature request ("detect a direct browser request for an .ogg file and transparently redirect to a transcoded version") which needs to be refined a bit further before it is actionable. I'm not convinced this is a parser issue -- it seems like something the iOS app and/or the media servers could do.
When the author puts a [[Media:]] link, they are explicitly requesting a direct file link. It's not entirely clear to me that we should be excessively clever here and in the process deny Safari users any way to download the original .ogg file (and perhaps play it with a proper app).
The selector seems to be broken and shows trwiki as having no feed content, even though it is enabled on trwiki.
I guess we need a stylesheet update in PCS?
Oh, wow, that sure /looks/ like a bug in ::stripAllTags(), but in fact it behaves exactly as documented (as you linked):
/**
 * Take a fragment of (potentially invalid) HTML and return
 * a version with any tags removed, encoded as plain text.
 *
 * Warning: this return value must be further escaped for literal
 * inclusion in HTML output as of 1.10!
 *
 * @param string $html HTML fragment
 * @return string
 * @return-taint tainted
 */
public static function stripAllTags( $html ) {
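So the round trip looks like this (behavior per the doc comment above: entities are decoded along with tag removal):

$text = Sanitizer::stripAllTags( '<b>AT&amp;T</b>' );  // "AT&T" -- plain text, entities decoded
echo htmlspecialchars( $text );                        // must be re-escaped before literal inclusion in HTML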
I'm contemplating changing the first parameter of LoginSignupSpecialPage::showSuccessPage() from string|Message to just plain Message, at least in part to improve phan-taint-check's accuracy: T343849. I could use some help understanding the impact of this bug, however (and of some similar code in SpecialContributions). It seems that either you're double-escaping here (maybe deliberately?) or we should be using something other than Message::text() when OutputPage::setPageTitle() is given a Message as its argument.
Note that we likely *do* support the {{DISPLAYTITLE}} markup, since that is processed by CoreParserFunctions in integrated mode and writes directly into the ParserOutput. However, we should double check that handling while we're writing test cases for -{T|....}-.
The spans were added to the displaytitle in T306440 (f7158c396d376fa12689c39fc3e7b3fffe34c184).
Hm, weird. Can't reproduce it locally even if I crank pcre.backtrack_limit all the way down to 9, although it does crash if pcre.backtrack_limit is 8 -- but the default PHP value is 1000000.
This should be the HTML in question:
Ooh, I would love to see the HTML involved in this one, since it looks like an "ordinary" call to ::parseHTML triggered it.
Possibly a transient during production scap when the library is briefly out of sync? Although there haven't been any changes to the langconv library since Jan 30, 2023, so I don't see how we'd even get into a bad sync situation.
The fact that logged out users (and other low-priority accesses) can possibly see "old" revisions is a deliberate decision for performance reasons. There is a flag to insist that a request go to the primary DB (not a secondary read replica) which we use in some very specific places to ensure that we don't run into this problem in editing workflows, but in general APIs are allowed to return stale revisions to protect against the Michael Jackson effect, as @Krinkle notes.
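The flag in question is, I believe, IDBAccessObject::READ_LATEST, used roughly like this to force a read from the primary DB:

// e.g., in an editing workflow where a stale revision is unacceptable:
$rev = $revisionLookup->getRevisionByTitle( $title, 0, IDBAccessObject::READ_LATEST );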
Related to a huge number of other bugs tweaking summary stripping, e.g. T330188: Remove duplicate parenthesis stripping in /page/summary logic.
See T259893: Content template using EasyTimeline not working properly in the app, which seems to indicate an incompatibility between the <area> tags generated by EasyTimeline and Parsoid output. If so, then "the legacy HTML is good enough" as @ssastry wrote above might /not/ be accurate? Needs investigation.
The actual "bug" here is "Parsoid support for Extension:EasyTimeline"; I don't know if that's been created yet.
From RefGroup.php in the Cite implementation:
if ( $refContentId ) {
	// `sup` is the wrapper created by Ref::sourceToDom()'s call to
	// `extApi->extTagToDOM()`. Only its contents are relevant.
	$sup = $extApi->getContentDOM( $refContentId )->firstChild;
	DOMUtils::migrateChildren( $sup, $reftextSpan );
The refContentId is mwf6, which *looks* reasonable as a content id, but it is not being found in the lookup. It could possibly be due to VE giving us bogus input, but it's also possible it's our fault. Needs investigation.
I wonder if this would work if you used Accept-Language: zh-hant-tw instead of zh-tw? Both ought to be valid, but the former is the "more correct" BCP-47 code.
Sorry, I did a quick first draft of this, patch above. It still needs some work, though, so happy to hand it off:
In T343226#9061363, @Tgr wrote: TemplateStyles parsing is not actually context-independent (as we want the first occurrence of a <templatestyles> reference to a given CSS page to turn into a <style> tag, but any subsequent ones into a no-op reference to that styles tag) and Parsoid will have to accommodate that somehow, but this specific attribute is not related to that.
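That is, roughly (output shapes approximate):

<templatestyles src="Foo/styles.css" />   <!-- first occurrence: becomes a <style> tag with the contents of the CSS page -->
<templatestyles src="Foo/styles.css" />   <!-- subsequent occurrences: become a no-op reference to that <style> tag -->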
One strawdog proposal is something like:
{{#wrapLang|<new-lang-code>|<content>}}
(which would improve with heredocs, T114432)
which, in addition to properly setting the lang and dir attributes on a <div> or <span> wrapper around the content, would also reset Parser::getTargetLanguage() while parsing the content.
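For example (hypothetical, since no such parser function exists yet):

{{#wrapLang|fr|Bonjour le monde}}

would render something like <span lang="fr" dir="ltr">Bonjour le monde</span>, with the content parsed while Parser::getTargetLanguage() returns fr.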