Page MenuHomePhabricator

Vector page layout corrupted on cached pages
Closed, ResolvedPublic

Description

https://en.wikipedia.org/w/index.php?title=SpaceX&action=edit

Postmortem

This regression was caused by the combination of the unsafe generalization of a CSS selector and the lack of HTML -> CSS version binding:

  1. #p-personal selector (applied only to the PersonalMenu) generalized to .vectorMenu-default (patch 594601)
  2. .vectorMenu-default renamed to .vector-menu, thus further generalizing the selector, mixing the meanings of ALL menus and ONE kind of menu (personal or default or horizontal -- no consensus yet, how it should be named) (patch 595001)
  3. The confusing naming leads to the addition of .vector-menu to ALL menus in HTML, thus applying the PersonalMenu's CSS rules to every menu (patch 589395)
  4. Absolute positioning of .vector-menu is removed in this patch (patch 589395).
  5. The lack of HTML version -> CSS version binding in ResourceLoader causes the CSS from the previous week (train) to be served from cache with the new HTML on quickly regenerated pages - such as the Main Page - for up to 30 minutes (AFAICT), applying the absolute positioning of .vector-menu from the previous deployment to all menus, thus collapsing those onto the logo.
What can be learned:
  1. HTML -> CSS,JS version binding is very important
    • to avoid the traps of inconsistent DOM, styling and code causing such incidents
    • to spare the complexity of creating CSS, JS and HTML that works with the previous deployed HTML and CSS,JS, respectively.
    • to spare the extra work of cross-testing old HTML with new CSS and vice versa (not done) and the management burden of removing selectors 2 weeks later
    • I reckon this is currently a limitation of ResourceLoader, that would be solved by building assets pre-deploy or by forcing RL to build the required assets before the HTML is served.
  2. Consistency in refactoring avoids confusion. Extending the definition of a CSS selector (the set of elements matched) changes the meaning, leading to confusion, to such bugs. Newly added classes should be clearly defined and distinguished from existing classes, to avoid this.

Event Timeline

Krinkle created this task.May 28 2020, 7:15 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 28 2020, 7:15 PM
Krinkle triaged this task as Unbreak Now! priority.May 28 2020, 7:15 PM
Restricted Application added a subscriber: Liuxinyu970226. · View Herald TranscriptMay 28 2020, 7:15 PM

I saw this once or twice on Beta as well earlier this week, wasn't sure what caused it then.

Reproducible in safe mode as well, which rules out local wiki customisations and personal gadgets.

https://en.wikipedia.org/w/index.php?title=SpaceX&action=edit&safemode=on

jrbs renamed this task from Vector page layout corrupted on enwikii to Vector page layout corrupted on enwiki.May 28 2020, 7:19 PM
.vector-menu, #p-personal {
    position: absolute;
    
<div id="p-tb" class="vector-menu vector-menu-portal portal" aria-labelledby="p-tb-label" dir="ltr" lang="en-GB">
	<h3 id="p-tb-label">
		<span>Tools</span>
	</h3></div>

I don't see how that could work so yeah, clearly a breaking change that wasn't tested with forward and backward compat for cached CSS/HTML in mind.

Mentioned in SAL (#wikimedia-operations) [2020-05-28T19:20:39Z] <twentyafterfour> group2 back to wmf.32 due to T253905

Mentioned in SAL (#wikimedia-operations) [2020-05-28T19:20:45Z] <twentyafterfour@deploy1001> rebuilt and synchronized wikiversions files: roll back the train due to T253905

Jdforrester-WMF renamed this task from Vector page layout corrupted on enwiki to Vector page layout corrupted on cached pages.May 28 2020, 7:31 PM
Jdlrobson added a subscriber: Jdlrobson.EditedMay 28 2020, 7:33 PM
.vector-menu, #p-personal {
    position: absolute;
    
<div id="p-tb" class="vector-menu vector-menu-portal portal" aria-labelledby="p-tb-label" dir="ltr" lang="en-GB">
	<h3 id="p-tb-label">
		<span>Tools</span>
	</h3></div>

I don't see how that could work so yeah, clearly a breaking change that wasn't tested with forward and backward compat for cached CSS/HTML in mind.

@Krinkle I'm just catching up with this one however it doesn't make sense. The issue should only impact modern not legacy. Where are you seeing that CSS rule? That shouldn't be in legacy it should only be modern.
That CSS rule is only loaded if skins.vector.styles is loaded.

Is it possible a gadget is pulling in skins.vector.styles instead of skins.vector.styles.legacy ?

Reproducible in safe mode as well, which rules out local wiki customisations and personal gadgets.

https://en.wikipedia.org/w/index.php?title=SpaceX&action=edit&safemode=on

Just saw this.. that just adds to my confusion. Do you have a RAW HTML dump of the page?

Izno added a subscriber: Izno.May 28 2020, 7:37 PM
  • This happened on the default/legacy version. The skins.vector.styles.legacy module is loaded, not the newer one.
  • The HTML is generated by the server fresh. This is affecting only new renderings after the change (so it would spread further the longer it is deployed). The old/cached HTML only set vector-menu on #p-personal. But the broken/new version sets it on every portlet and tab area as well, thus causing the broken layout.
  • The CSS rule .vector-menu, #p-personal is/was part of the default Vector stylesheet. If you inspect any production page view now, and find #p-personal you'll see it.
Marc added a subscriber: Marc.May 28 2020, 7:50 PM

I noticed it too, on nl-wiki. As far as I could see in the short time it happened, it affected only action=history pages . The page I still have open mentions skins.vector.styles.legacy in the loaded styles:

RLSTATE={"ext.gadget.updatemarker":"ready","ext.globalCssJs.user.styles":"ready","site.styles":"ready","noscript":"ready","user.styles":"ready",
"ext.globalCssJs.user":"ready","user":"loading","user.options":"loading","mediawiki.interface.helpers.styles":"ready","mediawiki.action.history.styles":"ready","mediawiki.special.changeslist":"ready","mediawiki.helplink":"ready","oojs-ui-core.styles":"ready","oojs-ui.styles.indicators":"ready","mediawiki.widgets.styles":"ready","oojs-ui-core.icons":"ready","mediawiki.htmlform.ooui.styles":"ready","mediawiki.htmlform.styles":"ready","mediawiki.widgets.DateInputWidget.styles":"ready","jquery.makeCollapsible.styles":"ready","mediawiki.ui.button":"ready","mediawiki.feedlink":"ready","skins.vector.styles.legacy":"ready","ext.visualEditor.desktopArticleTarget.noscript":"ready","ext.echo.styles.badge":"ready","oojs-ui.styles.icons-alerts":"ready","ext.uls.interlanguage":"ready","ext.wikimediaBadges":"ready"};
<link rel="stylesheet" href="/w/load.php?lang=nl&amp;modules=ext.echo.styles.badge%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cjquery.makeCollapsible.styles%7Cmediawiki.action.history.styles%7Cmediawiki.feedlink%2Chelplink%7Cmediawiki.htmlform.ooui.styles%7Cmediawiki.htmlform.styles%7Cmediawiki.interface.helpers.styles%7Cmediawiki.special.changeslist%7Cmediawiki.ui.button%7Cmediawiki.widgets.DateInputWidget.styles%7Cmediawiki.widgets.styles%7Coojs-ui-core.icons%2Cstyles%7Coojs-ui.styles.icons-alerts%2Cindicators%7Cskins.vector.styles.legacy&amp;only=styles&amp;skin=vector"/>

Change 599413 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/skins/Vector@wmf/1.35.0-wmf.32] HOTFIX: Do not apply p-personal absolute positioning to all menus

https://gerrit.wikimedia.org/r/599413

Change 599413 merged by jenkins-bot:
[mediawiki/skins/Vector@wmf/1.35.0-wmf.32] HOTFIX: Do not apply p-personal absolute positioning to all menus

https://gerrit.wikimedia.org/r/599413

Mentioned in SAL (#wikimedia-operations) [2020-05-28T20:38:40Z] <jforrester@deploy1001> Synchronized php-1.35.0-wmf.32/skins/Vector/resources/skins.vector.styles: T253905 HOTFIX: Do not apply p-personal absolute positioning to all menus (duration: 01m 07s)

Jdforrester-WMF closed this task as Resolved.May 28 2020, 8:54 PM
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

Provisionally claiming fixed. We'll see when we roll.

Demian added a subscriber: Demian.EditedMay 29 2020, 1:27 AM

The old/cached HTML only set vector-menu on #p-personal. But the broken/new version sets it on every portlet and tab area as well, thus causing the broken layout.

vector-menu class was added to all menus, both legacy and modern in:
https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/589395/39/includes/VectorTemplate.php#452
merged on May 13

The absolute positioning was removed from .vector-menu CSS in the same patch (this is common for legacy and modern):
https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/589395/39/resources/skins.vector.styles/Menu.less#b13

This was tagged with wmf.34, not wmf.32: T249372#6131667

This was a very weird and unexpected bug, it would benefit us to understand the cause of it.
These are the pieces of the puzzle:

  1. .vector-menu in wmf.32 had position: absolute rule (patch 32->34).
  2. in wmf.34 the position: absolute rule was removed from .vector-menu, separated into #p-personal in CSS (patch 32->34).
  3. in wmf.34 .vector-menu was added to all menus (including sidebar) in HTML (patch 32->34).

The result was clients showing the sidebar with .vector-menu class (new html generated by wmf.34), but position: absolute (old CSS from wmf.32).

It was my understanding that HTML invalidation might come later than CSS, but in this case it's the opposite.
What is the expert explanation of exactly what happened, causing this effect?
At which layer was the old CSS cached and how CSS gets invalidated at each layer?

Jdlrobson added a comment.EditedJun 1 2020, 7:39 PM

@Demian sharing my incident report that I shared to the team:

Old CSS can be shipped with new HTML for up to 10mins.
It means if you make HTML+CSS changes that impact anons you need to test your page on old CSS (I wonder if there’s a way we can setup something to help test this)
If we’d waited 5-10 mins, that bug would have gone away by itself. However, 5-10 mins of that is not acceptable.
To fix this, i had to patch 1.35.0-wmf.32 (note we were blocking wmf.34) - confusingly this is not the branch we deployed but the last branch. James is deploying my patch as we speak.
This fix will live in a branch for a mere 10 mins on production to allow cached HTML to update. It will be wiped out with the deploy of wmf.34 that we blocked.

It was my understanding that HTML invalidation might come later than CSS, but in this case it's the opposite.

HTML and CSS are cached separately. TTL for CSS is 5-10mins, so although new HTML was being rendered the old cached CSS was being served.

This is very useful, thank you! And very troubling as well. There's a lot to improve.

Old CSS can be shipped with new HTML for up to 30mins.

vs

TTL for CSS is 5-10mins,

Which one actually?
What about JS? Same as CSS?

Demian updated the task description. (Show Details)Jun 12 2020, 11:08 AM