Page MenuHomePhabricator

Create linting rule for large tables
Closed, ResolvedPublic5 Estimated Story Points

Description

We'd like to add a linting rule that heuristically identifies large tables so that we can understand the problem better and find solutions. Since it's not 100% clear how to solve this problem, this should be a hidden linting rule so that editors are not made aware of pages with lint problems that they cannot fix.

Proposal

  • A linting rule exists that identifies "large" tables. A large table will be defined given the definition below.

Definition of large table

With Vector 2022's limited width any table with more than 5 columns is likely to be problematic so for the purpose of this lint, for now we're identify large tables using the following JavaScript (we can modify it later as we understand the issue better):

Array.from(document.querySelectorAll('table')).filter((table) => table.querySelectorAll('tr > td, tr > th')[0].parentNode.children.length > 5).length > 0

QA testing

Sign off steps

QA Results - Beta

ACStatusDetails
1T334528#8839063
2T334528#8839063

QA Results - Prod

ACStatusDetails
1T334528#8839110
2T334528#8839110

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ovasileva set the point value for this task to 5.Apr 13 2023, 4:48 PM

Hacked up a bit of test code in php to aid in the process of adapting the javascript snippet to be a linter rule in Parsoid using Parsoids DOM utils. Hoping Parsoid team members with more expertise in html pattern possibilities can comment on the snippet and this codes deficiencies or edge cases where it would fail to detect tables with more than 5 columns based on the html that can exist and be generated by Parsoid. This code can be run standalone :-)

<?php
/* Experimental php DOM manipulation code to detect tables with more than 5 cells per row. This will need
    to be recoded as a Parsoid linter rule using Parsoids DOMutils, but this code is handy as a stand alone testbed.
 * Shannon Bailey  4/20/23  Version 0.1  Wikimedia Foundation
*/

$testHtmlText = '<!DOCTYPE html>
<html>
   <head>
      <title>HTML Table</title>
   </head>
	
   <body>
      <table border = "1" width = "100%">
         <thead>
            <tr>
               <td colspan = "4">This is the head of the table</td>
            </tr>
         </thead>
         
         <tfoot>
            <tr>
               <td colspan = "4">This is the foot of the table</td>
            </tr>
         </tfoot>
         
         <tbody>
            <tr>
               <td>Cell 1</td>
               <td>Cell 2</td>
               <td>Cell 3</td>
               <td>Cell 4</td>
               <td>Cell 5</td>
               <td>Cell 6</td>
            </tr>
         </tbody>
         
      </table>
   </body>
	
</html>';

function processHTML( string $html ) {
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    /* Starting with Jon Robsons quick hack javascript to detect a table with more 5 cells wide
        Array.from(document.querySelectorAll('table')).filter((table) =>
            table.querySelectorAll('tr > td, tr > th')[0].parentNode.children.length > 5).length > 0

     for fun, I asked chatGPT to convert the javascript to PHP it and got this:

     $tables = array_filter( iterator_to_array( $doc->getElementsByTagName( 'table' ) ),
        function ( $table ) {
            return $table->getElementsByTagName( 'tr' )->item( 0 )->childNodes->length > 5;
    } );

    which works on test html if there is only a single <tr> inside a <table>, or the first one
    has more than 5 <td>, but fails when such a table is not the first <td>s due to the code using item(0).

    So I hacked that a bit more and this work for the test html, but I am concerned this is not
    correct for more patterns that can be generated by Parsoid, and this needs to be modified to use
    Parsoid DOM utils, not PHP's DOM library.
    */

    $tables = array_filter( iterator_to_array( $doc->getElementsByTagName( 'table' ) ),
        function ( $table ) {
            $count = 0;
            $subTables = $table->getElementsByTagName( 'tr' );
            $subTableCount = $subTables->length;
            for ( $i = 0; $i < $subTableCount; $i++ ){
                $item = $subTables->item( $i );
                $count += $item->childNodes->length > 5;
            }
            return $count;
        } );

    $hasTablesWithMoreThanFiveColumns = count( $tables );

    echo "the html has $hasTablesWithMoreThanFiveColumns tables with more than 5 columns\n";

//    echo $doc->saveHTML();
}

processHTML( $testHtmlText );

Change 912393 had a related patch set uploaded (by Mabualruz; author: Mabualruz):

[mediawiki/services/parsoid@master] Create linting rule for large tables

https://gerrit.wikimedia.org/r/912393

Change 912394 had a related patch set uploaded (by Mabualruz; author: Mabualruz):

[mediawiki/extensions/Linter@master] Create linting rule for large tables

https://gerrit.wikimedia.org/r/912394

Change 912395 had a related patch set uploaded (by Mabualruz; author: Mabualruz):

[mediawiki/core@master] Create linting rule for large tables

https://gerrit.wikimedia.org/r/912395

Change 912395 merged by jenkins-bot:

[mediawiki/core@master] Create linting rule for large tables

https://gerrit.wikimedia.org/r/912395

Change 912394 merged by Sbailey:

[mediawiki/extensions/Linter@master] Create linting rule for large tables

https://gerrit.wikimedia.org/r/912394

@Jdlrobson the category is disabled by default as requested, it need config patch to enable it.

Per the QA testing, we want the category accessible e.g. at URL https://en.wikipedia.beta.wmflabs.org/wiki/Special:LintErrors/large-tables but not visible on https://en.wikipedia.beta.wmflabs.org/wiki/Special:LintErrors (and also at production) for sign off for this ticket.

The config patch should not make the category appear for anyone who doesn't know the magic URL.

Change 913944 had a related patch set uploaded (by Mabualruz; author: Mabualruz):

[mediawiki/extensions/Linter@master] Create linting rule for large tables

https://gerrit.wikimedia.org/r/913944

Edtadros subscribed.

Test Result - Beta

Status:
Environment: beta
OS: macOS Ventura
Browser: Chrome
Device: MBP
Emulated Device:NA

Test Artifact(s):

QA Steps

✅ AC1: Visiting https://en.wikipedia.beta.wmflabs.org/wiki/Special:LintErrors should not show a section for "large-tables"

Screenshot 2023-05-04 at 9.22.36 PM.png (1×3 px, 504 KB)

❌ Visiting https://en.wikipedia.beta.wmflabs.org/wiki/Special:LintErrors/large-tables should list any articles which have large tables. This list should include https://en.wikipedia.beta.wmflabs.org/wiki/A_very_large_table
@Mabualruz I don't get any results, is there something I need to add to the search criteria?

Screenshot 2023-05-04 at 9.24.37 PM.png (1×3 px, 453 KB)

The Parsoid patch that identifies the lints is not yet merged and deployed. So, you should test once that is done -- the vendor patch will likely be Monday.

Change 915788 had a related patch set uploaded (by Mabualruz; author: Mabualruz):

[mediawiki/services/parsoid@master] Create linting rule for large tables

https://gerrit.wikimedia.org/r/915788

The patches will have final touches on Monday and will be ready for QA, Thanks for @Sbailey, and @ssastry. I will heave a pairing time with @ssastry Monday early to finalise the patches.

Change 916907 had a related patch set uploaded (by Mabualruz; author: Mabualruz):

[mediawiki/core@master] Wrong Key for messages

https://gerrit.wikimedia.org/r/916907

Change 912393 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Create linting rule for large tables

https://gerrit.wikimedia.org/r/912393

We need to check the possibility of performance when there is a very long table ( row count > 1000):

  • We can check the 1st header row and the 1st normal rows.
  • We can use row's childNodes-> length to reduce functional complexity

We check the number of rows before checking and we can decide to do a full scan or a smaller check depending on that. And that would require to define a threshhold.

This will be testable on the beta cluster after the content transform team have deployed, which should hopefully be shortly.

Change 917387 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@master] Bump parsoid to 0.18.0-a8

https://gerrit.wikimedia.org/r/917387

Jdlrobson updated the task description. (Show Details)

Change 917387 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.18.0-a8

https://gerrit.wikimedia.org/r/917387

Edtadros added a subscriber: Mabualruz.

Test Result - Beta

Status:
Environment: beta
OS: macOS Ventura
Browser: Chrome
Device: MBP
Emulated Device:NA

Test Artifact(s):

QA Steps

✅ AC1: Visiting https://en.wikipedia.beta.wmflabs.org/wiki/Special:LintErrors should not show a section for "large-tables"

Screenshot 2023-05-09 at 3.45.47 PM.png (901×1 px, 156 KB)

✅ AC2: Visiting https://en.wikipedia.beta.wmflabs.org/wiki/Special:LintErrors/large-tables should list any articles which have large tables. This list should include https://en.wikipedia.beta.wmflabs.org/wiki/A_very_large_table

Screenshot 2023-05-09 at 3.44.12 PM.png (1×1 px, 186 KB)

Edtadros closed this task as Resolved.EditedMay 9 2023, 10:58 PM
Edtadros removed Edtadros as the assignee of this task.

Test Result - Prod

Status: ✅ mediawiki
OS: macOS Ventura
Browser: Chrome
Device: MBP
Emulated Device:NA

Test Artifact(s):

QA Steps

✅ AC1: Visiting https://en.wikipedia.org/wiki/Special:LintErrors should not show a section for "large-tables"

Screenshot 2023-05-09 at 3.56.34 PM.png (855×1 px, 142 KB)

✅ AC2: Visiting https://www.mediawiki.org/wiki/Special:LintErrors/large-tables should list any articles which have large tables.
I clicked on a few of the pages and verified that they contained large tables that were not easily viewed on mobile.

Screenshot 2023-05-09 at 3.56.50 PM.png (1×1 px, 196 KB)

Legoktm subscribed.

Please fill in https://www.mediawiki.org/wiki/Help:Extension:Linter/large-tables with details about the purpose of this lint and what editors are expected (or not) do to when encountering it. (main reason why I'm re-opening)

✅ AC1: Visiting https://en.wikipedia.org/wiki/Special:LintErrors should not show a section for "large-tables"

Screenshot 2023-05-09 at 3.56.34 PM.png (855×1 px, 142 KB)

I don't agree with this, I think trying to hide it is more confusing than not. Because it's in the database, it still shows up in ?action=info, and external tools like lintHint, etc. If it's not ready for editors yet, we can put it in a separate "Ignore" category or something.

✅ AC1: Visiting https://en.wikipedia.org/wiki/Special:LintErrors should not show a section for "large-tables"

Screenshot 2023-05-09 at 3.56.34 PM.png (855×1 px, 142 KB)

I don't agree with this, I think trying to hide it is more confusing than not. Because it's in the database, it still shows up in ?action=info, and external tools like lintHint, etc. If it's not ready for editors yet, we can put it in a separate "Ignore" category or something.

I made a similar comment in T334527, which is still open. I don't really know what "NOTE: Blocked on T334527" means at the top of this task, but that task is still open, and this one was closed. That confuses me. Maybe the "hiding" of this new Linter tracking flag was not yet fully and properly enabled prior to the flag's deployment?

I don't agree with this, I think trying to hide it is more confusing than not. Because it's in the database, it still shows up in ?action=info, and external tools like lintHint, etc. If it's not ready for editors yet, we can put it in a separate "Ignore" category or something.

@Legoktm can you move this feedback to T334527 which is the task we're using to talk about motivation here. To be clear this is temporary while we fine tune the linting rule which is throwing up some false positives.

Jdlrobson claimed this task.

Documentation has been add. Thanks for the prompt!

Large-tables lint still requires a follow on patch to avoid a serious performance problem when tables greater than 1000 rows are encountered, looking at the number of rows and if under some threshold, checking them all, and over that threshold, looking at the first few rows and header row.

Got it, excellent, your on it :-)
Will add myself as a reviewer if not already included.
Thanks

Change 915788 abandoned by Mabualruz:

[mediawiki/services/parsoid@master] Linter: Check and Optimise performance for new linting rules

Reason:

https://gerrit.wikimedia.org/r/915788

Change 915788 restored by Mabualruz:

[mediawiki/services/parsoid@master] Linter: Check and Optimise performance for new linting rules

https://gerrit.wikimedia.org/r/915788

Change 916907 abandoned by Mabualruz:

[mediawiki/core@master] Wrong Key for messages

Reason:

https://gerrit.wikimedia.org/r/916907