
Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer)
Closed, ResolvedPublic

Assigned To
Authored By
Dzahn
Sep 6 2022, 11:19 PM

Description

enwiki changed its footer and license url, leading to further changes and subsequent alerts. We need to find the right balance between flexibility and legal requirements.

Original description:

We have specific monitoring for legal to make sure the projects have the correct legal / license footers.

https://en.wikibooks.org apparently has recently changed the footer leading to this alert:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=en.wikibooks.org

"Textsissavailablesundersthe <ashref="//creativecommons.org/licenses/by-sa/3.0/">CreativesCommonssAttribution-ShareAlikesLicense.</a>: additionalstermssmaysapply. html not found"

Current text is:

Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. By using this site, you agree to the Terms of Use and Privacy Policy.

The similar check for en.wikipedia is not affected.

{F35511388}

Event Timeline

Due to this diff; it's just a punctuation fix, not anything that modifies the meaning of the footer.

@Vahurzpu Thank you! ACK

This is not the first time this has happened because of a small change. Part of the point of this ticket is that any minimal change causes a false alert that looks pretty big (CRIT in prod monitoring).

Then it causes need for another code change to adjust to the wiki status quo.

fgiunchedi triaged this task as Medium priority.Sep 7 2022, 9:50 AM

@fgiunchedi Maybe it's also ok if we just rename this ticket to "move those alerts to alertmanager" to recycle it? Would that be desired?

I would have fixed it if there were separate definitions for the valid text for each of the projects.

But there isn't. The check_legal_html.py expects all projects to have the same license.

So I think wikibooks should just revert their change instead of us trying to accommodate it.

@Vahurzpu cc: to the above. I don't know what to do about it, I am just reporting it and would have made an easy fix but it's not an easy fix. We would have to implement separate checks per project.. unless you guys just revert it on wikibooks or the observability team wants to replace the whole script with new code and move it to alertmanager. So I am more or less out here.

After pondering this a bit more I now think the _actual fix_ would be if Wikipedia and other projects just also fix the same punctuation issue that Wikibooks fixed. Then all projects are the same again, we adjust the monitoring with a 5 min change and we can close this.

But who could make that happen on the wikis?

Change 868037 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] icinga: Make the punctuation error optional on check

https://gerrit.wikimedia.org/r/868037

After pondering this a bit more I now think the _actual fix_ would be if Wikipedia and other projects just also fix the same punctuation issue that Wikibooks fixed. Then all projects are the same again, we adjust the monitoring with a 5 min change and we can close this.

But who could make that happen on the wikis?

While I agree, let's be practical and compromise, let's make the dot optional so the alert works as intended, resolve the technical issue (above patch) and then recommend the update to legal, which they will then be able to make without triggering an alert.
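As a rough illustration of what "making the dot optional" could look like, here is a minimal sketch. It is not the merged patch: the pattern, the `LICENSE_SENTENCE` name and the `footer_matches` helper are all hypothetical, and it matches the plain footer sentence rather than the HTML that the real check inspects.

```python
import re

# Hypothetical pattern, for illustration only (not the real check's regex).
LICENSE_SENTENCE = re.compile(
    r"Text\s+is\s+available\s+under\s+the\s+"
    r"Creative\s+Commons\s+Attribution-ShareAlike\s+License"
    r"\.?"          # the disputed period is optional
    r"\s*[;:]?\s*"  # accept either separator, or none
    r"additional\s+terms\s+may\s+apply"
)


def footer_matches(text: str) -> bool:
    """Return True if the license sentence is present in either punctuation variant."""
    return LICENSE_SENTENCE.search(text) is not None
```

With a pattern like this, both punctuation variants of the footer pass, so a wiki can fix the punctuation without triggering a CRIT, while a footer naming a different license still fails.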

Change 868037 merged by Dzahn:

[operations/puppet@production] icinga: Make the punctuation error optional on check

https://gerrit.wikimedia.org/r/868037

While I agree, let's be practical and compromise, let's make the dot optional so the alert works as intended

Agreed, thanks @jcrespo. After I saw today a +1 from @fgiunchedi on your patch I deployed it to prod Icinga. Thanks both!

Shortly after:

22:41 <+icinga-wm> RECOVERY - Ensure legal html en.wb on en.wikibooks.org is OK: all html is present. https://phabricator.wikimedia.org/project/members/28/
Dzahn claimed this task.

before:

Screenshot from 2023-01-03 14-40-11.png (377×1 px, 42 KB)

after:

Screenshot from 2023-01-03 14-43-52.png (372×770 px, 27 KB)

P.S. Shortly after this I got an "out of office" auto-reply from someone in legal, so this confirms the alert is actively reaching legal.

Thank you, while I understand why this was left open- sometimes a partial fix may only make things worse- in this specific case, I think this was a prerequisite for opening a ticket suggesting that legal and the MW devs change the text to the right spelling everywhere without alerting (or revert). We can also talk to them about whether they want to be alerted, and how, the next time it changes.

Wait, did my patch break the check for enwiki? @Dzahn ?

Screenshot_20230109_121533.png (89×2 px, 31 KB)

jcrespo renamed this task from en.wikibooks.org has changed legal footer to Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer).Jan 9 2023, 11:49 AM
jcrespo reopened this task as Open.
jcrespo updated the task description. (Show Details)

Increasing the scope to make sure the alert doesn't reoccur. I have sent a message on the change request, hoping someone from the community can weigh in.

I will now ping legal so they can also join the conversation.

As the WMF-Legal project tag was added to this task, some general information to avoid wrong expectations:
Please note that public tasks in Wikimedia Phabricator are in general not a place where to expect feedback from the Legal Team of the Wikimedia Foundation due to the scope of the team and/or nature of legal topics. See the project tag description.
Please see https://meta.wikimedia.org/wiki/Legal for when and how to contact the Legal Team. Thanks!

Let's wait and see what legal's opinion is first, and then we can plan either a technical solution or an admin editing procedure for changes to the footer. :-)

Thanks for resolving this while I was out. There are no legal concerns with making the period optional in the footer.

I'd also be open to other solutions for monitoring (or protecting) the legally significant parts of the footer. It doesn't necessarily need to be through this system.

@Slaporte - I am thinking of modifying the script to check that the important bits are there while allowing the wording to change without us being notified, so small punctuation can be changed or things renamed. That way we don't receive an alert every time a dot or a space is changed. The last issue was that the license text article was renamed, which made the alert fail. Assuming that is possible, we can ask the editor community to run the script after editing to make sure it doesn't fail, and only if it fails (because of substantial changes) to file a ticket to warn us (operations); we can then notify you if the changes are non-trivial.

I will try to code a more robust check and see if everybody is ok with the changes. CC @Xaosflux

Change 878010 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] check_legal_terms: Refactor check to make it more robust against changes

https://gerrit.wikimedia.org/r/878010

So the above is my proposal for the check: rather than checking exact wording, check for the existence of keywords, and actually follow the links and check that what is linked also has all expected keywords.

So for example, the copyright check only checks that the footer has a link containing the words "creative", "commons", "attribution" and "sharealike" OR "CC", "BY" and "SA", and that it links either to version 3 of the CC-BY-SA on the Creative Commons website, or to another location (e.g. an internal link) containing the words 'license', 'creative', 'commons', 'share', 'remix', 'attribution', 'share alike', and '3.0'.
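The keyword rule for the link text can be sketched roughly as follows. This is an assumption-laden illustration, not the code from the patch: the names `LONG_FORM`, `SHORT_FORM`, `tokens` and `looks_like_license_link` are hypothetical, and splitting the text into lowercase alphanumeric tokens is my own choice to keep short words like "by" from matching inside longer words.

```python
import re

# Hypothetical keyword groups, taken from the words listed in the comment above.
LONG_FORM = {"creative", "commons", "attribution", "sharealike"}
SHORT_FORM = {"cc", "by", "sa"}


def tokens(text: str) -> set:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def looks_like_license_link(link_text: str) -> bool:
    """True if the link text has every long-form keyword OR every short-form one."""
    found = tokens(link_text)
    return LONG_FORM <= found or SHORT_FORM <= found
```

Because only the presence of the keywords matters, punctuation and word order can change freely. The trade-off is that spelling variants such as "Share Alike" (two words) would need their own keyword group.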

This is a better check than before in several ways:

  • The text can be slightly altered (e.g. punctuation) as long as it always keeps the key words
  • The content linked is checked (this is new!) allowing the internal links to be renamed, but making sure the right keywords are also on the linked license
  • No more regexes, which are not a good way to parse HTML
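To make the "parse the HTML, then check the linked content" idea concrete, here is a small sketch. It uses the standard library `html.parser` only so that it is self-contained; the actual script depends on BeautifulSoup4, as the later dependency patch in this task shows. The class and function names, and the injected `fetch` callable standing in for the HTTP download, are all hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> in a document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def linked_page_ok(html, link_words, page_words, fetch):
    """Find the first link whose text has all link_words, then require
    all page_words in the page it points to (fetch maps href -> page text)."""
    for href, text in collect_links(html):
        if all(word in text.lower() for word in link_words):
            page = fetch(href).lower()
            return all(word in page for word in page_words)
    return False
```

Since the check follows whatever `href` the footer currently uses, an internal license page can be renamed without failing the check, as long as the linked page still contains the expected keywords.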

The checks still apply in this way for the 4 checks implemented: copyright/license, terms of use, privacy policy and Wikipedia trademark. I think this should make Legal happy, as we have increased the reliability of the checks; make the editing community happy, because they are now able to move the internal link to the license or fix formatting and punctuation issues; and make us operators happy because we won't be alerted every time one of those minor changes happens.

The script also has improved logging and debugging so it is easy to detect what check is failing if it does in the future.

I would like first a technical review from @Dzahn and/or @fgiunchedi, and if they like it an explicit ok from all parties here would be nice- I personally think this is a better check system than before (although a bit more complex to follow).

This is an example of its execution in verbose mode, you can see it is able to find and crawl the referred texts:

python3 modules/icinga/files/check_legal_html.py -site=https://en.wikipedia.org/wiki/Main_page -ensure=desktop_enwp -v
2023-01-10 15:16:50,744 INFO: Checking site: https://en.wikipedia.org/wiki/Main_page
2023-01-10 15:16:50,744 INFO: Downloading website: https://en.wikipedia.org/wiki/Main_page
2023-01-10 15:16:51,040 INFO: Executing check of copyright...
2023-01-10 15:16:51,040 INFO: Checking text: "creative commons attribution-sharealike license 3.0"
2023-01-10 15:16:51,040 INFO: Expected word creativecommons.org is missing!
2023-01-10 15:16:51,040 INFO: Downloading website: //en.wikipedia.org/wiki/Wikipedia:Text_of_the_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
2023-01-10 15:16:51,343 INFO: copyright html found correctly
2023-01-10 15:16:51,343 INFO: Executing check of terms...
2023-01-10 15:16:51,343 INFO: Checking text: "Terms of Use"
2023-01-10 15:16:51,344 INFO: Downloading website: //foundation.wikimedia.org/wiki/Terms_of_Use
2023-01-10 15:16:51,608 INFO: terms html found correctly
2023-01-10 15:16:51,608 INFO: Executing check of privacy...
2023-01-10 15:16:51,608 INFO: Downloading website: //foundation.wikimedia.org/wiki/Privacy_policy
2023-01-10 15:16:51,980 INFO: privacy html found correctly
2023-01-10 15:16:51,980 INFO: Executing check of trademark...
2023-01-10 15:16:51,980 INFO: Checking Wikipedia® trademark mention...
2023-01-10 15:16:51,980 INFO: trademark html found correctly
All legal html excerpts are present for https://en.wikipedia.org/wiki/Main_page (desktop site): copyright, terms, privacy, trademark
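The log above shows the footer links as protocol-relative URLs (starting with `//`), which have to be resolved against the page being checked before they can be downloaded. A minimal sketch of that step, assuming the standard library's `urljoin` is an acceptable way to do it (the `absolute_url` helper name is hypothetical, and the real script may resolve links differently):

```python
from urllib.parse import urljoin


def absolute_url(page_url: str, href: str) -> str:
    """Resolve a footer href (relative or protocol-relative) against the page URL."""
    return urljoin(page_url, href)


site = "https://en.wikipedia.org/wiki/Main_page"
print(absolute_url(site, "//foundation.wikimedia.org/wiki/Terms_of_Use"))
# prints https://foundation.wikimedia.org/wiki/Terms_of_Use
```

The same call also handles site-relative links such as `/wiki/...`, so internal and external footer links can go through one code path.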

Thank you, while I understand why this was left open- sometimes a partial fix may only make things worse

Yea, well, that was my opinion as well but you said "let's be pragmatic instead" so I took that as kind of the opposite.

Thanks for following up on the process!

I would like first a technical review from @Dzahn and/or @fgiunchedi

I like it and thanks for suggesting it! I also would like an approach where SRE isn't involved at all and it's just "alert goes to legal, legal checks with community".

That being said I'm totally ok with it either way and think it's more for observability to agree and at some point move the check out of Icinga, I assume.

Dzahn removed Dzahn as the assignee of this task.Jan 10 2023, 3:56 PM

@Slaporte I got the technical ok to deploy the new version of the check. Here is how it works. Rather than checking exact phrases, it searches for links with key words AND checks the content of those links (thanks to the useful feedback from @Xaosflux !), so if in the future the pages linked are moved without updating the footer, the content is still checked correctly.

To get your ok, I have made some mockup pages with examples. The first one is a correct one, a copy of the current English Wikipedia footer; the others are each missing some data, in ways that are sometimes not trivial for a human to spot.

The pages live at https://people.wikimedia.org/~jynus/ :

As a test, the check fails everywhere except for the correct one:

$ python3 modules/icinga/files/check_legal_html.py -site=https://people.wikimedia.org/~jynus/good.html -ensure=desktop_enwp
All legal html excerpts are present for https://people.wikimedia.org/~jynus/good.html (desktop site): copyright, terms, privacy, trademark
$ python3 modules/icinga/files/check_legal_html.py -site=https://people.wikimedia.org/~jynus/missing_license.html -ensure=desktop_enwp
ERROR: copyright html not found for https://people.wikimedia.org/~jynus/missing_license.html (desktop site).
$ python3 modules/icinga/files/check_legal_html.py -site=https://people.wikimedia.org/~jynus/bad_license_link.html -ensure=desktop_enwp
ERROR: copyright html not found for https://people.wikimedia.org/~jynus/bad_license_link.html (desktop site).
$ python3 modules/icinga/files/check_legal_html.py -site=https://people.wikimedia.org/~jynus/missing_privacy.html -ensure=desktop_enwp
ERROR: privacy html not found for https://people.wikimedia.org/~jynus/missing_privacy.html (desktop site).
$ python3 modules/icinga/files/check_legal_html.py -site=https://people.wikimedia.org/~jynus/bad_privacy_link.html -ensure=desktop_enwp
ERROR: privacy html not found for https://people.wikimedia.org/~jynus/bad_privacy_link.html (desktop site).
$ python3 modules/icinga/files/check_legal_html.py -site=https://people.wikimedia.org/~jynus/missing_terms.html -ensure=desktop_enwp
ERROR: terms html not found for https://people.wikimedia.org/~jynus/missing_terms.html (desktop site).
$ python3 modules/icinga/files/check_legal_html.py -site=https://people.wikimedia.org/~jynus/bad_terms_link.html -ensure=desktop_enwp
ERROR: terms html not found for https://people.wikimedia.org/~jynus/bad_terms_link.html (desktop site).
$ python3 modules/icinga/files/check_legal_html.py -site=https://people.wikimedia.org/~jynus/missing_trademark.html -ensure=desktop_enwp
ERROR: trademark html not found for https://people.wikimedia.org/~jynus/missing_trademark.html (desktop site).

Let me know if this looks reasonable. At least it looks in line with the guidance given at T108081, but please confirm.

Change 878010 merged by Jcrespo:

[operations/puppet@production] check_legal_terms: Refactor check to make it more robust against changes

https://gerrit.wikimedia.org/r/878010

Change 879280 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] icinga: Add BeautifulSoap4 python dependency for check_legal

https://gerrit.wikimedia.org/r/879280

Change 879280 merged by Jcrespo:

[operations/puppet@production] icinga: Add BeautifulSoup4 python dependency for check_legal

https://gerrit.wikimedia.org/r/879280

@Xaosflux So in the end, no change of procedure is needed for each community. With the improved monitoring tool, we will monitor the footers with the new checks, which should be more robust against trivial changes, like punctuation or license page renames. If there is a change that the tool considers problematic, we will now set up the alert to ping legal, and they will contact the community to address the problems (only for actual issues, like a missing copyright or terms of service).

The only thing still pending before closing this ticket is connecting those alerts to the legal contact point.

Change 879601 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] icinga:Update legal check to link to wikitech and add legal contact

https://gerrit.wikimedia.org/r/879601

Change 879601 merged by Jcrespo:

[operations/puppet@production] icinga:Update legal check to link to wikitech and add legal contact

https://gerrit.wikimedia.org/r/879601

So the updated alarm has been deployed. Now the ticketing system from legal will be targeted to avoid a SPOF, although leaving for now @Slaporte's contact, just in case (we can remove it later much more easily). For now, only Slaporte has access from legal to ack/disable/etc. the check (in addition to other global root admins).

Thanks to everybody in legal, SRE and the community who helped make Wikimedia safer and more reliable, be it in code, reviews or feedback. The fact that you quickly jumped in to help was much appreciated in fixing this long overdue piece of technical debt.

Obviously the check could improve in the future to be more thorough, complete and bug-free, but I think the scope of this ticket was satisfied! Thank you again.