
Add Extension:WikiSEO to English Wikiversity
Open, Stalled, Needs Triage, Public

Description

Could the extension WikiSEO be activated on en.wikiversity.org?

It'll be useful for adding <meta name="citation_.*"> tags to projects such as WikiJournals and Eventmath.

(Discussion of configuration & vote here).

Event Timeline

Aklapper changed the task status from Open to Stalled. Nov 3 2021, 7:51 AM

@Thomas_Shafee: Setting project to Wikimedia-Site-requests, as this request is about settings / configuration of a Wikimedia website. In the future, please follow https://meta.wikimedia.org/wiki/Requesting_wiki_configuration_changes when requesting such site configuration changes.
As this codebase is not yet on any Wikimedia servers, please see https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment for required steps.

Also, this seems to be only about one Wikiversity, if I understand correctly?

Hi @Thomas_Shafee, could you provide the link to the discussion made about adding this extension? And could you tell us if the task is for en.wikiversity only?

Aklapper renamed this task from Adding Extension:WikiSEO to Wikiversity to Add Extension:WikiSEO to English Wikiversity. Nov 4 2021, 9:45 AM

It would probably make more sense to create a purpose-built extension for that than to use WikiSEO (imho)

Is there capacity for someone to make a dedicated extension? Essentially all that needs to be done is to pull relevant fields from the Wikidata item linked to a particular page and add a set of standardised <meta name="citation_.*"> tags. For example:

<meta name="citation_doi" content="{{{doi| {{#invoke:WikidataIB |getValue |qid={{{Q|{{PAGEQID}}}}} |P356 |fetchwikidata=ALL |onlysourced=no |noicon=true |maxvals=1}} }}}">

<meta name="description"           content="...">
<meta name="citation_abstract"     content="...">
<meta name="keywords"              content=""..."">
<meta name="citation_doi"          content="...">
<meta name="citation_title"        content="...">
<meta itemprop="name"              content="...">
<meta name="citation_journal_title" content="...">
<meta name="citation_journal_abbrev" content="...">
<meta name="citation_date"         content="...">
<meta name="citation_firstpage"    content="...">
<meta name="citation_issue"        content="...">
<meta name="citation_volume"       content="...">
<meta name="citation_issn"         content="...">
<meta name="citation_publisher"    content="...">
<meta name="citation_pdf_url"      content="...">
<meta name="citation_article_type" content="...">
<meta name="dc.identifier"         content="...">

I'd thought that'd be easiest via the existing WikiSEO extension, but I'm open to a bespoke extension if that's easier.

Is there a particular reason why WikiSEO is proposed, and not another extension?

Hey Thomas,

First of all, let me be very honest in this message. I know you probably won't like what I'm going to say, but please note that I am writing here in my volunteer capacity (as are most other people commenting here), so please don't shoot the messenger :-).

In my opinion, both options (using an extension that exists, but is not in use by any of the Wikimedia projects; or writing a new extension from scratch) are quite unrealistic (for similar reasons).

Let me quickly describe how it works. The first step is writing the code of the extension. That's actually one of the easier parts -- writing an extension for this use case isn't hard (and for more complex use cases, that only means the next steps are even harder :)). After the code is created, it needs to be carefully examined, mainly for security and performance. This step is more difficult, mainly because only a limited set of people can do such a task. While anyone with enough knowledge can download the extension and perform their own security review (for example), only the Wikimedia security team can give an authoritative assessment of the code. All those reviews can easily take several months or even years: for an example of a recent security review, see T269291 (and that was a project backed by a WMF team, and partially taken from an already-reviewed extension).

Once all the reviews are completed and all requested changes are made, the extension is ready for deployment. However, that's not the end of the story: once an extension is deployed, it needs to be maintained. In other words, if it is found to cause stability, security or other issues, it needs to be fixed ASAP. Otherwise, the extension will likely get undeployed again.

Just to make sure I'm not misunderstood: I'm not saying this can't happen at all. Semi-recently, a couple of new extensions were deployed, and some of them are purely volunteer-led. So it's definitely possible, but it's quite hard.

@Aklapper listed a bunch of other extensions, but as far as I can see, none of them is currently in use on the Wikimedia projects, which means they likely didn't pass any of the reviews (or even start the review process).

In the light of the facts described above, let me ask two important questions:

  • Is there someone who would be both willing and able to take the extension through all the steps and do post-deployment maintenance?
  • Could you describe the use case for the tags? In other words, what do you expect to happen? There might be a simpler solution than a (new) extension, but that's hard to think about without knowing the wider context.

If you have any questions, feel free to ask -- I will be happy to answer them.

Best wishes,

Martin Urbanec

Maybe I can add some info as the developer of this extension. I've forked the code around summer 2019 and have since rewritten it completely and deployed it on personal wikis.

Regarding bugs and the question of security, I try to respond to issues ASAP, as WikiSEO is deployed on wikis I maintain.
To my knowledge, Miraheze has this extension deployed site-wide without any performance or security issues.

That being said, I am already maintaining this extension and planning to do so for the foreseeable future.

I'd more than welcome an official security review from the WMF.
Plus, a lot of the points mentioned in https://meta.wikimedia.org/wiki/Requesting_wiki_configuration_changes are already done.

I had a short chat with @sbassett; with my Miraheze sysadmin hat on, I can say we've had no issues :)

Hello @Octfx, glad to hear the extension has a maintainer.

I very quickly skimmed through the extension's code and played with it on my dev wiki. I found several things that I don't understand. Of course, it's by no means a full review -- it's only a few things I noticed after spending a few minutes with the code.

Possible performance issues
  • Unless I'm missing something, PageHooks::onBeforePageDisplay() appears to always trigger a query like select pp_propname, pp_value from page_props where pp_page=XXX any time I display the main page, even though I have memcached enabled on my dev wiki. Always running a DB query on a page view looks like a bad pattern. Is there any reason not to implement a caching mechanism?
  • Similarly, PageHooks::onRevisionDataUpdates() runs DeferredDescriptionUpdate. That class always runs a DELETE and then an INSERT, even though the description might be unchanged after the edit is saved. It's always a good idea to avoid queries on the primary DB server. It's reasonably easy to scale replica queries if needed (by buying more replicas), but you can't scale primary queries -- there's always only one primary server. I recommend running a SELECT on a replica first, comparing with the new description, and only then running an UPDATE or an INSERT. That way, for edits that didn't update the description, the primary DB won't be hit with any queries (by the extension). For edits that do change/set a description, the primary DB gets only one query (instead of the current two per edit).
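The write-avoidance pattern described above can be sketched roughly as follows. This is a simplified Python sketch with a stand-in DB interface (`select_one`, `execute` are hypothetical), not the extension's actual PHP code:

```python
def update_description(replica, primary, page_id: int, new_desc: str) -> bool:
    """Only touch the primary DB when the description actually changed.
    Returns True if a write was issued."""
    # 1. Cheap read on a replica.
    current = replica.select_one(
        "SELECT pp_value FROM page_props"
        " WHERE pp_page = ? AND pp_propname = 'description'",
        (page_id,),
    )
    # 2. No change -> zero primary-DB queries for this edit.
    if current == new_desc:
        return False
    # 3. Changed -> a single write on the primary (UPDATE or INSERT).
    if current is None:
        primary.execute(
            "INSERT INTO page_props (pp_page, pp_propname, pp_value)"
            " VALUES (?, 'description', ?)",
            (page_id, new_desc),
        )
    else:
        primary.execute(
            "UPDATE page_props SET pp_value = ?"
            " WHERE pp_page = ? AND pp_propname = 'description'",
            (new_desc, page_id),
        )
    return True
```

The key design point is step 2: most edits don't change the description, so most edits never reach the primary server at all.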
Miscellaneous

Docs for loadPagePropsFromDb indicate it limits the properties returned to Validator::$validParams, but that doesn't appear to happen:

$ php maintenance/shell.php --wiki=awiki
Psy Shell v0.10.9 (PHP 7.2.34-23+0~20210701.63+debian10~1.gbpd7cd48 — cli) by Justin Hileman
>>> use MediaWiki\Extension\WikiSEO\WikiSEO;
>>> $seo = new WikiSEO();
=> MediaWiki\Extension\WikiSEO\WikiSEO {#3046}
>>> sudo $seo->loadPagePropsFromDb(60)
<warning>PHP Notice:  unserialize(): Error at offset 0 of 3 bytes in /home/urbanecm/unsynced/gerrit/mediawiki/extensions/WikiSEO/includes/WikiSEO.php on line 235</warning>
=> [
     "wikibase_item" => "Q43",
   ]

Plus, it yields a notice, which would likely cause logspam in WMF production (see Why is logspam a problem for details).
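What the documented behavior would look like can be sketched as follows: restrict the returned page props to an allow-list of valid parameter names, and tolerate values that fail to deserialize instead of emitting a notice. This is a Python sketch; `VALID_PARAMS` is a hypothetical stand-in for `Validator::$validParams`, and JSON stands in for PHP's serialize() format:

```python
import json

# Hypothetical stand-in for Validator::$validParams in the extension.
VALID_PARAMS = {"description", "keywords", "title"}

def load_page_props(rows: dict) -> dict:
    """Keep only recognised properties; fall back to the raw string when a
    value fails to decode, rather than raising a warning."""
    props = {}
    for name, raw in rows.items():
        if name not in VALID_PARAMS:
            continue  # e.g. 'wikibase_item' would be dropped here
        try:
            props[name] = json.loads(raw)
        except (TypeError, ValueError):
            props[name] = raw  # plain string value: keep as-is, no logspam
    return props
```

With this shape, the `wikibase_item => Q43` row from the shell session above would simply be filtered out rather than returned alongside a deserialization notice.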

Thanks for those insights @Urbanecm!
This sounds like unintended behavior; I'll address it shortly.
Some aspects of the extension may need updating, as worldwide scalability wasn't one of my concerns when writing the code.

@Urbanecm Thanks for the clear reply - this is really useful for refining what info is needed.

Is there someone who would be both willing and able to take the extension through all the steps and do post-deployment maintenance?

  • I suspect @Octfx's answer above answers this better than I can. However I'd like to add that if it's easier for a more streamlined and single-use-case extension to be written by making a version of WikiSEO with most of the functions cut away, I'd be willing to contract them as a developer.

Could you describe the use case for the tags? In other words, what do you expect to happen? There might be a simpler solution than a (new) extension, but that's hard to think about without knowing the wider context.

@Aklapper wrt WikiSEO as the preferred extension -- I was pointed to it, and found it to have an active maintainer (@Octfx).

@Octfx FYI, while developing something totally unrelated, tests suddenly started to fail:

[exception] [error] [{reqId}] {exception_url}   PHPUnit\Framework\Error\Deprecated: Use of ParserOutput::getProperty was deprecated in MediaWiki 1.38. [Called from MediaWiki\Extension\WikiSEO\Hooks\PageHooks::onRevisionDataUpdates in /home/urbanecm/unsynced/gerrit/mediawiki/extensions/WikiSEO/includes/Hooks/PageHooks.php at line 85] {"exception":{},"exception_url":"[no req]","reqId":"26b72845f7211dafe4b2947f","caught_by":"other"}

This also needs to be fixed prior to any potential deployment (and perhaps more deprecated calls exist).

Possible performance issues
  • Unless I'm missing something, PageHooks::onBeforePageDisplay() appears to always trigger a query like select pp_propname, pp_value from page_props where pp_page=XXX any time I display the main page, even though I have memcached enabled on my dev wiki. Always running a DB query on a page view looks like a bad pattern. Is there any reason not to implement a caching mechanism?

Adding a cache should be relatively straightforward and could be done in WikiSEO::setMetadataFromPageProps. I was under the impression that querying the page props was reasonably fast, but I see the issue with hitting the DB on each view.
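A minimal sketch of such a cache layer, in Python with a hypothetical fetch callback (in MediaWiki itself this would more likely be built on an object cache such as WANObjectCache rather than hand-rolled):

```python
import time

class PagePropsCache:
    """Tiny TTL cache so repeated page views skip the page_props query."""

    def __init__(self, fetch_from_db, ttl: int = 3600):
        self.fetch_from_db = fetch_from_db  # called only on a cache miss
        self.ttl = ttl
        self._store = {}   # page_id -> (expiry timestamp, props dict)
        self.db_hits = 0   # for illustration: counts real DB round-trips

    def get(self, page_id: int) -> dict:
        entry = self._store.get(page_id)
        if entry and entry[0] > time.monotonic():
            return entry[1]  # cache hit: no DB query at all
        self.db_hits += 1
        props = self.fetch_from_db(page_id)
        self._store[page_id] = (time.monotonic() + self.ttl, props)
        return props
```

A cache like this turns the per-pageview DB query into a per-TTL query; invalidation on edit (not shown here) would be needed so stale descriptions don't linger for the full TTL.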

  • Similarly, PageHooks::onRevisionDataUpdates() runs DeferredDescriptionUpdate. That class always runs a DELETE and then an INSERT, even though the description might be unchanged after the edit is saved. It's always a good idea to avoid queries on the primary DB server. It's reasonably easy to scale replica queries if needed (by buying more replicas), but you can't scale primary queries -- there's always only one primary server. I recommend running a SELECT on a replica first, comparing with the new description, and only then running an UPDATE or an INSERT. That way, for edits that didn't update the description, the primary DB won't be hit with any queries (by the extension). For edits that do change/set a description, the primary DB gets only one query (instead of the current two per edit).

This is fixed in the latest commit.

Miscellaneous

Docs for loadPagePropsFromDb indicate it limits the properties returned to Validator::$validParams, but that doesn't appear to happen:

$ php maintenance/shell.php --wiki=awiki
Psy Shell v0.10.9 (PHP 7.2.34-23+0~20210701.63+debian10~1.gbpd7cd48 — cli) by Justin Hileman
>>> use MediaWiki\Extension\WikiSEO\WikiSEO;
>>> $seo = new WikiSEO();
=> MediaWiki\Extension\WikiSEO\WikiSEO {#3046}
>>> sudo $seo->loadPagePropsFromDb(60)
<warning>PHP Notice:  unserialize(): Error at offset 0 of 3 bytes in /home/urbanecm/unsynced/gerrit/mediawiki/extensions/WikiSEO/includes/WikiSEO.php on line 235</warning>
=> [
     "wikibase_item" => "Q43",
   ]

The PHP doc was indeed wrong. It is updated in the latest commit, along with a fix for the warning.

Use of ParserOutput::getProperty was deprecated in MediaWiki 1.38.

This too has been fixed. I don't know what the "official way" of handling deprecation notices is, but I've added method_exists checks to the relevant calls (Gerrit checks were successful).
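The guard pattern described here (checking whether a method exists before calling it, so one codebase supports both old and new MediaWiki releases) looks roughly like the following Python analogue of PHP's method_exists; the accessor name getPageProperty is assumed to be the 1.38 replacement for the deprecated getProperty:

```python
def get_page_property(parser_output, name: str):
    """Call the newer accessor when present, otherwise fall back to the
    deprecated one -- an analogue of wrapping the call in PHP's
    method_exists() check."""
    if hasattr(parser_output, "getPageProperty"):
        return parser_output.getPageProperty(name)  # newer API
    return parser_output.getProperty(name)  # pre-deprecation fallback
```

The trade-off of this approach is that the deprecated path silently lingers; an alternative is to raise the extension's minimum supported MediaWiki version and drop the fallback entirely.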

Again, thanks for the comments :)

...While anyone with enough knowledge can download the extension and perform their own security review (for example), only the Wikimedia security team can give an authoritative assessment of the code. All those reviews can easily take several months or even years: for an example of a recent security review, see T269291 (and that was a project backed by a WMF team, and partially taken from an already-reviewed extension).

Well, sort of. The Security-Team will assign a risk level after a security review is performed. The owners of the codebase will then either produce a mitigation plan to reduce the risk rating or have the risk accepted at an appropriate level. If the risk is deemed low, then great, the risk is automatically accepted by the WMF. If it's higher than that, it needs to be accepted by various levels of WMF management prior to any deployment. Unfortunately, the current WMF risk management framework still isn't public, but I did try to summarize the risk ownership section here: T249039#6309061.

Regarding prioritization of reviews, that is discussed within our SOP here. While we'd love to be able to security-review any and all new or significantly-changed code, we do not have the resources to do so. So we prioritize as best we can at the beginning of each quarter and try to review as many likely production-bound projects as we can. I'd also note that T269291 likely isn't the greatest example of a recent security review. That review was performed by a vendor and there were some quality issues with the report that we received. There are plenty of other reviews within our Done column that are a little more typical of what we try to do.

Finally - I'd note that the Security-Team is really trying to get away from performing large, manual security reviews as they are incredibly labor-intensive and scale terribly. We're hopeful that a combination of a proper appsec pipeline for Gitlab and threat-modeling and similar exercises which happen earlier during a project lifecycle will obviate the need for performing the current volume of application security reviews.

So given the above, what are the best next steps? Is it best to work through the issues identified by the Security-Team, or to make a smaller custom extension that can populate <meta name="citation_.*"> tags from Wikidata?