WikipediaExtracts code review
Closed, InvalidPublic

Assigned To

Authored By

	Dereckson
	Nov 2 2016, 2:39 AM

Description

The WikipediaExtracts extension code has been partially committed without following our usual CR +2 rules on Gerrit.

As such, a code review is needed.

Details

Subject	Repo	Branch	Lines +/-
Respond to second code review by Reedy	mediawiki/extensions/WikipediaExtracts	master	+17 -73
Respond to second code review by Reedy	mediawiki/extensions/WikipediaExtracts	master	+10 -48
Respond to Reedy's code review	mediawiki/extensions/WikipediaExtracts	master	+81 -162
Respond to Reedy's code review	mediawiki/extensions/WikipediaExtracts	master	+77 -136
Requested minor fixes	mediawiki/extensions/WikipediaExtracts	master	+113 -57
Requested fixes	mediawiki/extensions/WikipediaExtracts	master	+199 -147
Requested fixes	mediawiki/extensions/WikipediaExtracts	master	+199 -344


Related Objects

Status	Assigned	Task
Invalid	None	T148848 Add the Extension:WikipediaExtracts to the English Wikiversity
Invalid	None	T149765 Deploy WikipediaExtracts extension
Invalid	Sophivorus	T149766 WikipediaExtracts code review

Event Timeline


Methods

Functions are too long; they do too many things.

Please follow https://jeroendedauw.github.io/slides/craftmanship/functions/ to refactor in smaller methods, each doing one thing.

Localisation

Messages are documented, nothing hardcoded
The <span class="error"> logic should be a specific method, using Html::rawElement instead of building HTML through string concatenation (see below for an example from the Cite extension)

includes/Cite.php

$lang = $this->mParser->getOptions()->getUserLangObj();
$dir = $lang->getDir();
…
Html::rawElement(
    'span',
    [
        'class' => 'error mw-ext-cite-error',
        'lang' => $lang->getHtmlCode(),
        'dir' => $dir,
    ],
    $msg
);

Note that mSomething isn't in the code conventions. If the constructor gets the parser object, you can use $parser instead of $mParser, which will be cleaner.

Magic words for the parser tend to be lowercase (invoke for Scribunto, for example). For the French magic word, extraitDepuisWikipédia is probably a better keyword.

Conventions

Spacing looks good to me
We switched to short array syntax. New extensions should follow that.
Some \n would be nice at EOF

A SensioLabs Insight analysis at commit 19c1c174781e gives me the following issues:

Duplicate code

In WikipediaExtracts.php, starting at lines 26 and 94, you have 48 identical lines. That should be solved when you divide the code into small methods. Instead of duplicating code, you generally create a separate function.

Type hint

In the onParserFirstCallInit and onFunctionHook methods, you can type-hint $parser to detect type issues. You've already done that in the onHook method.

Visibility

Explicitly declare as public the functions that currently have no visibility keyword: default visibility differs between programming languages (private in C#, public in PHP, for example).
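
To illustrate the type hint and visibility points together, here is a minimal sketch. The Parser class below is only a stand-in so the snippet runs on its own, WikipediaExtractsSketch is a hypothetical class name, and the TypeError behaviour assumes PHP 7 or later.

```php
<?php
// Stand-in for MediaWiki's Parser class so the sketch runs on its own.
class Parser {
}

// Hypothetical class name, used for illustration only.
class WikipediaExtractsSketch {
	/**
	 * Hook handler with explicit visibility and a type-hinted parameter.
	 *
	 * @param Parser $parser
	 * @return bool
	 */
	public static function onParserFirstCallInit( Parser $parser ) {
		// Registration of the tag and parser function would go here.
		return true;
	}
}

// Passing anything other than a Parser now fails immediately with a TypeError.
try {
	WikipediaExtractsSketch::onParserFirstCallInit( 'not a parser' );
	$caught = false;
} catch ( TypeError $e ) {
	$caught = true;
}
```

With the hint in place, a wrong argument is reported at the call site rather than somewhere deeper in the method.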

Dereckson mentioned this in T149424: Security review the Extension:WikipediaExtracts. Nov 2 2016, 3:04 AM

Code style aside, when I reviewed the code for this extension, it appeared to be designed for external users to re-use Wikipedia content. However, if we're planning to deploy this inside the Wikimedia farm, it needs to be made significantly more robust. For example, if file_get_contents fails, there's no error handling. There's also no debug logging. I'd recommend using MWHttpRequest instead of file_get_contents.

What about batching? If a page contains multiple <wikipediaextract> tags, shouldn't we batch the internal API requests? I also don't see the need for both a tag extension and parser function - just support and implement one.

Code issues:

Don't use extract(), it makes it very hard to review and reason about code
As I had mentioned over email, we actually need to validate that the language code has a Wikipedia
API requests need to set a user-agent and should use &formatversion=2.
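
A rough sketch of the last two points, in plain PHP so it runs without MediaWiki. The function name buildExtractUrl, the allow-list of language codes and the user-agent string are all assumptions made up for illustration; in the extension the same header would be set through whichever HTTP client ends up being used.

```php
<?php
/**
 * Build a TextExtracts API URL for a given Wikipedia language.
 *
 * @param string $language Language code supplied by the user
 * @param string $title Title of the article to extract from
 * @param string[] $knownLanguages Codes known to have a Wikipedia (hypothetical allow-list)
 * @return string
 * @throws InvalidArgumentException If the language has no Wikipedia
 */
function buildExtractUrl( string $language, string $title, array $knownLanguages ): string {
	if ( !in_array( $language, $knownLanguages, true ) ) {
		throw new InvalidArgumentException( "No Wikipedia for language code: $language" );
	}
	$query = http_build_query( [
		'action' => 'query',
		'prop' => 'extracts',
		'titles' => $title,
		'format' => 'json',
		'formatversion' => 2,
	] );
	return "https://$language.wikipedia.org/w/api.php?$query";
}

// Stream context options that set a descriptive user agent (the value is a placeholder).
$contextOptions = [
	'http' => [ 'user_agent' => 'WikipediaExtracts-sketch/0.1 (placeholder contact)' ],
];

$url = buildExtractUrl( 'en', 'Albert Einstein', [ 'en', 'es', 'fr' ] );
```

Rejecting unknown language codes before any request is made also avoids sending traffic to hosts that do not exist.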

@Legoktm If only one (tag extension or parser function) should be used, then favor the parser function, as I've found it much much more useful, because it can be combined with {{PAGENAME}} and other magic words, which is quite important in practice.

I'll help to improve the code following your guidelines asap, but today I'm leaving the city for a few days. Cheers!

Aklapper edited projects, added MediaWiki-extensions-WikipediaExtracts; removed MediaWiki-extensions-General. Nov 2 2016, 1:18 PM

I'm working on this.

Change 319786 had a related patch set uploaded (by Sophivorus):
Requested fixes

https://gerrit.wikimedia.org/r/319786

gerritbot added a project: Patch-For-Review. Nov 4 2016, 4:10 AM

Sophivorus mentioned this in rEWPE7770a9fea5ca: Requested fixes. Nov 4 2016, 4:12 AM

I apologize for the ugly code you had to review. Some things I did wrong were out of ignorance, but most were out of laziness. It should be a bit better now.

One thing I didn't implement is batch requests, because what if different calls to the parser function have different parameters? Should we check for that and do batch requests if they all have the same parameters, and multiple requests otherwise?
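
One possible way to approach the question above is to group calls by their parameter set, so each distinct set becomes one batch. This is only a sketch: groupCallsIntoBatches and the shape of the $calls array are hypothetical, and whether the target API accepts several titles with the chosen parameters would still need to be verified.

```php
<?php
/**
 * Group parser function calls so that calls sharing the same parameters
 * can be served by a single API request.
 *
 * @param array[] $calls Each call is [ 'title' => string, 'params' => array ]
 * @return array[] One entry per batch: [ 'titles' => string[], 'params' => array ]
 */
function groupCallsIntoBatches( array $calls ): array {
	$batches = [];
	foreach ( $calls as $call ) {
		$params = $call['params'];
		// Sort by key so that parameter order does not split otherwise identical calls.
		ksort( $params );
		$key = json_encode( $params );
		if ( !isset( $batches[$key] ) ) {
			$batches[$key] = [ 'titles' => [], 'params' => $params ];
		}
		$batches[$key]['titles'][] = $call['title'];
	}
	return array_values( $batches );
}

$batches = groupCallsIntoBatches( [
	[ 'title' => 'Earth', 'params' => [ 'intro' => true, 'chars' => 300 ] ],
	[ 'title' => 'Mars', 'params' => [ 'chars' => 300, 'intro' => true ] ],
	[ 'title' => 'Venus', 'params' => [ 'sentences' => 2 ] ],
] );
```

Here Earth and Mars end up in one batch and Venus in another, so three calls would need two requests instead of three.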

I'm removing myself as the assignee for the task because I think ultimately someone from the WMF must take care, right? Cheers.

Sophivorus mentioned this in rEWPE4fb84f60c765: Requested fixes. Nov 4 2016, 4:24 AM

Sophivorus mentioned this in rEWPEe8604b73e2f7: Requested fixes. Nov 6 2016, 7:06 PM

Sophivorus mentioned this in rEWPE52c7173750fa: Requested fixes. Jan 25 2017, 12:40 PM

Well clearly if I don't take care, no one will.

Change 319786 abandoned by Sophivorus:
Requested fixes

https://gerrit.wikimedia.org/r/319786

Change 334050 had a related patch set uploaded (by Sophivorus):
Requested fixes

https://gerrit.wikimedia.org/r/334050

Sophivorus mentioned this in rEWPE7a783209ec1e: Requested fixes. Jan 25 2017, 12:48 PM

This extension was requested by the English and Spanish Wikiversity months ago. Other projects are also likely to want it, once it's available and has been presented to them. There's an active and able volunteer willing to make the requested improvements until it's fit for production. All I need is someone from the WMF who can guide me and ultimately approve and deploy the extension.

See https://phabricator.wikimedia.org/T149765 for subtasks and https://www.mediawiki.org/wiki/Review_queue for the related checklist.
See e.g. T149424#2762206 for one current blocker.

Change 334050 merged by Sophivorus:
Requested fixes

https://gerrit.wikimedia.org/r/334050

@Aklapper The comment you linked to refers to this task, so I'm not sure what you mean by it. I've carefully gone through the checklist and I think there's nothing else I can do at this point. Yesterday I merged the changes I did to the extension based on the code review by @Legoktm and today I updated the extension documentation accordingly. I think the next step would be a second code review so that I can fix any remaining issues, close this task and move on to the security review. Looking forward to it!

@Sophivorus Are you confident you've addressed Legoktm's concerns about robustness?

@Dereckson Absolutely, I just double-checked everything.

In T149766#3128431, @Sophivorus wrote:

@Dereckson Absolutely, I just double-checked everything.

Okay, thanks, I updated the security review information accordingly.

Reedy removed a project: Patch-For-Review. Oct 4 2017, 5:01 PM

Other little things..

There should be one class per PHP file
The "and" operator should not be used; use && instead
Use elseif, not else if
Type hints for member variables and function variables should be defined
Enable phpcs

Change 382507 had a related patch set uploaded (by Sophivorus; owner: Sophivorus):
[mediawiki/extensions/WikipediaExtracts@master] Requested minor fixes

https://gerrit.wikimedia.org/r/382507

gerritbot added a project: Patch-For-Review. Oct 5 2017, 6:13 PM

@Reedy Thanks for the review! I've done the changes you requested, check them out!

Sophivorus mentioned this in rEWPE1e73a8e053b8: Requested minor fixes. Oct 5 2017, 7:35 PM

Sophivorus mentioned this in rEWPE9ac0688e93be: Requested minor fixes.

Sophivorus mentioned this in rEWPE0b55696e600d: Requested minor fixes. Oct 5 2017, 8:35 PM

I removed the composer.lock file and fixed an issue Jenkins was complaining about.

Change 382507 merged by jenkins-bot:
[mediawiki/extensions/WikipediaExtracts@master] Requested minor fixes

https://gerrit.wikimedia.org/r/382507

Umherirrender mentioned this in rEWPEfd84184bf8c6: Requested minor fixes. Nov 2 2017, 9:52 AM

Umherirrender mentioned this in rEWPEc4927e861551: Requested minor fixes.

Reedy removed a project: Patch-For-Review. Nov 25 2017, 11:43 PM

> var_dump( Language::isValidCode( 'aln' ) );
bool(true)

> var_dump( Language::isValidCode( 'en-gb' ) );
bool(true)

> var_dump( Language::isValidCode( 'notalanguage' ) );
bool(true)

The extension shouldn't tell me it's not a valid language when it is. It should differentiate between something that's not a language code (our check is very vague) and a non-existent wiki. invalid-language seems to be reused incorrectly.

		$status = $request->execute();
		if ( !$status->isOK() ) {
			if ( $status->getValue() === 100 ) {
				throw new WikipediaExtractsError( 'invalid-language', self::$wikipediaLanguage );
			}
			throw new WikipediaExtractsError( 'error' );
		}
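
A sketch of what separating the two cases could look like, in plain PHP. The function name, the regular expression, the list of Wikipedia language codes and the three return values are assumptions for illustration, not existing message keys.

```php
<?php
/**
 * Tell apart a malformed language code from a well-formed code that has no Wikipedia.
 *
 * @param string $code Language code supplied by the user
 * @param string[] $wikipediaLanguages Codes known to have a Wikipedia (hypothetical list)
 * @return string 'ok', 'invalid-code' or 'no-wiki' (hypothetical keys)
 */
function classifyLanguage( string $code, array $wikipediaLanguages ): string {
	// Assumed syntax check: lowercase letters, optionally followed by hyphenated subtags.
	if ( !preg_match( '/^[a-z]+(-[a-z0-9]+)*$/', $code ) ) {
		return 'invalid-code';
	}
	if ( !in_array( $code, $wikipediaLanguages, true ) ) {
		return 'no-wiki';
	}
	return 'ok';
}

$known = [ 'en', 'es', 'aln' ];
$results = [
	classifyLanguage( 'aln', $known ),
	classifyLanguage( 'notalanguage', $known ),
	classifyLanguage( 'not a language!', $known ),
];
```

Under these assumptions, 'notalanguage' passes the syntax check (as it does with Language::isValidCode above) but is reported as having no wiki, while only genuinely malformed input is reported as an invalid code.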

Also, there should be a space between the text and the "credits", otherwise it looks silly

Screen Shot 2017-11-26 at 00.00.52.png (76×1 px, 33 KB)

			if ( $wgWikipediaExtractsAddCredits ) {
				$html .= self::getCredits();
			}

I would suspect the above might not be so RTL friendly either, with its positioning... Probably?

I note that not all options seem to be documented

			'exchars' => self::getParam( 'chars' ),
			'exsentences' => self::getParam( 'sentences' ),
			'exlimit' => self::getParam( 'limit' ),
			'exintro' => self::getParam( 'intro' ),
			'explaintext' => self::getParam( 'plaintext' ),
			'exsectionformat' => self::getParam( 'sectionformat' ),
			'excontinue' => self::getParam( 'continue' ),
			'exvariant' => self::getParam( 'variant' ),

I would suspect passing them when they're not needed is kinda pointless too
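
Something along these lines would keep the request limited to the options that were actually set. It is a sketch only: removeUnsetParams is a hypothetical helper, and it assumes that an option which was not supplied is represented as null.

```php
<?php
/**
 * Drop API parameters that were not supplied, so only meaningful ones are sent.
 *
 * @param array $query Candidate API parameters; null means "not supplied" (assumed convention)
 * @return array
 */
function removeUnsetParams( array $query ): array {
	return array_filter( $query, static function ( $value ) {
		return $value !== null;
	} );
}

$query = removeUnsetParams( [
	'exchars' => '300',
	'exsentences' => null,
	'exintro' => true,
	'excontinue' => null,
] );
```

The explicit null comparison matters, because array_filter without a callback would also drop legitimate values such as '0'.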

			$pair = explode( '=', $param, 2 );
			if ( count( $pair ) === 2 ) {
				$name = trim( $pair[0] );
				$value = trim( $pair[1] );
				$array[ $name ] = $value;
			} elseif ( count( $pair ) === 1 ) {
				$name = trim( $pair[0] );
				$array[ $name ] = true;
			}

Trim can just be done in one place, with an array map. You also don't need the temporary variables

			$pair = array_map( 'trim', explode( '=', $param, 2 ) );
			if ( count( $pair ) === 2 ) {
				$array[ $pair[0] ] = $pair[1];
			} elseif ( count( $pair ) === 1 ) {
				$array[ $pair[0] ] = true;
			}
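
For context, here is how the simplified loop body might sit inside a complete function. The signature is an assumption, since the full parseParams method is not shown in this task. The sketch also folds the second branch into a plain else, because explode() with a non-empty separator and a limit of 2 returns either one or two elements.

```php
<?php
/**
 * Turn raw "name=value" strings into an associative array.
 *
 * @param string[] $params Raw parameters as passed to the parser function
 * @return array
 */
function parseParams( array $params ): array {
	$array = [];
	foreach ( $params as $param ) {
		$pair = array_map( 'trim', explode( '=', $param, 2 ) );
		if ( count( $pair ) === 2 ) {
			$array[ $pair[0] ] = $pair[1];
		} else {
			// No "=" present: treat the bare name as a boolean flag.
			$array[ $pair[0] ] = true;
		}
	}
	return $array;
}

$parsed = parseParams( [ ' chars = 300 ', 'intro', 'variant=zh-tw' ] );
```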

I kinda feel like we need some caching on the onward web requests too to other wikipedias...
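
As a very rough illustration of the idea, the sketch below puts an in-memory cache in front of an injectable fetcher. CachedFetcher is a hypothetical class, the fetcher is a fake that stands in for the real HTTP request, and a cache like this only lives for a single PHP request; a real deployment would need a shared cache with an expiry.

```php
<?php
/**
 * Minimal in-memory cache in front of an HTTP fetcher.
 */
class CachedFetcher {
	/** @var callable */
	private $fetch;
	/** @var string[] */
	private $cache = [];

	public function __construct( callable $fetch ) {
		$this->fetch = $fetch;
	}

	public function get( string $url ): string {
		// Only call the underlying fetcher the first time a URL is requested.
		if ( !array_key_exists( $url, $this->cache ) ) {
			$this->cache[$url] = ( $this->fetch )( $url );
		}
		return $this->cache[$url];
	}
}

// A fake fetcher that counts how often it is called, standing in for the real HTTP request.
$calls = 0;
$fetcher = new CachedFetcher( static function ( string $url ) use ( &$calls ): string {
	$calls++;
	return "response for $url";
} );
$first = $fetcher->get( 'https://en.wikipedia.org/w/api.php?titles=Earth' );
$second = $fetcher->get( 'https://en.wikipedia.org/w/api.php?titles=Earth' );
```

A page that embeds the same extract twice would then trigger one onward request rather than two.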

https://www.mediawiki.org/wiki/Extension:WikipediaExtracts#Crediting_Wikipedia isn't this irrelevant now?

Reedy lowered the priority of this task from High to Medium. Nov 26 2017, 1:34 AM

Change 393440 had a related patch set uploaded (by Sophivorus; owner: Sophivorus):
[mediawiki/extensions/WikipediaExtracts@master] Respond to Reedy's code review

https://gerrit.wikimedia.org/r/393440

gerritbot added a project: Patch-For-Review. Nov 26 2017, 1:19 PM

Sophivorus mentioned this in rEWPE3b8ec2d0d1dc: Respond to Reedy's code review. Nov 26 2017, 1:19 PM

@Reedy Thanks for your detailed code review! My latest patch set should respond to all your concerns in one way or another.

The "invalid language" issue has been bugging me for a while, so I thought of a definitive solution: the language is no longer guessed but has to be explicitly set as part of the URL of the API of the target wiki, in a new config option called $WikipediaExtractsAPI. This is because I've found (through reasoning and experience) that almost every wiki will extract content exclusively from the Wikipedia in its own language. It makes little to no sense to allow wikis to extract content from other Wikipedias, and it brings language validation troubles. Furthermore, by adding the $WikipediaExtractsAPI config option, we open the door for extracting content from wikis other than Wikipedia, such as Wiktionary! Overall, I think that this change in approach is a great improvement. In future iterations, I may expand the config option to allow multiple wikis, like so:

$WikipediaExtractsAPI = [
    'w' => 'https://en.wikipedia.org/w/api.php',
    'v' => 'https://en.wikiversity.org/w/api.php'
];

Regarding the credits span, I removed this functionality in favor of adding the credits through templates, as explained in the docs. I wrote about it in the docs some days ago, before updating the code (bad practice) so it may have confused you. Sorry!

The rest of the changes should be self-explanatory, I hope. Any questions just let me know!

I wonder if we can reuse the sites table/interwiki map here...

In T149766#3787294, @Reedy wrote:

I wonder if we can reuse the sites table/interwiki map here...

Thought about it, but the problem is that the target wiki has to have the TextExtracts extension enabled, and there's no way to tell that from the table/interwiki map.

'wmgEnableTextExtracts' => [
	'default' => true,
],

It's already on all WMF wikis ;)

In T149766#3787296, @Reedy wrote:
'wmgEnableTextExtracts' => [
	'default' => true,
],
It's already on all WMF wikis ;)

Good to know! However, I would leave this improvement for a future iteration. It's been more than a year already and I'd really like to get this deployed. Ever heard about agile development? ;-)

Also, the table/interwiki map contains many non-WMF wikis.

Agile Development != Agile Deployment :P

Ok, but still, I think that such an improvement is better left for a future iteration. There's no real need for it and it's not easy to implement.

(resetting Priority to reflect reality)

You probably need to update https://www.mediawiki.org/wiki/Extension:WikipediaExtracts again as I nuked your docs as they seemed like they were out of date...

Change 393440 abandoned by Sophivorus:
Respond to Reedy's code review

Reason:
Doing manual rebase...

https://gerrit.wikimedia.org/r/393440

Change 394825 had a related patch set uploaded (by Sophivorus; owner: Sophivorus):
[mediawiki/extensions/WikipediaExtracts@master] Respond to Reedy's code review

https://gerrit.wikimedia.org/r/394825

In T149766#3787592, @Reedy wrote:

You probably need to update https://www.mediawiki.org/wiki/Extension:WikipediaExtracts again as I nuked your docs as they seemed like they were out of date...

I will do it once the current patch set is reviewed and merged. Thanks!

Sophivorus mentioned this in rEWPE790c97e5aa65: Respond to Reedy's code review. Dec 3 2017, 2:16 AM

Sophivorus mentioned this in rEWPE1b6cf5b70cf1: Respond to Reedy's code review.

Change 394825 merged by jenkins-bot:
[mediawiki/extensions/WikipediaExtracts@master] Respond to Reedy's code review

https://gerrit.wikimedia.org/r/394825

Reedy removed a project: Patch-For-Review. Dec 7 2017, 7:50 PM

	/**
	 * Get a span with a link to Wikipedia
	 */
	private static function getCredits() {
		$title = Title::newFromText( self::$wikipediaTitle );
		$url = 'https://' . self::$wikipediaLanguage . '.wikipedia.org/wiki/' . $title->getPartialUrl();
		return Html::rawElement(
			'small', [
				'lang' => self::$contentLanguage->getHtmlCode(),
				'dir' => self::$contentLanguage->getDir()
			],
			wfMessage( 'wikipediaextracts-credits', $url )->inContentLanguage()->plain()
		);
	}

Title::newFromText can/will return null. Should probably guard against that.

The function doesn't return a span, so the comment is wrong. It should be annotated with @return string too

{{#WikipediaExtract:https://en.wikipedia.org/w/index.php?title=Title_of_the_article}} won't work... But do we really care?

In parseParams, $params is not returned. Generally, across the code, you don't need to put the return variable name in the @return part.

https://www.mediawiki.org/wiki/Extension:WikipediaExtracts#Usage - Title is a valid parameter, but it's not documented... It kinda seems useless if the unparameterised string is the title, since if you set both, one is apparently going to win

I'm trying to decide if the URL parsing is even really necessary. It doesn't seem like useful complexity. Now is the time to remove pointless complexity, forcing people to use things properly when they're implemented

Change 399999 had a related patch set uploaded (by Sophivorus; owner: Sophivorus):
[mediawiki/extensions/WikipediaExtracts@master] Respond to second code review by Reedy

https://gerrit.wikimedia.org/r/399999

gerritbot added a project: Patch-For-Review. Dec 23 2017, 6:48 PM

@Reedy Hi, I like your code reviews, they are leading me to simplify the code considerably, thanks!

Some of your comments seemed to be based on a previous version of the code. The getCredits() method has been totally removed, because its functionality is better served by adding the credits via a template. I just updated the documentation of the extension to cover this. You also mention that I need not put the return variable name in the @return part. Well, I'm not sure if I ever did that, but in any case the current code doesn't mention the return variable name in the @return part.

I agree there's no need for the URL parsing or the "title" parameter. This led me to remove other things that became unnecessary, such as the getWikipediaTitle() method, and the $parser and $wikipediaTitle static properties.

Thanks again for the review!

Sophivorus mentioned this in rEWPE35ec0a1398d2: Respond to second code review by Reedy. Dec 23 2017, 6:54 PM

Sophivorus mentioned this in rEWPEefc328e2e8b2: Respond to second code review by Reedy. Dec 23 2017, 7:26 PM

Can the parser function extract contents from a user-chosen section? For example, it currently can fetch full text, a number of sentences or the intro. Instead of intro it could fetch sections, leading section included:

replacing

{{#WikipediaExtract: Title of the article | intro = true}}

with

{{#WikipediaExtract: Title of the article | section = 0}} /* Leading section */

Parameter "intro=true" could be an alias for "section=0".

Other sections would be like this one:

{{#WikipediaExtract: Title of the article | section = 1}} /* First named section "== ... ==" */

But there would be a problem. What would count as a section for the extension? Level 2 "== ... =="? Level 3 "=== ... ==="? All levels? If the parser function could only request a certain section level, could subsections be fetched along with their parent section? How should a page with a non-standard section layout be handled (such as one using level 1 sections "= ... =" or no sections at all)?

Could it be set to fetch contents from several wikis?
For example:

$wgWikipediaExtractsAPI = array ('//en.wikipedia.org/w/api.php',
                                 '//en.wiktionary.org/w/api.php',
                                 '//en.wikiquote.org/w/api.php'
);

But in that case, there should be a way to resolve conflicts between pages with the same name on several wikis.

@Zerabat WikipediaExtracts uses the API introduced by the TextExtracts extension, which doesn't include an option for extracting sections. Therefore, if you want to implement this functionality, you should probably look towards extending the TextExtracts API. Once that is done, extending WikipediaExtracts to take advantage of the extended API would be trivial. Now there's also the PageSummary API, which may introduce some other way of adding this functionality. Not sure though.

Regarding using contents from several wikis, it's a natural and desired addition to the current functionality. There are a couple of ways to solve the issue you mention, such as giving priority to wikis higher on the array, or giving a key to each API in the array, and requiring the user to input the key for the desired wiki into the parser function. However, all this extra functionality shouldn't be necessary for approving the current version of the extension. I've been waiting for it to get approved for MORE THAN TWO YEARS, so I'm not about to introduce any new functionality that may raise new concerns or bugs and delay the approval any further. Thanks for the interest and the suggestion though!
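
For a future iteration, the two conflict-resolution ideas mentioned above could be combined roughly as follows. This is a sketch of a possible design, not something the extension does: resolveApi and the keyed config shape are hypothetical.

```php
<?php
/**
 * Pick the API URL for a parser function call.
 *
 * @param string[] $apis Map of user-facing key to API URL (hypothetical config shape)
 * @param string|null $key Key supplied by the user, if any
 * @return string
 * @throws InvalidArgumentException If the key is unknown or no API is configured
 */
function resolveApi( array $apis, ?string $key = null ): string {
	if ( $apis === [] ) {
		throw new InvalidArgumentException( 'No API configured' );
	}
	if ( $key === null ) {
		// No key given: the wiki listed first takes priority.
		return reset( $apis );
	}
	if ( !isset( $apis[$key] ) ) {
		throw new InvalidArgumentException( "Unknown wiki key: $key" );
	}
	return $apis[$key];
}

$apis = [
	'w' => 'https://en.wikipedia.org/w/api.php',
	'v' => 'https://en.wikiversity.org/w/api.php',
];
$default = resolveApi( $apis );
$wikiversity = resolveApi( $apis, 'v' );
```

An explicit key removes the ambiguity between same-named pages, while the first-entry default keeps the single-wiki case as simple as it is today.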

I understand your input. My suggestions were for a future version of the extension(s), not the current deployment, which has been feature-frozen in order to become stable.

Can anyone with the skill and permissions approve and deploy this extension? It's been more than two years already. Is this how the Wikimedia Foundation integrates features developed by the community?

@Sophivorus: That looks like a question for T149765: Deploy WikipediaExtracts extension?

Also looks like https://gerrit.wikimedia.org/r/#/c/399999/ is still awaiting a review :(

Change 399999 abandoned by Sophivorus:
Respond to second code review by Reedy

Reason:
Cannot merge anymore. Manually rebasing this change.

https://gerrit.wikimedia.org/r/399999

Change 485366 had a related patch set uploaded (by Sophivorus; owner: Sophivorus):
[mediawiki/extensions/WikipediaExtracts@master] Respond to second code review by Reedy

https://gerrit.wikimedia.org/r/485366

Sophivorus mentioned this in rEWPEb6a406888cc8: Respond to second code review by Reedy. Jan 19 2019, 2:51 PM

Change 485366 merged by Sophivorus:
[mediawiki/extensions/WikipediaExtracts@master] Respond to second code review by Reedy

https://gerrit.wikimedia.org/r/485366

jijiki mentioned this in rEWPEb0d1f9ed43de: Respond to second code review by Reedy. Apr 5 2019, 11:56 PM

Aklapper mentioned this in T149765: Deploy WikipediaExtracts extension. May 4 2019, 11:11 PM

After some discussion with the WikiJournal user group at Wikiversity, I decided to generalize this extension into InterwikiExtracts, archive WikipediaExtracts, and in a few days start a new task for enabling InterwikiExtracts on Wikiversity, this time with the support of the WikiJournal user group. This task therefore no longer makes sense and I'm closing it.

	F11000409: Screen Shot 2017-11-26 at 00.00.52.png
	Nov 26 2017, 12:04 AM
