Page MenuHomePhabricator

WikipediaExtracts code review
Open, NormalPublic

Description

The WikipediaExtracts extension code has been partially committed without following our CR +2 usual rules on Gerrit.

As such, a code review is needed.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Dereckson added a comment.EditedNov 2 2016, 3:02 AM

Methods

Functions are too long, they do too much things.

Please follow https://jeroendedauw.github.io/slides/craftmanship/functions/ to refactor in smaller methods, each doing one thing.

Localisation

  • Messages are documented, nothing hardcoded
  • The <span class="error"> logic should be a specific method, using Html::rawElement instead to try to build HTML as string concatenation (see below for an example from the Cite extension)
includes/Cite.php
$lang = $this->mParser->getOptions()->getUserLangObj();
$dir = $lang->getDir();

Html::rawElement(
    'span',
    [
        'class' => 'error mw-ext-cite-error',
        'lang' => $lang->getHtmlCode(),
        'dir' => $dir,
    ],
    $msg
);

Note mSomething isn't in the code convention. If constructor gets the parser object, you can use $parser instead of $mParser, that will be more clean.

Magic words for parser tend to be lowercase (invoke for Scribunto for example). For French magic, extraitDepuisWikipédia is probably a better keyword.

Conventions

  • Spacing looks good to me
  • We switched to short array syntax. New extensions should follow that.
  • Some \n would be nice at EOF

A SensioLabs Insight analysis at commit 19c1c174781e gives me the following issues:

Duplicate code

In WikipediaExtracts.php, from line 26 and 94, you've 48 identical lines. That should be solved when you divide the code in small methods. Instead to duplicate code, you generally create a separate function.

Type hint

  • In the onParserFirstCallInit and onFunctionHook methods, you can typehint $parser to detect type issues. You've already done that on onHook method.

Visibility

Declare the functions without visibility keyword as public: several programming languages have different default visibility: private for C#, public for PHP for example.

Legoktm added a subscriber: Legoktm.Nov 2 2016, 3:20 AM

Code-style aside, when I reviewed the code for this extension, it appeared to be designed for external users to re-use Wikipedia content. However if we're planning to deploy this inside the Wikimedia farm, it needs to be made significantly more robust. For example, if file_get_contents fails, there's no error handling. There's also no debugging logging. I'd recommend using MWHttpRequest instead of file_get_contents.

What about batching? If a page contains multiple <wikipediaextract> tags, shouldn't we batch the internal API requests? I also don't see the need for both a tag extension and parser function - just support and implement one.

Code issues:

  • Don't use extract(), it makes it very hard to review and reason about code
  • As I had mentioned over email, we actually need to validate that the language code has a Wikipedia
  • API requests need to set a user-agent and should use &formatversion=2.

@Legoktm If only one (tag extension or parser function) should be used, then favor the parser function, as I've found it much much more useful, because it can be combined with {{PAGENAME}} and other magic words, which is quite important in practice.

I'll help to improve the code following your guidelines asap, but today I'm leaving the city for a few days. Cheers!

I'm working on this.

Change 319786 had a related patch set uploaded (by Sophivorus):
Requested fixes

https://gerrit.wikimedia.org/r/319786

Sophivorus removed Sophivorus as the assignee of this task.EditedNov 4 2016, 4:14 AM

I apologize for the ugly code you had to review. Some things I did wrong were out of ignorance, but most were out of laziness. It should be a bit better now.

One thing I didn't implement are batch requests. Because what if different calls to the parser function have different parameters? Should we check for that and do batch requests if they all have the same parameters, and multiple requests otherwise?

I'm removing myself as the assignee for the task because I think ultimately someone from the WMF must take care, right? Cheers.

Well clearly if I don't take care, no one will.

Change 319786 abandoned by Sophivorus:
Requested fixes

https://gerrit.wikimedia.org/r/319786

Change 334050 had a related patch set uploaded (by Sophivorus):
Requested fixes

https://gerrit.wikimedia.org/r/334050

Sophivorus triaged this task as High priority.EditedJan 25 2017, 12:50 PM

This extension was requested by the English and Spanish Wikiversity months ago. Other projects are also likely to want it, once it's available and has been presented to them. There's an active and able volunteer willing to make the requested improvements until it's fit for production. All I need is someone from the WMF who can guide me and ultimately approve and deploy the extension.

See https://phabricator.wikimedia.org/T149765 for subtasks and https://www.mediawiki.org/wiki/Review_queue for the related checklist.
See e.g. T149424#2762206 for one current blocker.

Change 334050 merged by Sophivorus:
Requested fixes

https://gerrit.wikimedia.org/r/334050

Sophivorus added a comment.EditedJan 29 2017, 3:10 AM

@Aklapper The comment you linked to refers to this task, so I'm not sure what you mean by it. I've carefully gone through the checklist and I think there's nothing else I can do at this point. Yesterday I merged the changes I did to the extension based on the code review by @Legoktm and today I updated the extension documentation accordingly. I think the next step would be a second code review so that I can fix any remaining issues, close this task and move on to the security review. Looking forward to it!

@Sophivorus Are you confident you've addressed Legoktm concerns of robustness?

Sophivorus added a comment.EditedMar 24 2017, 3:59 PM

@Dereckson Absolutely, I just double-checked everything.

@Dereckson Absolutely, I just double-checked everything.

Okay, thanks, I updated the security review information accordingly.

Reedy added a subscriber: Reedy.Oct 4 2017, 5:12 PM

Other little things..

  • There should be one class per PHP file
  • and should not be used, use &&
  • Use elseif not else if
  • Type hints for member variables and function variables should be defined
  • Enable phpcs

Change 382507 had a related patch set uploaded (by Sophivorus; owner: Sophivorus):
[mediawiki/extensions/WikipediaExtracts@master] Requested minor fixes

https://gerrit.wikimedia.org/r/382507

@Reedy Thanks for the review! I've done the changes you requested, check them out!

I removed the composer.lock file and fixed an issue Jenkins was complaining about.

Change 382507 merged by jenkins-bot:
[mediawiki/extensions/WikipediaExtracts@master] Requested minor fixes

https://gerrit.wikimedia.org/r/382507

> var_dump( Language::isValidCode( 'aln' ) );
bool(true)

> var_dump( Language::isValidCode( 'en-gb' ) );
bool(true)

> var_dump( Language::isValidCode( 'notalanguage' ) );
bool(true)

The extension shouldn't tell me it's not a valid language, when it is. It should differentiate between what's not a language code (our check is very vague, and a not existing wiki). invalid-language seems to be reused incorrectly

		$status = $request->execute();
		if ( !$status->isOK() ) {
			if ( $status->getValue() === 100 ) {
				throw new WikipediaExtractsError( 'invalid-language', self::$wikipediaLanguage );
			}
			throw new WikipediaExtractsError( 'error' );
		}

Also, there should be a space between the text and the "credits", otherwise it looks silly

			if ( $wgWikipediaExtractsAddCredits ) {
				$html .= self::getCredits();
			}

I would suspect the above might not be so RTL friendly too with it's positioning.. Probably?

I note all options don't seem to be documented

			'exchars' => self::getParam( 'chars' ),
			'exsentences' => self::getParam( 'sentences' ),
			'exlimit' => self::getParam( 'limit' ),
			'exintro' => self::getParam( 'intro' ),
			'explaintext' => self::getParam( 'plaintext' ),
			'exsectionformat' => self::getParam( 'sectionformat' ),
			'excontinue' => self::getParam( 'continue' ),
			'exvariant' => self::getParam( 'variant' ),

I would suspect passing them when they're not needed is kinda pointless too

			$pair = explode( '=', $param, 2 );
			if ( count( $pair ) === 2 ) {
				$name = trim( $pair[0] );
				$value = trim( $pair[1] );
				$array[ $name ] = $value;
			} elseif ( count( $pair ) === 1 ) {
				$name = trim( $pair[0] );
				$array[ $name ] = true;
			}

Trim can just be done in once place, with an array map. Also don't need temporary variables

			$pair = array_map( 'trim', explode( '=', $param, 2 ) );
			if ( count( $pair ) === 2 ) {
				$array[ $pair[0] ] = $pair[1];
			} elseif ( count( $pair ) === 1 ) {
				$array[ $pair[0] ] = true;
			}

I kinda feel like we need some caching on the onward web requests too to other wikipedias...

Reedy lowered the priority of this task from High to Normal.Nov 26 2017, 1:34 AM

Change 393440 had a related patch set uploaded (by Sophivorus; owner: Sophivorus):
[mediawiki/extensions/WikipediaExtracts@master] Respond to Reedy's code review

https://gerrit.wikimedia.org/r/393440

Sophivorus raised the priority of this task from Normal to High.EditedNov 26 2017, 1:29 PM

@Reedy Thanks for your detailed code review! My latest patch set should respond to all your concerns in one way or another.

The "invalid language" issue has been bugging me for a while, so I thought of a definitive solution: the language is no longer guessed but has to be explicitly set as part of the URL of the API of the target wiki, in a new config option called $WikipediaExtractsAPI. This is because I've found (through reasoning and experience) that almost every wiki will extract content exclusively from the Wikipedia in its own language. It makes little to no sense to allow wikis to extract content from other Wikipedias, and it brings language validation troubles. Furthermore, by adding the $WikipediaExtractsAPI config option, we open the door for extracting content from wikis other than Wikipedia, such as Wiktionary! Overall, I think that this change in approach is a great improvement. In future iterations, I may expand the config option to allow multiple wikis, like so:

$WikipediaExtractsAPI = [
    'w' => 'https://en.wikipedia.org/w/api.php',
    'v' => 'https://en.wikiversity.org/w/api.php'
];

Regarding the credits span, I removed this functionality in favor of adding the credits through templates, as explained in the docs. I wrote about it in the docs some days ago, before updating the code (bad practice) so it may have confused you. Sorry!

The rest of the changes should be self-explanatory, I hope. Any questions just let me know!

Reedy added a comment.Nov 26 2017, 1:33 PM

I wonder if we can reuse the sites table/interwiki map here...

Sophivorus added a comment.EditedNov 26 2017, 1:34 PM

I wonder if we can reuse the sites table/interwiki map here...

Thought about it, but the problem is that the target wiki has to have the TextExtracts extension enabled, and there's no way to tell that from the table/interwiki map.

Reedy added a comment.Nov 26 2017, 1:40 PM
'wmgEnableTextExtracts' => [
	'default' => true,
],

It's already on all WMF wikis ;)

Sophivorus added a comment.EditedNov 26 2017, 1:42 PM
'wmgEnableTextExtracts' => [
	'default' => true,
],

It's already on all WMF wikis ;)

Good to know! However, I would leave this improvement for a future iteration. It's been more than a year already and I'd really like to get this deployed. Ever head about agile development? ;-)

Also, the table/interwiki map contains many non-WMF wikis.

Reedy added a comment.Nov 26 2017, 1:49 PM

Agile Development != Agile Deployment :P

Ok, but still, I think that such an improvement is better left for a future iteration. There's no real need for it and it's not easy to implement.

Aklapper lowered the priority of this task from High to Normal.Nov 26 2017, 6:13 PM

(resetting Priority to reflect reality)

Reedy added a comment.Nov 26 2017, 8:37 PM

You probably need to update https://www.mediawiki.org/wiki/Extension:WikipediaExtracts again as I nuked your docs as they seemed like they were out of date...

Change 393440 abandoned by Sophivorus:
Respond to Reedy's code review

Reason:
Doing manual rebase...

https://gerrit.wikimedia.org/r/393440

Change 394825 had a related patch set uploaded (by Sophivorus; owner: Sophivorus):
[mediawiki/extensions/WikipediaExtracts@master] Respond to Reedy's code review

https://gerrit.wikimedia.org/r/394825

You probably need to update https://www.mediawiki.org/wiki/Extension:WikipediaExtracts again as I nuked your docs as they seemed like they were out of date...

I will do it once the current patch set is reviewed and merged. Thanks!

Change 394825 merged by jenkins-bot:
[mediawiki/extensions/WikipediaExtracts@master] Respond to Reedy's code review

https://gerrit.wikimedia.org/r/394825

Reedy added a comment.Dec 20 2017, 4:36 PM
	/**
	 * Get a span with a link to Wikipedia
	 */
	private static function getCredits() {
		$title = Title::newFromText( self::$wikipediaTitle );
		$url = 'https://' . self::$wikipediaLanguage . '.wikipedia.org/wiki/' . $title->getPartialUrl();
		return Html::rawElement(
			'small', [
				'lang' => self::$contentLanguage->getHtmlCode(),
				'dir' => self::$contentLanguage->getDir()
			],
			wfMessage( 'wikipediaextracts-credits', $url )->inContentLanguage()->plain()
		);
	}

Title::newFromText can/will return null. Should probably guard against that.

The function doesn't return a span. Should be annotated @return string too

{{#WikipediaExtract:https://en.wikipedia.org/w/index.php?title=Title_of_the_article}} won't work... But do we really care?

In parseParams, $params is not returned. Generally, across the code, you don't need to put the return variable name in the @return part.

https://www.mediawiki.org/wiki/Extension:WikipediaExtracts#Usage - Title is a valid parameter, not documented... Kinda seems useless if the unparameterised string is the title. As if you set both, one is apparently going to win

I'm trying to decide if the url parsing is even really necessary. Doesn't seem useful complexity. Now is the time to remove pointless complexity, forcing people to use things properly when they're implemented

Change 399999 had a related patch set uploaded (by Sophivorus; owner: Sophivorus):
[mediawiki/extensions/WikipediaExtracts@master] Respond to second code review by Reedy

https://gerrit.wikimedia.org/r/399999

Sophivorus added a comment.EditedDec 23 2017, 6:53 PM

@Reedy Hi, I like your code reviews, they are leading me to simplify the code considerably, thanks!

Some of your comments seemed to be based on a previous version of the code. The getCredits() method has been totally removed, because its functionality is better served by adding the credits via a template. I just updated the documentation of the extension to cover this. You also mention that I need not put the return variable name in the @return part. Well, I'm not sure if I ever did that, but in any case the current code doesn't mention the return variable name in the @return part.

I agree there's no need for the URL parsing or the "title" parameter. This led me to remove other things that became unnecessary, such as the getWikipediaTitle() method, and the $parser and $wikipediaTitle static properties.

Thanks again for the review!

Zerabat added a subscriber: Zerabat.EditedMar 6 2018, 2:25 PM

Can the parser function extract contents from a user-chosen section? For example, it currently can fetch full text, a number of sentences or the intro. Instead of intro it could fetch sections, leading section included:

changing

{{#WikipediaExtract: Title of the article | intro = true}}

with

{{#WikipediaExtract: Title of the article | section = 0}} /* Leading section */

Parameter "intro=true" could be an alias for "section=0".

Other sections would be like this one:

{{#WikipediaExtract: Title of the article | section = 1}} /* First named section "== ... ==" */

But there would be a problem. What would be taken as section by the extension? Level 2 "== ... =="? Level 3 "=== ... ==="? All levels? Could subsections be fetched with its parent section when requested if only a certain section level could be requested by the parser function? How to manage when a page is using a non-standard section layout (such as using level 1 sections "= ... =" or no sections at all)?

Zerabat added a comment.EditedMar 6 2018, 2:37 PM

Could it be set to fetch contents from several wikis?
For example:

$wgWikipediaExtractsAPI = array ('//en.wikipedia.org/w/api.php',
                                 '//en.wiktionary.org/w/api.php',
                                 '//en.wikiquote.org/w/api.php'
);

But if it would be like that, there should exist a way to solve a conflict between pages with the same name in several wikis.

Sophivorus added a comment.EditedMar 6 2018, 2:44 PM

@Zerabat WikipediaExtracts uses the API introduced by the TextExtracts extension, which doesn't include an option for extracting sections. Therefore, if you want to implement this functionality, you should probably look towards extending the TextExtracts API. Once that is done, extending WikipediaExtracts to take advantage of the extended API would be trivial. Now there's also the PageSummary API, which may introduce some other way of adding this functionality. Not sure though.

Regarding using contents from several wikis, it's a natural and desired addition to the current functionality. There are a couple of ways to solve the issue you mention, such as giving priority to wikis higher on the array, or giving a key to each API in the array, and requiring the user to input the key for the desired wiki into the parser function. However, all this extra functionality shouldn't be necessary for approving the current version of the extension. I've been waiting for it to get approved for MORE THAN TWO YEARS, so I'm not about to introduce any new functionality that may rise new concerns or bugs and delay the approval any further. Thanks for the interest and the suggestion though!

I understand your input. My suggestions were to a future version of the extension(s), not the current deployment, which has been feature-frozen in order to become stable.

Can anyone with the skill and permissions approve and deploy this extension? It's been more than two years already. Is this how the Wikimedia Foundation integrates features developed by the community?

Also looks like https://gerrit.wikimedia.org/r/#/c/399999/ is still awaiting a review :(

Change 399999 abandoned by Sophivorus:
Respond to second code review by Reedy

Reason:
Cannot merge anymore. Manually rebasing this change.

https://gerrit.wikimedia.org/r/399999

Change 485366 had a related patch set uploaded (by Sophivorus; owner: Sophivorus):
[mediawiki/extensions/WikipediaExtracts@master] Respond to second code review by Reedy

https://gerrit.wikimedia.org/r/485366

Change 485366 merged by Sophivorus:
[mediawiki/extensions/WikipediaExtracts@master] Respond to second code review by Reedy

https://gerrit.wikimedia.org/r/485366