The WikipediaExtracts extension code has been partially committed without following our CR +2 usual rules on Gerrit.
As such, a code review is needed.
Dereckson, Nov 2 2016, 2:39 AM
F11000409: Screen Shot 2017-11-26 at 00.00.52.png (Nov 26 2017, 12:04 AM)
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Invalid | None | | T148848 Add the Extension:WikipediaExtracts to the English Wikiversity |
| Invalid | None | | T149765 Deploy WikipediaExtracts extension |
| Invalid | None | Sophivorus | T149766 WikipediaExtracts code review |
Methods
Functions are too long; they do too many things.
Please follow https://jeroendedauw.github.io/slides/craftmanship/functions/ to refactor them into smaller methods, each doing one thing.
Localisation
```php
$lang = $this->mParser->getOptions()->getUserLangObj();
$dir = $lang->getDir();
…
Html::rawElement( 'span', [
	'class' => 'error mw-ext-cite-error',
	'lang' => $lang->getHtmlCode(),
	'dir' => $dir,
], $msg );
```
Note that the mSomething prefix isn't part of the coding conventions. If the constructor receives the parser object, you can use $parser instead of $mParser; that will be cleaner.
Magic words for parser functions tend to be lowercase (invoke for Scribunto, for example). For the French magic word, extraitDepuisWikipédia is probably a better keyword.
Conventions
A SensioLabs Insight analysis at commit 19c1c174781e gives me the following issues:
Duplicate code
In WikipediaExtracts.php, between lines 26 and 94, there are 48 identical lines. That should be solved when you divide the code into small methods: instead of duplicating code, you generally create a separate function.
Type hint
Visibility
Explicitly declare as public the functions currently lacking a visibility keyword: programming languages have different default visibilities (private for C#, public for PHP, for example).
Code-style aside, when I reviewed the code for this extension, it appeared to be designed for external users to re-use Wikipedia content. However if we're planning to deploy this inside the Wikimedia farm, it needs to be made significantly more robust. For example, if file_get_contents fails, there's no error handling. There's also no debugging logging. I'd recommend using MWHttpRequest instead of file_get_contents.
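As a rough sketch of that suggestion (names and structure here are illustrative, not taken from the extension's actual code), an MWHttpRequest-based fetch with error handling and debug logging might look like:

```php
// Hypothetical sketch: fetch the extract via MWHttpRequest instead of
// file_get_contents, with error handling and debug logging.
private static function fetchExtract( $url ) {
	$request = MWHttpRequest::factory( $url, [ 'timeout' => 10 ], __METHOD__ );
	$status = $request->execute();
	if ( !$status->isOK() ) {
		// Log the failure so operators can debug it, then surface an error
		wfDebugLog( 'WikipediaExtracts',
			"Request to $url failed: " . $status->getWikiText() );
		throw new MWException( 'WikipediaExtracts: request failed' );
	}
	return $request->getContent();
}
```

This requires the MediaWiki runtime (MWHttpRequest, wfDebugLog), so it is a sketch of the approach rather than standalone code.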
What about batching? If a page contains multiple <wikipediaextract> tags, shouldn't we batch the internal API requests? I also don't see the need for both a tag extension and parser function - just support and implement one.
Code issues:
@Legoktm If only one (tag extension or parser function) should be used, then favor the parser function, as I've found it much much more useful, because it can be combined with {{PAGENAME}} and other magic words, which is quite important in practice.
I'll help to improve the code following your guidelines asap, but today I'm leaving the city for a few days. Cheers!
I apologize for the ugly code you had to review. Some things I did wrong were out of ignorance, but most were out of laziness. It should be a bit better now.
One thing I didn't implement are batch requests. Because what if different calls to the parser function have different parameters? Should we check for that and do batch requests if they all have the same parameters, and multiple requests otherwise?
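As a sketch of one possible answer (illustrative PHP, not from the extension): calls could be grouped by their normalized parameter set, so identical calls share one batched API request and differing calls each get their own.

```php
<?php
// Illustrative sketch: group extract calls by their normalized parameter
// set. Calls sharing identical parameters are batched into one API request
// (titles joined with '|'); calls with different parameters each get their
// own request.
function groupRequests( array $calls ) {
	$groups = [];
	foreach ( $calls as $call ) {
		// Each $call is [ 'title' => string, 'params' => array ]
		$params = $call['params'];
		ksort( $params ); // normalize key order so identical sets compare equal
		$key = json_encode( $params );
		$groups[$key]['params'] = $params;
		$groups[$key]['titles'][] = $call['title'];
	}
	// Each group becomes a single API request
	return array_map( function ( $group ) {
		return [
			'titles' => implode( '|', $group['titles'] ),
			'params' => $group['params'],
		];
	}, array_values( $groups ) );
}
```

For example, two calls with `exintro=true` and one with `exchars=100` would collapse into two API requests rather than three.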
I'm removing myself as the assignee for the task because I think ultimately someone from the WMF must take care, right? Cheers.
This extension was requested by the English and Spanish Wikiversity months ago. Other projects are also likely to want it, once it's available and has been presented to them. There's an active and able volunteer willing to make the requested improvements until it's fit for production. All I need is someone from the WMF who can guide me and ultimately approve and deploy the extension.
See https://phabricator.wikimedia.org/T149765 for subtasks and https://www.mediawiki.org/wiki/Review_queue for the related checklist.
See e.g. T149424#2762206 for one current blocker.
@Aklapper The comment you linked to refers to this task, so I'm not sure what you mean by it. I've carefully gone through the checklist and I think there's nothing else I can do at this point. Yesterday I merged the changes I did to the extension based on the code review by @Legoktm and today I updated the extension documentation accordingly. I think the next step would be a second code review so that I can fix any remaining issues, close this task and move on to the security review. Looking forward to it!
Other little things..
Change 382507 had a related patch set uploaded (by Sophivorus; owner: Sophivorus):
[mediawiki/extensions/WikipediaExtracts@master] Requested minor fixes
Change 382507 merged by jenkins-bot:
[mediawiki/extensions/WikipediaExtracts@master] Requested minor fixes
```php
> var_dump( Language::isValidCode( 'aln' ) );
bool(true)
> var_dump( Language::isValidCode( 'en-gb' ) );
bool(true)
> var_dump( Language::isValidCode( 'notalanguage' ) );
bool(true)
```
The extension shouldn't tell me a language isn't valid when it is. It should differentiate between an invalid language code (our check is very vague) and a wiki that doesn't exist. invalid-language seems to be reused incorrectly:
```php
$status = $request->execute();
if ( !$status->isOK() ) {
	if ( $status->getValue() === 100 ) {
		throw new WikipediaExtractsError( 'invalid-language', self::$wikipediaLanguage );
	}
	throw new WikipediaExtractsError( 'error' );
}
```
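One possible way to separate the two cases, as a sketch only (the 'no-such-wiki' message key is hypothetical and would need a corresponding i18n entry):

```php
// Hypothetical sketch: validate the language code up front, so a request
// failure can be reported as a missing wiki rather than a bad code.
if ( !Language::isValidCode( self::$wikipediaLanguage ) ) {
	throw new WikipediaExtractsError( 'invalid-language', self::$wikipediaLanguage );
}
$status = $request->execute();
if ( !$status->isOK() ) {
	// The code itself looked valid, so the request failed for another
	// reason, e.g. there is no Wikipedia in that language
	throw new WikipediaExtractsError( 'no-such-wiki', self::$wikipediaLanguage );
}
```

Since this depends on MediaWiki classes, it illustrates the error-separation idea rather than drop-in code.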
Also, there should be a space between the text and the "credits", otherwise it looks silly
```php
if ( $wgWikipediaExtractsAddCredits ) {
	$html .= self::getCredits();
}
```
I would suspect the above might not be so RTL-friendly either, with its positioning. Probably?
I note that not all options seem to be documented:
```php
'exchars' => self::getParam( 'chars' ),
'exsentences' => self::getParam( 'sentences' ),
'exlimit' => self::getParam( 'limit' ),
'exintro' => self::getParam( 'intro' ),
'explaintext' => self::getParam( 'plaintext' ),
'exsectionformat' => self::getParam( 'sectionformat' ),
'excontinue' => self::getParam( 'continue' ),
'exvariant' => self::getParam( 'variant' ),
```
I would suspect passing them when they're not needed is rather pointless, too.
```php
$pair = explode( '=', $param, 2 );
if ( count( $pair ) === 2 ) {
	$name = trim( $pair[0] );
	$value = trim( $pair[1] );
	$array[ $name ] = $value;
} elseif ( count( $pair ) === 1 ) {
	$name = trim( $pair[0] );
	$array[ $name ] = true;
}
```
The trimming can be done in one place with an array_map, and the temporary variables aren't needed either:
```php
$pair = array_map( 'trim', explode( '=', $param, 2 ) );
if ( count( $pair ) === 2 ) {
	$array[ $pair[0] ] = $pair[1];
} elseif ( count( $pair ) === 1 ) {
	$array[ $pair[0] ] = true;
}
```
I kinda feel like we also need some caching on the outgoing web requests to the other Wikipedias...
https://www.mediawiki.org/wiki/Extension:WikipediaExtracts#Crediting_Wikipedia isn't this irrelevant now?
Change 393440 had a related patch set uploaded (by Sophivorus; owner: Sophivorus):
[mediawiki/extensions/WikipediaExtracts@master] Respond to Reedy's code review
@Reedy Thanks for your detailed code review! My latest patch set should respond to all your concerns in one way or another.
The "invalid language" issue has been bugging me for a while, so I thought of a definitive solution: the language is no longer guessed but has to be explicitly set as part of the URL of the API of the target wiki, in a new config option called $WikipediaExtractsAPI. This is because I've found (through reasoning and experience) that almost every wiki will extract content exclusively from the Wikipedia in its own language. It makes little to no sense to allow wikis to extract content from other Wikipedias, and it brings language validation troubles. Furthermore, by adding the $WikipediaExtractsAPI config option, we open the door for extracting content from wikis other than Wikipedia, such as Wiktionary! Overall, I think that this change in approach is a great improvement. In future iterations, I may expand the config option to allow multiple wikis, like so:
```php
$WikipediaExtractsAPI = [
	'w' => 'https://en.wikipedia.org/w/api.php',
	'v' => 'https://en.wikiversity.org/w/api.php'
];
```
Regarding the credits span, I removed this functionality in favor of adding the credits through templates, as explained in the docs. I wrote about it in the docs some days ago, before updating the code (bad practice) so it may have confused you. Sorry!
The rest of the changes should be self-explanatory, I hope. Any questions just let me know!
Thought about it, but the problem is that the target wiki has to have the TextExtracts extension enabled, and there's no way to tell that from the table/interwiki map.
Good to know! However, I would leave this improvement for a future iteration. It's been more than a year already and I'd really like to get this deployed. Ever heard of agile development? ;-)
Ok, but still, I think that such an improvement is better left for a future iteration. There's no real need for it and it's not easy to implement.
You probably need to update https://www.mediawiki.org/wiki/Extension:WikipediaExtracts again as I nuked your docs as they seemed like they were out of date...
Change 393440 abandoned by Sophivorus:
Respond to Reedy's code review
Reason:
Doing manual rebase...
Change 394825 had a related patch set uploaded (by Sophivorus; owner: Sophivorus):
[mediawiki/extensions/WikipediaExtracts@master] Respond to Reedy's code review
Change 394825 merged by jenkins-bot:
[mediawiki/extensions/WikipediaExtracts@master] Respond to Reedy's code review
```php
/**
 * Get a span with a link to Wikipedia
 */
private static function getCredits() {
	$title = Title::newFromText( self::$wikipediaTitle );
	$url = 'https://' . self::$wikipediaLanguage . '.wikipedia.org/wiki/' . $title->getPartialUrl();
	return Html::rawElement( 'small', [
		'lang' => self::$contentLanguage->getHtmlCode(),
		'dir' => self::$contentLanguage->getDir()
	], wfMessage( 'wikipediaextracts-credits', $url )->inContentLanguage()->plain() );
}
```
Title::newFromText can/will return null. Should probably guard against that.
The function doesn't return a span. Should be annotated @return string too
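A minimal guard, sketched against the getCredits() code quoted above (the empty-string fallback is one possible choice, not the extension's actual behavior):

```php
$title = Title::newFromText( self::$wikipediaTitle );
if ( $title === null ) {
	// The stored title wasn't a valid page title, so no credits link
	// can be built; returning an empty string is one safe option
	return '';
}
```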
{{#WikipediaExtract:https://en.wikipedia.org/w/index.php?title=Title_of_the_article}} won't work... But do we really care?
In parseParams, $params is not returned. Generally, across the code, you don't need to put the return variable name in the @return part.
https://www.mediawiki.org/wiki/Extension:WikipediaExtracts#Usage - Title is a valid parameter but isn't documented... It kinda seems useless if the unparameterised string is the title; and if you set both, one is apparently going to win.
I'm trying to decide if the url parsing is even really necessary. Doesn't seem useful complexity. Now is the time to remove pointless complexity, forcing people to use things properly when they're implemented
Change 399999 had a related patch set uploaded (by Sophivorus; owner: Sophivorus):
[mediawiki/extensions/WikipediaExtracts@master] Respond to second code review by Reedy
@Reedy Hi, I like your code reviews, they are leading me to simplify the code considerably, thanks!
Some of your comments seemed to be based on a previous version of the code. The getCredits() method has been totally removed, because its functionality is better served by adding the credits via a template. I just updated the documentation of the extension to cover this. You also mention that I need not put the return variable name in the @return part. Well, I'm not sure if I ever did that, but in any case the current code doesn't mention the return variable name in the @return part.
I agree there's no need for the URL parsing or the "title" parameter. This led me to remove other things that became unnecessary, such as the getWikipediaTitle() method, and the $parser and $wikipediaTitle static properties.
Thanks again for the review!
Can the parser function extract contents from a user-chosen section? For example, it currently can fetch full text, a number of sentences or the intro. Instead of intro it could fetch sections, leading section included:
replacing
{{#WikipediaExtract: Title of the article | intro = true}}
with
{{#WikipediaExtract: Title of the article | section = 0}} /* Leading section */
Parameter "intro=true" could be an alias for "section=0".
Other sections would be like this one:
{{#WikipediaExtract: Title of the article | section = 1}} /* First named section "== ... ==" */
But there would be a problem: what would the extension treat as a section? Level 2 "== ... =="? Level 3 "=== ... ==="? All levels? Could subsections be fetched along with their parent section, if only a certain section level can be requested by the parser function? And how should pages with a non-standard section layout be handled (such as level 1 sections "= ... =", or no sections at all)?
Could it be set to fetch contents from several wikis?
For example:
```php
$wgWikipediaExtractsAPI = array(
	'//en.wikipedia.org/w/api.php',
	'//en.wiktionary.org/w/api.php',
	'//en.wikiquote.org/w/api.php'
);
```
But if it would be like that, there should exist a way to solve a conflict between pages with the same name in several wikis.
@Zerabat WikipediaExtracts uses the API introduced by the TextExtracts extension, which doesn't include an option for extracting sections. Therefore, if you want to implement this functionality, you should probably look towards extending the TextExtracts API. Once that is done, extending WikipediaExtracts to take advantage of the extended API would be trivial. Now there's also the PageSummary API, which may introduce some other way of adding this functionality. Not sure though.
Regarding using contents from several wikis, it's a natural and desired addition to the current functionality. There are a couple of ways to solve the issue you mention, such as giving priority to wikis higher in the array, or giving a key to each API in the array and requiring the user to input the key for the desired wiki into the parser function. However, all this extra functionality shouldn't be necessary for approving the current version of the extension. I've been waiting for it to get approved for MORE THAN TWO YEARS, so I'm not about to introduce any new functionality that may raise new concerns or bugs and delay the approval any further. Thanks for the interest and the suggestion though!
I understand your point. My suggestions were for a future version of the extension(s), not the current deployment, which has been feature-frozen in order to become stable.
Can anyone with the skill and permissions approve and deploy this extension? It's been more than two years already. Is this how the Wikimedia Foundation integrates features developed by the community?
@Sophivorus: That looks like a question for T149765: Deploy WikipediaExtracts extension?
Also looks like https://gerrit.wikimedia.org/r/#/c/399999/ is still awaiting a review :(
Change 399999 abandoned by Sophivorus:
Respond to second code review by Reedy
Reason:
Cannot merge anymore. Manually rebasing this change.
Change 485366 had a related patch set uploaded (by Sophivorus; owner: Sophivorus):
[mediawiki/extensions/WikipediaExtracts@master] Respond to second code review by Reedy
Change 485366 merged by Sophivorus:
[mediawiki/extensions/WikipediaExtracts@master] Respond to second code review by Reedy
After some talk with the WikiJournal user group at Wikiversity, I decided to generalize this extension into InterwikiExtracts, archive WikipediaExtracts, and in a few days start a new task for enabling InterwikiExtracts on Wikiversity, this time with the support of the WikiJournal user group. So this task no longer makes sense and I'm closing it.