
Word count for field "Biography" ignores non-latin
Open, Lowest | Public | Bug Report

Description

Author: wikimedia.org

Description:
I have a Hebrew-language wiki and I had to set the Biography minimum word count to 0; otherwise the form rejects every submitted Biography, always claiming there are not enough words. A simple Latin Lorem Ipsum works fine. I suppose some sort of regular expression in the tokenizer/word counter needs adjusting.


Version: REL1_21-branch
Severity: major

Details

Reference
bz55734

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 2:35 AM
bzimport set Reference to bz55734.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to comment #0)

I have a Hebrew-language wiki and I had to set the Biography minimum word count to 0; otherwise the form rejects every submitted Biography, always claiming there are not enough words. A simple Latin Lorem Ipsum works fine. I suppose some sort of regular expression in the tokenizer/word counter needs adjusting.

Aklapper lowered the priority of this task from Medium to Lowest. Dec 29 2014, 12:29 AM
Aklapper subscribed.

Hi, is this still a problem? It has been 2 years since MediaWiki 1.21 was released, and 1.21 is now unsupported.

Aklapper changed the subtype of this task from "Task" to "Bug Report". Aug 18 2022, 9:06 PM
Aklapper removed a subscriber: wikibugs-l-list.

This is because ConfirmAccount internally uses PHP's str_word_count (line 141 of /includes/business/AccountRequestSubmission.php as of the master version of the extension). That function is locale-dependent and does not support multibyte encodings, so it miscounts words in UTF-8 text written in non-Latin scripts.

In 38f20cc79815bbe588362e84ba106364e8164d15 back in 2019, to fix T60280, I somewhat "fixed" a similar issue for ArticleFeedbackv5 by using a custom, user-submitted UTF-8-aware str_word_count implementation. It's hardly perfect, as it still returns incorrect results at least for CJK languages, and perhaps even for Hebrew, which is the language mentioned in the initial report here.

Still, copying what AFTv5 does is probably an improvement over the status quo, even if not by much. However, I'd hope that something better would be available now, 6 years later.
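
For illustration only, here is a minimal sketch of what a UTF-8-aware count could look like. It is not the AFTv5 implementation, and the function name countWordsUnicode is hypothetical. It assumes a "word" is any run of Unicode letters, combining marks, or digits, which is a simplification: contractions such as "don't" count as two words, and unspaced CJK text still counts as one word per run.

```php
<?php
// Hypothetical helper; not part of ConfirmAccount or ArticleFeedbackv5.
function countWordsUnicode( string $text ): int {
	// Assumption: a "word" is a run of Unicode letters (\p{L}),
	// combining marks (\p{M}, e.g. Hebrew niqqud), or digits (\p{N}).
	// The /u modifier makes PCRE treat pattern and subject as UTF-8.
	$count = preg_match_all( '/[\p{L}\p{M}\p{N}]+/u', $text );

	// preg_match_all() returns false on failure, e.g. invalid UTF-8 input.
	return $count === false ? 0 : $count;
}

var_dump( countWordsUnicode( 'שלום עולם' ) );                  // int(2)
var_dump( countWordsUnicode( 'Lorem ipsum dolor sit amet' ) ); // int(5)
```

Something along these lines would at least let the Hebrew case from the report pass a minimum-word check, though proper word segmentation for CJK would need a different approach.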

cc'ing @MarkAHershberger for awareness & thoughts on this.