Wikibits patch #10 - more numbers
Closed, ResolvedPublic

Description

Author: mhorvath2161

Description:
unified diff file

This patch allows more number formats to be detected properly when sorting tables.

Specifically, numbers beginning with "-" or "+", or containing "e" or "E" are handled properly.

This patch requires that the multiple "if" statements in this particular section also be changed to "if" and "if else" statements. Otherwise, the regexp for numeric values will falsely detect some values that are actually dates.


Version: unspecified
Severity: enhancement

attachment wikibits_more_numbers.diff ignored as obsolete

bzimport added a project: MediaWiki-Parser.Via ConduitNov 21 2014, 10:23 PM
bzimport added a subscriber: Unknown Object (MLST).
bzimport set Reference to bz15422.
bzimport created this task.Via LegacySep 1 2008, 4:12 PM
bzimport added a comment.Via ConduitSep 1 2008, 8:01 PM

ayg wrote:

Do all browsers support all of these formats? If some browsers don't support exponential formats, it might be best not to let those through. If they've all supported them since the beginning, though, which is quite possible, it's fine.

bzimport added a comment.Via ConduitSep 2 2008, 1:20 AM

mhorvath2161 wrote:

Javascript is based on the ECMA standard. It shouldn't matter what browser you use. This has nothing to do with character entities or text encoding or anything like that.

bzimport added a comment.Via ConduitSep 2 2008, 1:21 AM

mhorvath2161 wrote:

I just want to re-emphasize that this patch requires some changes made in one of the earlier patches (and not included here) in order to work.

bzimport added a comment.Via ConduitSep 2 2008, 2:06 AM

ayg wrote:

(In reply to comment #2)

Javascript is based on the ECMA standard. It shouldn't matter what browser you
use. This has nothing to do with character entities or text encoding or
anything like that.

Yeah, theory is a wonderful thing . . . it's probably correct on this point, I'm just being paranoid.

(In reply to comment #3)

I just want to re-emphasize that this patch requires some changes made in one
of the earlier patches (and not included here) in order to work.

Can you post a full patch that will implement this by itself? Or at least indicate exactly which other patch is needed, and which lines of that patch?

bzimport added a comment.Via ConduitSep 2 2008, 4:05 AM

mhorvath2161 wrote:

unified diff file

Sure, here you go.

attachment wikibits_more_numbers_b.diff ignored as obsolete

bzimport added a comment.Via ConduitSep 2 2008, 4:12 AM

mhorvath2161 wrote:

The code (the new and the original) searches for the percent sign (%) if it exists. I was wondering if there were any other symbols that should be searched for? I can't think of any off the top of my head, besides maybe the degrees symbol (°).

Actually, the plus-minus symbol (±) might be a good thing to catch, too.

The ¼ ½ and ¾ symbols could easily be replaced with decimal values.

² and ³ are a bit tougher. They would need to be evaluated as expressions in addition to being replaced.

bzimport added a comment.Via ConduitSep 2 2008, 4:14 AM

mhorvath2161 wrote:

The tricky thing is if these ANSI codes represent something different in other character encodings.

bzimport added a comment.Via ConduitSep 2 2008, 4:55 AM

mhorvath2161 wrote:

unified diff file

Added support for additional mathematics/science symbols as outlined in my previous post.

attachment wikibits_more_numbers_c.diff ignored as obsolete

bzimport added a comment.Via ConduitSep 2 2008, 6:01 PM

mhorvath2161 wrote:

unified diff file

Slight tweak.

attachment wikibits_more_numbers_c.diff ignored as obsolete

bzimport added a comment.Via ConduitSep 3 2008, 12:19 AM

ayg wrote:

(In reply to comment #8)

Added support for additional mathematics/science symbols as outlined in my
previous post.

I think this goes much too far. We don't need to write a mini-parser here. The treatment of ± is outright wrong, too: it makes no sense to try sorting such a thing relative to numbers known to be positive or negative.

I'm going to look at attachment 5256 instead:

  • sortfn = ts_sort_caseinsensitive;

+ var sortfn = ts_sort_caseinsensitive;

	if (itm.match(/^\d\d[\/. -][a-zA-Z]{3}[\/. -]\d\d\d\d$/))
  • sortfn = ts_sort_date;
  • if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d\d\d$/))
  • sortfn = ts_sort_date;
  • if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d$/))
  • sortfn = ts_sort_date;
  • if (itm.match(/^[\u00a3$\u20ac]/)) // pound dollar euro
  • sortfn = ts_sort_currency;
  • if (itm.match(/^[\d.,]+\%?$/))
  • sortfn = ts_sort_numeric;

+ var sortfn = ts_sort_date;
+ else if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d\d\d$/))
+ var sortfn = ts_sort_date;
+ else if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d$/))
+ var sortfn = ts_sort_date;
+ else if (itm.match(/^[\u00a3$\u20ac]/)) // pound dollar euro
+ var sortfn = ts_sort_currency;
+ else if (itm.match(/^[\d.,-+eE]+\%?$/))
+ var sortfn = ts_sort_numeric;

Why are you adding "var" before each sortfn assignment? That should only be needed on the initial assignment. Even if not, *please* keep one thing to one patch. Same for if -> else if. The substantive change here *seems* to be one line, but I'd prefer not to have to look through ten times as many lines to find it:

  • if (itm.match(/^[\d.,]+\%?$/))

+ if (itm.match(/^[\d.,-+eE]+\%?$/))

Note that this seems incorrect, "-" has a special meaning here and should be at the end of the character class. I've committed this line's change as r40348, with the character class reordered so "-" is at the end.

I've also committed the stylistic changes *separately* as r40349 -- except with only the first variable declaration marked local, not all of them. If there's some reason to mark all of them with "var" separately, let me know.

bzimport added a comment.Via ConduitSep 3 2008, 2:21 AM

mhorvath2161 wrote:

(In reply to comment #10)

(In reply to comment #8)
> Added support for additional mathematics/science symbols as outlined in my
> previous post.
I think this goes much too far. We don't need to write a mini-parser here.
The treatment of ± is outright wrong, too: it makes no sense to try sorting
such a thing relative to numbers known to be positive or negative.

I don't think it's overkill. There are only a few ANSI characters used in mathematics that aren't already understood by javascript. As for the ± symbol, I wouldn't expect it to present a problem the way it is typically used.

I'm going to look at attachment 5256 [details] instead:
>- sortfn = ts_sort_caseinsensitive;
>+ var sortfn = ts_sort_caseinsensitive;
> if (itm.match(/^\d\d[\/. -][a-zA-Z]{3}[\/. -]\d\d\d\d$/))
>- sortfn = ts_sort_date;
>- if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d\d\d$/))
>- sortfn = ts_sort_date;
>- if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d$/))
>- sortfn = ts_sort_date;
>- if (itm.match(/^[\u00a3$\u20ac]/)) pound dollar euro
>- sortfn = ts_sort_currency;
>- if (itm.match(/^[\d.,]+\%?$/))
>- sortfn = ts_sort_numeric;
>+ var sortfn = ts_sort_date;
>+ else if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d\d\d$/))
>+ var sortfn = ts_sort_date;
>+ else if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d$/))
>+ var sortfn = ts_sort_date;
>+ else if (itm.match(/^[\u00a3$\u20ac]/))
pound dollar euro
>+ var sortfn = ts_sort_currency;
>+ else if (itm.match(/^[\d.,-+eE]+\%?$/))
>+ var sortfn = ts_sort_numeric;
Why are you adding "var" before each sortfn assignment? That should only be
needed on the initial assignment. Even if not, *please* keep one thing to one
patch. Same for if -> else if. The substantive change here *seems* to be one
line, but I'd prefer not to have to look through ten times as many lines to
find it:

First of all, the statements are preceded by var because they *are* the first instance. There's no sense in defining it an extra time just to eliminate some characters when it's not necessary. Secondly you specifically asked me to include the additional code needed for this patch to work. I quote:

Can you post a full patch that will implement this by itself? Or at least
indicate exactly which other patch is needed, and which lines of that patch?

>- if (itm.match(/^[\d.,]+\%?$/))
>+ if (itm.match(/^[\d.,-+eE]+\%?$/))
Note that this seems incorrect, "-" has a special meaning here and should be at
the end of the character class. I've committed this line's change as r40348,
with the character class reordered so "-" is at the end.

Good catch.

I've also committed the stylistic changes *separately* as r40349 -- except with
only the first variable declaration marked local, not all of them. If there's
some reason to mark all of them with "var" separately, let me know.

As I explained, the "var"s are necessary, otherwise you are creating a *global* variable within each condition. This is unneeded additional overhead.

bzimport added a comment.Via ConduitSep 3 2008, 2:30 AM

mhorvath2161 wrote:

(In reply to comment #11)

(In reply to comment #10)
> Why are you adding "var" before each sortfn assignment? That should only be
> needed on the initial assignment. Even if not, *please* keep one thing to one
> patch. Same for if -> else if. The substantive change here *seems* to be one
> line, but I'd prefer not to have to look through ten times as many lines to
> find it:
First of all, the statements are preceded by var because they *are* the first
instance. There's no sense in defining it an extra time just to eliminate some
characters when it's not necessary. Secondly you specifically asked me to
include the additional code needed for this patch to work. I quote:

Ah, I see my error now. The block of code is supposed to look like this:

+ if (itm.match(/^\d\d[\/. -][a-zA-Z]{3}[\/. -]\d\d\d\d$/))
+ var sortfn = ts_sort_date;
+ else if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d\d\d$/))
+ var sortfn = ts_sort_date;
+ else if (itm.match(/^\d\d[\/.-]\d\d[\/.-]\d\d$/))
+ var sortfn = ts_sort_date;
+ else if (itm.match(/^[\u00a3$\u20ac]/)) // pound dollar euro
+ var sortfn = ts_sort_currency;
+ else if (itm.match(/^[\d.,+eE-]+\%?$/))
+ var sortfn = ts_sort_numeric;
+ else
+ var sortfn = ts_sort_caseinsensitive;

bzimport added a comment.Via ConduitSep 3 2008, 2:36 AM

ayg wrote:

(In reply to comment #11)

I don't think it's overkill. There are only a few ANSI characters used in
mathematics that aren't already understood by javascript.

Why just [[Latin-1]] characters (more proper term than "[[ANSI]]") and not Unicode generally? It's a slippery slope here, especially ² and ³. When is anyone ever going to square tabulated data? ± might be worth it, but Unicode fractions are rarely used, and if you support any you should support all, not just the Latin-1 ones. % might be useful, but it's already treated more or less correctly by just stripping it: the only issue would be if people mix percents and normal numbers in the same table, which is weird.

First of all, the statements are preceded by var because they *are* the first
instance. There's no sense in defining it an extra time just to eliminate some
characters when it's not necessary.

No, they aren't the first instance. To wit:

var sortfn = ts_sort_caseinsensitive;

That's executed unconditionally at the start.

Secondly you specifically asked me to
include the additional code needed for this patch to work. I quote:

By that I meant include additional code needed for this patch to *work*, not additional code needed for the patch to *apply* in its unduly modified form. Of course your original patch was one line, but I didn't remember that when I wrote my last comment, what with the other ten patches by you I've been looking at. I thought you meant a meaningful dependency, not a formatting dependency, which is why I said it should be included in the patch. Patches should be *formatted* to apply against current trunk.

(In reply to comment #12)

Ah, I see my error now. The block of code is supposed to look like this:

Which is one line plus 4*5 = 20 characters longer. :)

bzimport added a comment.Via ConduitSep 3 2008, 4:20 AM

mhorvath2161 wrote:

(In reply to comment #13)

(In reply to comment #11)
> I don't think it's overkill. There are only a few ANSI characters used in
> mathematics that aren't already understood by javascript.
Why just [[Latin-1]] characters (more proper term than "[[ANSI]]") and not
Unicode generally? It's a slippery slope here, especially ² and ³. When is
anyone ever going to square tabulated data? ± might be worth it, but Unicode
fractions are rarely used, and if you support any you should support all, not
just the Latin-1 ones. % might be useful, but it's already treated more or
less correctly by just stripping it: the only issue would be if people mix
percents and normal numbers in the same table, which is weird.

Yes, the powers of 2 and 3 were maybe overkill (and lead to some false results). I'll post a trimmed-down version. The fractions ½ ¼ ¾ require only simple substitutions, no evaluation of any kind. Percentages are also an easy case.

> First of all, the statements are preceded by var because they *are* the first
> instance. There's no sense in defining it an extra time just to eliminate some
> characters when it's not necessary.
No, they aren't the first instance. To wit:

var sortfn = ts_sort_caseinsensitive;

That's executed unconditionally at the start.

Hence my correction.

> Secondly you specifically asked me to
> include the additional code needed for this patch to work. I quote:
By that I meant include additional code needed for this patch to *work*, not
additional code needed for the patch to *apply* in its unduly modified form.
Of course your original patch was one line, but I didn't remember that when I
wrote my last comment, what with the other ten patches by you I've been looking
at. I thought you meant a meaningful dependency, not a formatting dependency,
which is why I said it should be included in the patch. Patches should be
*formatted* to apply against current trunk.

The changes *were* required in order for it to work. In its original form, the numerical search might have picked up a date when it wasn't supposed to.

> Ah, I see my error now. The block of code is supposed to look like this:
Which is one line plus 4*5 = 20 characters longer. :)

You're missing the point. Declaring the variable at the beginning results in an additional assignment and an (expensive) additional lookup in global scope in 5 out of the 6 cases.

bzimport added a comment.Via ConduitSep 3 2008, 4:22 AM

mhorvath2161 wrote:

unified diff file

Scaled-down version of the last patch.

attachment wikibits_more_numbers_d.diff ignored as obsolete

bzimport added a comment.Via ConduitSep 3 2008, 4:27 AM

mhorvath2161 wrote:

Also, you retained a bug from the previous version when you committed the change.

In /^[\d.,eE+-]+\%?$/ the e and E may appear at the beginning of the text, which is not a number.

bzimport added a comment.Via ConduitSep 3 2008, 5:48 AM

mhorvath2161 wrote:

unified diff file

The previous versions of the code were picking up a lot of (admittedly unlikely) false positives. Hopefully I've eliminated them. I'll also upload the test script I've been using just in case.

attachment wikibits_more_numbers_d.diff ignored as obsolete

bzimport added a comment.Via ConduitSep 3 2008, 5:57 AM

mhorvath2161 wrote:

unified diff file

Forgot something in the last version.

attachment wikibits_more_numbers_d.diff ignored as obsolete

bzimport added a comment.Via ConduitSep 3 2008, 5:59 AM

mhorvath2161 wrote:

test script

Here's the test script I've been using to eliminate errors.

attachment js_numeric_test.js ignored as obsolete

bzimport added a comment.Via ConduitSep 3 2008, 6:09 AM

mhorvath2161 wrote:

unified diff file

Cought something else.

Attached: wikibits_more_numbers_d.diff

bzimport added a comment.Via ConduitSep 3 2008, 6:10 AM

mhorvath2161 wrote:

test script

Update to the test script.

Attached: js_numeric_test.js

bzimport added a comment.Via ConduitSep 3 2008, 2:03 PM

ayg wrote:

Please make all new patches against SVN trunk. Your current patches wouldn't have applied (even if I wanted to apply them) because of r40348 and r40349. Before you make changes to submit a patch, right-click on wikibits.js or the directory and choose "Update" or something like that, to make sure the copy you're patching against is up-to-date.

(In reply to comment #14)

The fractions ½ ¼ ¾ require only
simple substitutions, no evaluation of any kind.

How about the fractions ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞? And you need to consider "1 ½", "1-½", etc. Nobody even uses these, [[WP:MOS]] and most other guides recommend against them. It adds bloat for little gain here.

The changes *were* required in order for it to work. In its original form, the
numerical search might have picked up a date when it wasn't supposed to.

Well, you should *say* these things, so that I don't make incorrect assumptions about what your patch is supposed to do. It wasn't obvious.

You're missing the point. Declaring the variable at the beginning results in an
additional assignment and an (expensive) additional lookup in global scope in 5
out of the 6 cases.

You're missing the point. *Performance does not matter here.* This piece of code is run *once* for every time the user clicks the button. You're talking probably under a microsecond's difference for each *click*. The physical process of clicking the mouse button takes thousands of times as long.

Stop worrying about performance when you don't need to. It makes code uglier and harder to understand and maintain.

(In reply to comment #16)

Also, you retained a bug from the previous version when you committed the
change.

In /^[\d.,eE+-]+\%?$/ the e and E may appear at the beginning of the text,
which is not a number.

There are a lot more possible problems than that, in fact. + and - can also appear anywhere, you can have multiple dots, etc. It doesn't support hex. I've created a new regex based on the ECMAScript standard:

http://www.ecma-international.org/cgi-bin/counters/unicounter.pl?name=Ecma-262&deliver=http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf

I've committed it in r40379 (with a fix in r40382), so there should be nothing incorrect detected as numbers anymore.

bzimport added a comment.Via ConduitSep 3 2008, 11:18 PM

mhorvath2161 wrote:

(In reply to comment #22)

Please make all new patches against SVN trunk. Your current patches wouldn't
have applied (even if I wanted to apply them) because of r40348 and r40349.
Before you make changes to submit a patch, right-click on wikibits.js or the
directory and choose "Update" or something like that, to make sure the copy
you're patching against is up-to-date.
(In reply to comment #14)
> The fractions ½ ¼ ¾ require only
> simple substitutions, no evaluation of any kind.
How about the fractions ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞? And
you need to consider "1 ½", "1-½", etc. Nobody even uses these, [[WP:MOS]]
and most other guides recommend against them. It adds bloat for little gain
here.

It seems to be used quite frequently, and is even included in the JavaScript character insertion toolbox of the edit page.

> The changes *were* required in order for it to work. In its original form, the
> numerical search might have picked up a date when it wasn't supposed to.
Well, you should *say* these things, so that I don't make incorrect assumptions
about what your patch is supposed to do. It wasn't obvious.

I did say so in message #5.

> You're missing the point. Declaring the variable at the beginning results in an
> additional assignment and an (expensive) additional lookup in global scope in 5
> out of the 6 cases.
You're missing the point. *Performance does not matter here.* This piece of
code is run *once* for every time the user clicks the button. You're talking
probably under a microsecond's difference for each *click*. The physical
process of clicking the mouse button takes thousands of times as long.
Stop worrying about performance when you don't need to. It makes code uglier
and harder to understand and maintain.

Might I remind you that my previous optimizations were similarly labeled trivial, yet I was able to achieve a twenty-fold increase in javascript performance. Also, lag in user interface elements is particularly insidious.

(In reply to comment #16)
> Also, you retained a bug from the previous version when you committed the
> change.
>
> In /^[\d.,eE+-]+\%?$/ the e and E may appear at the beginning of the text,
> which is not a number.
There are a lot more possible problems than that, in fact. + and - can also
appear anywhere, you can have multiple dots, etc. It doesn't support hex.

Yes, hence my subsequent patches. I feel your lack of keeping current with the latest changes to be suspect.

I've created a new regex based on the ECMAScript standard:
http://www.ecma-international.org/cgi-bin/counters/unicounter.pl?name=Ecma-262&deliver=http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf
I've committed it in r40379 (with a fix in r40382), so there should be nothing
incorrect detected as numbers anymore.

There are several issues with your changes:

1 commas can appear at the beginning of a number
2 multiple commas may appear in a sequence or separated by fewer than three digits
3 multiple periods may appear in a single number
4 commas may appear after a period
5 sequences surrounded by parentheses are stored as matches when they don't need to be

All of which I have resolved in my latest changes.

bzimport added a comment.Via ConduitSep 3 2008, 11:26 PM

mhorvath2161 wrote:

(In reply to comment #22)

Please make all new patches against SVN trunk. Your current patches wouldn't
have applied (even if I wanted to apply them) because of r40348 and r40349.
Before you make changes to submit a patch, right-click on wikibits.js or the
directory and choose "Update" or something like that, to make sure the copy
you're patching against is up-to-date.
(In reply to comment #14)
> The fractions ½ ¼ ¾ require only
> simple substitutions, no evaluation of any kind.
How about the fractions ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞? And
you need to consider "1 ½", "1-½", etc. Nobody even uses these, [[WP:MOS]]
and most other guides recommend against them. It adds bloat for little gain
here.

Forgot to add a link to support the frequency of the character's usage:

http://www.google.com/search?hl=en&q=site%3Aen.wikipedia.org+%C2%BD&aq=f&oq=site%3Aen.wikipedia.org+%C2%BD%22

Also, I don't think "1 ½" "1-½" are valid.

There are a lot more possible problems than that, in fact. + and - can also
appear anywhere, you can have multiple dots, etc. It doesn't support hex.
I've created a new regex based on the ECMAScript standard:
http://www.ecma-international.org/cgi-bin/counters/unicounter.pl?name=Ecma-262&deliver=http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf
I've committed it in r40379 (with a fix in r40382), so there should be nothing
incorrect detected as numbers anymore.

Having three separate regexps sharing multiple sequences also slows down the code.

bzimport added a comment.Via ConduitSep 8 2008, 4:07 PM

ayg wrote:

(In reply to comment #23)

It seems to be used quite frequently, and is even included in the JavaScript
character insertion toolbox of the edit page.

It still adds complication . . . I'm not sure. I'd like some more opinions on this.

I did say so in message #5.

Providing a patch that also made changes that clearly weren't necessary for this to work (adding var), and where you didn't explain why changing "if" to "else if" was actually necessary -- which wasn't obvious at all.

Might I remind you that my previous optimizations were similarly labeled
trivial, yet I was able to achieve a twenty-fold increase in javascript
performance. Also, lag in user interface elements is particularly insidious.

I'll reserve judgment on your other optimizations (not *all* of which anyone called trivial, only some of them) until I've reviewed them. Microseconds or less per click is not insidious, it's invisible.

Yes, hence my subsequent patches. I feel your lack of keeping current with the
latest changes to be suspect.

"Suspect"? I'm a volunteer here. I'm kindly setting aside my own time to review your patches out of the goodness of my heart. It's pretty rude of you to question my commitment when I have no obligation to pay attention to you at all, and when in fact most patches to MediaWiki go totally unreviewed and are ignored for months or years.

Your subsequent patches were totally unexplained, so I didn't realize what exactly you had done with them. If you had *pointed out* what problems they fixed, I might not have seen the need to make that comment. Besides, your subsequent patches make mistakes that mine didn't. They consider, for instance, the single string "." to be a number.

1 commas can appear at the beginning of a number

Yeah, I spotted that when re-reviewing one of your patches above. I've fixed that in r40611.

2 multiple commas may appear in a sequence or separated by fewer than three
digits

I'm not sure if that's good or bad. Is sorting something like "40,52" as the number 4052 better or worse than sorting it as a string? Sorting "," as 0 or NaN or whatever it evaluates to is definitely wrong, and I've fixed that.

3 multiple periods may appear in a single number

How? I don't see how that's possible with the regexes I committed.

4 commas may appear after a period

It doesn't hurt anything. I guess it would simplify the regex a bit if that didn't happen, but the commas will be stripped before evaluation anyway.

5 sequences surrounded by parentheses are stored as matches when they don't
need to be

Chucking in extra "?:" everywhere makes the code uglier for no good reason.

(In reply to comment #24)

Also, I don't think "1 ½" "1-½" are valid.

Why not?

Having three separate regexps sharing multiple sequences also slows down the
code.

I'm simply not going to consider any claims of slowing down anything unless you can provide either benchmarks or a convincing theoretical argument for why the slowdown might be nontrivial. Some microoptimizations are worth making, but most aren't except in performance-critical code paths.

bzimport added a comment.Via ConduitSep 9 2008, 11:39 AM

mhorvath2161 wrote:

(In reply to comment #25)

(In reply to comment #23)
> I did say so in message #5.
Providing a patch that also made changes that clearly weren't necessary for
this to work (adding var), and where you didn't explain why changing "if" to
"else if" was actually necessary -- which wasn't obvious at all.

I already explained earlier in this thread that the "var"s are necessary, otherwise you end up possibly overriding the regexps for the dates.

> Yes, hence my subsequent patches. I feel your lack of keeping current with the
> latest changes to be suspect.
"Suspect"? I'm a volunteer here. I'm kindly setting aside my own time to
review your patches out of the goodness of my heart. It's pretty rude of you
to question my commitment when I have no obligation to pay attention to you at
all, and when in fact most patches to MediaWiki go totally unreviewed and are
ignored for months or years.

I am also a volunteer, and am no less obligated to submit patches to the software. Don't act like you're doing me a favor.

Your subsequent patches were totally unexplained, so I didn't realize what
exactly you had done with them. If you had *pointed out* what problems they
fixed, I might not have seen the need to make that comment.

Each of my patches is accompanied by a comment as to the nature of the changes. They're easily enough compared with previous versions using the "diff" command that exists next to the link.

Besides, your
subsequent patches make mistakes that mine didn't. They consider, for
instance, the single string "." to be a number.

No you're wrong. The only period in the regexp is escaped, and therefore only matches a literal dot.

> 1 commas can appear at the beginning of a number
Yeah, I spotted that when re-reviewing one of your patches above. I've fixed
that in r40611.
> 2 multiple commas may appear in a sequence or separated by fewer than three
> digits
I'm not sure if that's good or bad. Is sorting something like "40,52" as the
number 4052 better or worse than sorting it as a string? Sorting "," as 0 or
NaN or whatever it evaluates to is definitely wrong, and I've fixed that.

It's been a couple of days since I worked on it, so I'm a bit hazy on why this might be undesirable.

> 3 multiple periods may appear in a single number
How? I don't see how that's possible with the regexes I committed.

My mistake.

> 4 commas may appear after a period
It doesn't hurt anything. I guess it would simplify the regex a bit if that
didn't happen, but the commas will be stripped before evaluation anyway.
> 5 sequences surrounded by parentheses are stored as matches when they don't
> need to be

Most of the speed benefits (see below) are a direct result of this optimization.

Chucking in extra "?:" everywhere makes the code uglier for no good reason.
(In reply to comment #24)
> Also, I don't think "1 ½" "1-½" are valid.
Why not?

Authors would omit the space simply for the sake of readability. The minus sign in this case indicates and expression not a number, and expressions should not be picked up by the regexp.

> Having three separate regexps sharing multiple sequences also slows down the
> code.
I'm simply not going to consider any claims of slowing down anything unless you
can provide either benchmarks or a convincing theoretical argument for why the
slowdown might be nontrivial. Some microoptimizations are worth making, but
most aren't except in performance-critical code paths.

I'll attach the demonstration file where you can compare my and your regular expressions side by side. After five trials the results were:

2237ms for the current method
1996ms for the current regexp plus the "?:" at the beginning of the parentheses
1943ms for the method in my latest patch

bzimport added a comment.Via ConduitSep 9 2008, 11:42 AM

mhorvath2161 wrote:

javascript demonstration

Comparison of three methods of writing regular expressions. The regular expressions are tested for speed versus an identical input string. The results are sent to standard output.

Attached: regexp_speed_comparison.js

bzimport added a comment.Via ConduitSep 14 2008, 7:00 PM

ayg wrote:

(In reply to comment #26)

Each of my patches is accompanied by a comment as to the nature of the changes.
They're easily enough compared with previous versions using the "diff" command
that exists next to the link.

Not an informative comment. Things like "Forgot something in the last version" and "Caught something else" don't tell me what the actual difference is.

In any event, there's no point bickering about this. I'm telling you how to best get your patches reviewed. If you don't follow that advice -- although mostly you have followed it so far with regard to splitting patches up and so forth, and I thank you for that -- then you're going to drive away reviewers, including me, who are uninterested in spending the time to decipher, cut pieces out of, or reformat your patches when they could just write their own, or ignore the problem entirely.

> Besides, your
> subsequent patches make mistakes that mine didn't. They consider, for
> instance, the single string "." to be a number.

No you're wrong. The only period in the regexp is escaped, and therefore only
matches a literal dot.

Yes, that's my point. The literal string "." is counted as a number. Look at the regex:

/^[+-]?(?:\d{1,3}(?:\,\d{3})*)*\.?\d*(?:[eE][+-]?\d+(?:\,\d{3})*)?[%\u00b0\u00bc\u00bd\u00be]?$/

This can match nothing:

[+-]?(?:\d{1,3}(?:\,\d{3})*)*

This can also match nothing:

\d*(?:[eE][+-]?\d+(?:\,\d{3})*)?[%\u00b0\u00bc\u00bd\u00be]?

The part in between is this:

\.?

and that can match the literal string ".". For that matter, your pattern also matches the empty string, since every piece is followed by either ? or *.

> > Also, I don't think "1 ½" "1-½" are valid.
> Why not?

Authors would omit the space simply for the sake of readability. The minus sign
in this case indicates and expression not a number, and expressions should not
be picked up by the regexp.

"1 ½" is the same as "1½", which is the same as "1-½" (the hyphen/dash there is not a minus sign, it just separates the fractional part from the integer part). Googling "1 ½" shows a mix of the three formats:

http://www.google.com/search?q="1+½"

I'll attach the demonstration file where you can compare my and your regular
expressions side by side. After five trials the results were:

2237ms for the current method
1996ms for the current regexp plus the "?:" at the beginning of the parentheses
1943ms for the method in my latest patch

Yes, if you *run the code a hundred thousand times*. It's not run 100,000 times. It's run *once* when the user clicks the sort button. The difference is 22.37 *microseconds* for my code and 19.43 *microseconds* for your code, per click. So in other words, correct me if I'm mistaking how this code is used, you're making changes to the code that save the user three microseconds when they click on the sort button. Am I misconstruing how often this code runs, or do you get my point that it makes absolutely no difference to the speed of the application?

If I'm not wrong, we're basically down to the following differences between your version and mine:

  • Your patch treats fractions and percent signs more intelligently. I'm not sure if we really want to go in this direction. It's possibly scope creep, and there's no end to it. I'd like input from some more people on this, I'm not going to commit it by myself.
  • You're a little more picky about where commas go. Again, I'm not sure if this is needed or not. I won't do anything without further input from others.
  • Your version uses regexes that save maybe two microseconds per click. Unless I'm totally mistaken on how many times that code is executed, in which case I retract my remarks, I hope you agree we don't need to worry about complicating the regex over this.
bzimport added a comment.Via ConduitSep 17 2008, 1:21 AM

mhorvath2161 wrote:

(In reply to comment #28)

Yes, that's my point. The literal string "." is counted as a number. Look at
the regex:
/^[+-]?(?:\d{1,3}(?:\,\d{3})*)*\.?\d*(?:[eE][+-]?\d+(?:\,\d{3})*)?[%\u00b0\u00bc\u00bd\u00be]?$/
This can match nothing:
[+-]?(?:\d{1,3}(?:\,\d{3})*)*
This can also match nothing:
\d*(?:[eE][+-]?\d+(?:\,\d{3})*)?[%\u00b0\u00bc\u00bd\u00be]?
The part in between is this:
\.?
and that can match the literal string ".". For that matter, your pattern also
matches the empty string, since every piece is followed by either ? or *.

I've been trying to find a way to fix this, but the only solutions I've been able to come up with involve slapping additional tests onto the end, such as " && (itm != '') ", which I'd rather not do.

> > > Also, I don't think "1 ½" "1-½" are valid.
> > Why not?
>
> Authors would omit the space simply for the sake of readability. The minus sign
> in this case indicates and expression not a number, and expressions should not
> be picked up by the regexp.
"1 ½" is the same as "1½", which is the same as "1-½" (the hyphen/dash there
is not a minus sign, it just separates the fractional part from the integer
part). Googling "1 ½" shows a mix of the three formats:
http://www.google.com/search?q="1+½"

OK, but the same might be said of the percent sign that exists in the current revision; and, at least the version with the space in between (the most common of the two) could be caught by simply adding an "\s*" beforehand.

  • Your version uses regexes that save maybe two microseconds per click. Unless I'm totally mistaken on how many times that code is executed, in which case I retract my remarks, I hope you agree we don't need to worry about complicating the regex over this.

No, it's not a lot of time, but I like to be thorough.

bzimport added a comment.Via ConduitOct 8 2008, 6:38 PM

ayg wrote:

Being "thorough" should not equate to writing less readable code. The extra characters are not needed and so are just clutter.

Regardless, as I said, I'm not going to commit any further changes to the things brought up here (complexity creep) unless other developers are in favor. So I have nothing more to say here unless a third party comments.

DieBuche added a comment.Via ConduitApr 15 2011, 10:08 AM

Stuff like e/E and -/+ is fixed in r86088 (I think e/E was patched in the meantime as well)

I agree that ½ is unneccessary

Add Comment