
Make random selection slightly more random
Closed, Resolved, Public

Description

There are potential problems with 'clumping', where some entries may be assigned
duplicate cur_random values, or near-duplicates that bias selection toward
certain entries in a large sample.

Avoiding duplicates, changing random keys, or adding jitter could help.


Version: unspecified
Severity: minor

Details

Reference
bz589

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 6:57 PM
bzimport set Reference to bz589.
bzimport added a subscriber: Unknown Object (MLST).

river wrote:

enwiki:

mysql> select cur_random,count(*) as c from cur group by cur_random having c > 1;
1466 rows in set (2.04 sec)

wmahan_04 wrote:

The code that generates cur_random is

$rand = number_format( mt_rand() / mt_getrandmax(), 12, '.', '' );

which means we round to 12 decimal digits of precision. So, using an
approximation of the formula for the Birthday Problem, we would expect a
repeated cur_random value after about 1.2*sqrt(10^12) draws, i.e. roughly
1.2 million articles. Thus 12 digits should be enough.

However, on my (32-bit) machine, mt_getrandmax() returns only 2147483647 == 2^31-1.
This means we get a much smaller set of random values, and would expect a collision
after only about 50K articles, assuming the same value of mt_getrandmax.
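Both estimates above follow from the same birthday-problem rule of thumb, and can be checked numerically. A small sketch (Python here for illustration, not the original PHP):

```python
import math

def expected_draws_for_collision(space_size: int) -> float:
    """Approximate number of uniform draws before the first duplicate,
    using the birthday-problem rule of thumb sqrt(2 * ln 2 * N),
    i.e. roughly 1.18 * sqrt(N)."""
    return math.sqrt(2 * math.log(2) * space_size)

# 12 decimal digits of precision: a space of 10^12 distinct values.
print(round(expected_draws_for_collision(10**12)))  # roughly 1.2 million

# mt_rand() on a 32-bit machine: only 2^31 distinct values.
print(round(expected_draws_for_collision(2**31)))   # roughly 55 thousand
```

This confirms the figures in the comment: about a million articles with the full 12 digits, but only around 50K when the generator is limited to 2^31 outputs.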

I think a solution would be to use the RAND() MySQL function instead. MySQL 3.23
and up allow ORDER BY RAND(), although I'm guessing this was rejected
in favor of the cur_random column for some reason.

ORDER BY RAND() was rejected for being ungodly slow on tables with tens of
thousands of rows.

Can't really think of any reason not to use RAND(), though... might have been
concern about replication? Need to check that.

Give me a break, who cares if 100 articles have the same cur_random? Who is
going to notice? It's not a collision, it's a non-issue. If Special:Randompage
is our only method of article review, we have bigger problems than machine word
length.

Correctness is always a concern, even if the effect is minor.

Just because we have better things to do doesn't mean it's not worth doing! :)

wmahan_04 wrote:

(In reply to comment #3)

Can't really think of any reason not to use RAND(), though... might have been
concern about replication? Need to check that.

Probably because it's not possible to pass RAND() to the database wrapper functions;
they expect literal values and quote everything that isn't numeric before
querying the DB. So to fix this we'd need to prevent Database::makeList from
quoting RAND(), or use a separate query to get the random value (slower?).

Re Tim's comment, my understanding is that this could prevent articles with
duplicate cur_random values from ever being returned by the random page
function. True, not a big deal, but it would be nice to fix since we have more
than enough precision in the cur_random field. Also, it might be important for
statistical surveys to know that a truly random sample is taken, although I
can't claim to know whether this would make any practical difference.

Also, it might be important for statistical surveys to know that a truly
random sample is taken, although I can't claim to know whether this would make any
practical difference.

I do claim to know. If you take a population, randomly remove 100 members of
that population, then take a random sample of the remainder, that sample is
still representative of the original population.

What annoys me is superstition surrounding random numbers arising from the fact
that programmers don't understand them. Like when my parsing algorithm was
changed because there was a 1 in 2^64 probability of a minor parsing error on a
typical page, or 1 in 2^32 for a special kind of page 100 million times longer
than our page size limit. Never mind the IE expression() JS insertion bug, or
the countless other bugs that occur every single time you type in a particular
bit of text -- software engineers seem to have an inexplicable urge to tame
randomness even when it was just minding its own business.

As for the database wrappers: IIRC they already handle NULL distinctly from
strings/integers. It should be relatively straightforward to provide, say, a
wrapper class for raw SQL functions, which could be recognized and the raw
string passed through.
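The wrapper-class idea could look roughly like this. This is a hypothetical Python sketch of the pattern, not MediaWiki's actual Database class; all names here are invented for illustration:

```python
class RawSQL:
    """Marker wrapper: its contents are emitted verbatim instead of quoted."""
    def __init__(self, sql: str):
        self.sql = sql

def quote_value(value) -> str:
    """Quote a literal for SQL, but pass RawSQL-wrapped strings through."""
    if isinstance(value, RawSQL):
        return value.sql          # raw SQL expression, e.g. RAND()
    if value is None:
        return "NULL"             # NULL already handled distinctly
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"

def make_set_clause(assignments: dict) -> str:
    """Tiny stand-in for a Database::makeList-style SET-list builder."""
    return ", ".join(f"{col} = {quote_value(v)}" for col, v in assignments.items())

# RAND() survives unquoted, while ordinary values are still escaped.
print(make_set_clause({"cur_random": RawSQL("RAND()"), "cur_touched": "20041121"}))
# → cur_random = RAND(), cur_touched = '20041121'
```

The point is only the dispatch on the marker type: anything not wrapped keeps the existing quoting behavior, so callers opt in to raw SQL explicitly.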

wmahan_04 wrote:

I thought of a simple fix for this problem: use two calls to
mt_rand() to get the required amount of randomness.

Creating a million random numbers using the current method,
I got 262 dupes. Instead using this function:

function wfRandom() {
	$max = mt_getrandmax();
	$rand = number_format( mt_rand() * mt_rand() / $max / $max, 12, '.', '' );
	return $rand;
}

I got 0 dupes.
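The experiment is easy to reproduce. A Python sketch under the same assumptions (31-bit draws, as from mt_rand() on a 32-bit machine). One caveat on the fix as written: the product of two uniform draws is not itself uniformly distributed; combining the draws as high * range + low, shown below, enlarges the space to about 62 bits while keeping the result uniform:

```python
import random

MAX = 2**31 - 1  # mt_getrandmax() on a 32-bit machine

def count_duplicates(values) -> int:
    vals = list(values)
    return len(vals) - len(set(vals))

random.seed(589)  # bug number, for reproducibility
n = 500_000

# Single 31-bit draw: with 500K samples from a 2^31 space, the birthday
# approximation predicts roughly n^2 / 2^32, i.e. a few dozen duplicates.
single = [random.randint(0, MAX) for _ in range(n)]

# Two combined draws give a ~62-bit space; duplicates become vanishingly rare.
double = [random.randint(0, MAX) * (MAX + 1) + random.randint(0, MAX)
          for _ in range(n)]

print(count_duplicates(single), count_duplicates(double))
```

As expected, the single-draw list contains duplicates while the combined-draw list contains none.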

Also a note: we need to use a predetermined constant in Special:Randompage because RAND() in a WHERE clause doesn't
behave as expected (see comment I added to that file a few days ago), however RAND() should work fine when setting the
cur_random on page save.
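The Special:Randompage pattern being discussed — evaluate a random constant in application code first, then find the nearest row at or above it — can be illustrated with SQLite standing in for MySQL (a sketch of the technique; the schema is pared down from the cur table for illustration):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cur (cur_id INTEGER PRIMARY KEY, cur_random REAL)")
conn.executemany(
    "INSERT INTO cur (cur_id, cur_random) VALUES (?, ?)",
    [(i, random.random()) for i in range(1, 1001)],
)
conn.execute("CREATE INDEX cur_random_idx ON cur (cur_random)")

# Pick the random value up front and pass it as a predetermined constant:
# the query becomes an indexed range scan, unlike ORDER BY RAND(), which
# must generate a value per row and sort the whole table.
r = random.random()
row = conn.execute(
    "SELECT cur_id FROM cur WHERE cur_random >= ? ORDER BY cur_random LIMIT 1",
    (r,),
).fetchone()
if row is None:
    # r exceeded every stored value; wrap around to the smallest.
    row = conn.execute(
        "SELECT cur_id FROM cur ORDER BY cur_random LIMIT 1"
    ).fetchone()
print(row[0])
```

This also makes concrete why duplicate cur_random values matter for this scheme: of several rows sharing one value, only the first in index order can ever be selected.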

wmahan_04 wrote:

(In reply to comment #10)

Also a note: we need to use a predetermined constant in Special:Randompage
because RAND() in a WHERE clause doesn't behave as expected (see comment I
added to that file a few days ago), however RAND() should work fine when
setting the cur_random on page save.

Thanks for the clarification. I committed my wfRandom() function to HEAD,
and changed Special:Randompage as well as the article inserts to use it.
That was easier than using RAND() for the page saves, because of the
database wrapper difficulty I mentioned in comment #6. Besides, mt_rand()
appears to have better randomness properties than MySQL's RAND(), so I
think it makes sense to be consistent and use mt_rand() for everything.

I don't think my change will hurt performance, since in my understanding,
none of the uses is performance-critical.

1.4 release imminent, resolving as fixed.
