Page MenuHomePhabricator

strange empty nodes & loops in the tree
Closed, ResolvedPublic

Description

Hi,

Really great extension, thank you very much for working on this.
I have a person template, identical to https://genealogy.toolforge.org/wiki/Template:Person (but translated to another language). And I have several people with parent relationships.

However, when I try to create a tree I get a bunch of empty nodes in it. I tried downgrading to 1.9.3, but it doesn't work with the mediawiki 1.35.0. I tried all releases of the extensions above 2.0.0 including dev-master, and get this weird graph.

The relevant part of the template looks like this:

!style="vertical-align:top; text-align:right"| Parents:
|{{#if: {{{parent1|}}} | {{#genealogy:parent | {{{parent1}}} }} }}<!--
-->{{#if: {{{parent2|}}} | <br />{{#genealogy:parent | {{{parent2}}} }} }}<!--
-->{{#if: {{{parent3|}}} | <br />{{#genealogy:parent | {{{parent3}}} }} }}

And I try to make a tree using:

{{#genealogy:tree |ancestors=Person1 | format=mermaid}}

But I do not see parent1 & parent2 in the tree. Instead some other "random" people. I do not understand where the loops and empty nodes come from.

Could it be that something is cached incorrectly? If not, are there any suggestions on how to fix this?

Thank you very much for your help.

Event Timeline

idavydov created this task.Nov 28 2020, 6:58 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 28 2020, 6:58 PM

I managed to isolate the problem. It seems that it ignores non-English letters in article names. See a minimal example here: http://testwiki.myths.ru/wiki/%D0%90

So if two genealogy entities have only consist of non-English letters, they all become somehow the same for the plugin. And it creates loops.

Here's a simple example. Let's create a page "А" (in Cyrillic). With the following content:

{{person
 | parent1 = Б
 | parent2 = В
}}

{{#genealogy:tree|ancestors=А|format=mermaid}}

Note that there are three different Cyrillic names. When we render the page we get a loop:

When using the same template, but with names X, Y, and Z instead of Cyrillic (А, Б and В), we get:

(see here http://testwiki.myths.ru/wiki/X )

idavydov added a comment.EditedDec 6 2020, 9:25 PM

The problem is how varId generates unique ids. The current code looks like this:

	protected function varId( $str ) {
		$var = preg_replace( '/[^a-zA-Z0-9_]/', '', str_replace( ' ', '_', $str ) );
		// Add a unique three-character suffix in case multiple input strings
		// normalize to the same output string above.
		$suffix = '_' . substr( md5( $var ), 0, 3 );
		return $var . $suffix;
	}

Meaning that all non-latin characters are removed, and then md5 has is computed on an empty string.

So all fully non-latin nodes will have the same ids.

Something like this should work fine:

protected function varId( $str ) {
        // Add a unique three-character suffix in case multiple input strings
        // normalize to the same output string above.
        $suffix = '_' . substr( md5( $str ), 0, 6 );
        $var = preg_replace( '/[^a-zA-Z0-9_]/', '', str_replace( ' ', '_', $str ) );
        return $var . $suffix;
}

Change 646110 had a related patch set uploaded (by Samwilson; owner: Samwilson):
[mediawiki/extensions/Genealogy@master] Calculate graph ID prefixes from original string

https://gerrit.wikimedia.org/r/646110

Samwilson claimed this task.Dec 7 2020, 1:35 AM
Samwilson added a subscriber: Samwilson.

Thanks for reporting this! You're right, the quick fix is to use the original string to calculate the suffix; I've made a patch for that. It looks like it'd be nicer to use a transliteration of the original string, but I don't know if there's a simple way to do that here.

Thanks you very much! Given that for Cyrillic alphabet all the strings are going to be empty, wouldn't it be safer to use 5-6 characters instead of 3? Are there any downsides to that?

With 3 characters and 20 people you get a ~5% probability of having a collision.

This code snippet from the documentation seem to do transliteration:

var_dump(iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", transliterator_transliterate('Any-Latin; Latin-ASCII',
    "A æ Übérmensch på høyeste nivå! И я люблю PHP! есть. fi"));
// string(50) "A ae Ubermensch pa hoyeste niva! I a lublu PHP! est'. fi"

Ah, thanks for pointing out the //IGNORE feature; I'd overlooked that, and in my tests was getting errors. I've updated the patch; see what you think. It's easy enough to add more characters to the suffix, but maybe now with transliteration it'll be okay?

There's not really any requirement to use the original string as the ID; we could instead just use a long hash, but for debugging it's sometimes useful to see some relation to the actual page names — maybe that's not worth the hassle?

Thanks, I applied the patch and it seems to work perfectly. I think 3 digits of hash is enough with transliteration enabled (for a typical use-case). If I run into problems I will open a bug report.

Thanks again.

Change 646110 merged by jenkins-bot:
[mediawiki/extensions/Genealogy@master] Calculate better graph IDs

https://gerrit.wikimedia.org/r/646110

Samwilson closed this task as Resolved.Dec 8 2020, 10:52 PM

Great, thanks! I've released version 2.1.1.