simplewiktionary provides wrong stats
Closed, Resolved, Public

Description

I tried to use the Compare feature of the Wiktionary Cognate Dashboard, setting frwikt as the source and simplewikt as the target. After the analysis, I get a list of entries that are supposed to exist on simplewikt and not on frwikt. However, that is not what the list contains. For example, the first entries of the list are "대신에", "Aako", "abasia". "대신에" and "Aako" do not exist on both frwikt and simplewikt while "abasia" exists on both Wiktionaries.

Event Timeline

Pamputt created this task. Sep 5 2018, 9:27 PM

@Pamputt As ever, thank you for your input.

"대신에" and "Aako" do not exist on both frwikt and simplewikt

Both 대신에 and Aako exist on simple wiktionary. There is a difference between (1) a page that does not exist, and (2) a page that exists but has no text.

while "abasia" exists on both Wiktionaries.

That is correct. I will check this out. @Lea_Lacroix_WMDE This is probably a minor bug; my suggestion is to wait until the next update before we decide to do anything here.

Hmmm, it seems there is something specific to simplewikt. Indeed, if I go here, I clearly see that "대신에" does not exist (red link). The same goes for "Aako". I do not know why simplewikt behaves differently from other Wiktionaries.

@Pamputt I have just checked the abasia entry. As I assumed, the dashboard "self-repaired" after one or two update cycles.

Namely, by comparing simplewikt as source and frwikt as target, we now find: abasias (please note the difference), which is really present in simplewikt and really absent from frwikt.

What must have happened is that (a) someone was editing, while (b) the change had not yet been picked up by the dashboard, because the update cycle (it runs every six hours, plus two hours or so of computation and file migration) had not reached it yet.

Pamputt added a comment. Edited Sep 6 2018, 6:22 PM

@GoranSMilovanovic indeed, the update has fixed some mistakes. However, I still think there is a problem with simplewikt. In the new update, the list is "abasias", "abbottonare", "abcédographie". "abasias" was correct (I have since created the page on frwikt). But "abbottonare" does not exist on simplewikt but do exist on frwikt. The same for "abcédographie" (simplewikt and frwikt).
The fourth word, "abdicable", is correct. It does exist on simplewikt and not on frwikt...

@Pamputt From the instructions on the top of the Compare page:

The Dashboard will generate a table of all entries found in the Target, but not in the Source Wiktionary.

Now, we use source = simplewiktionary and target = frwiktionary, and expect to receive entries that are present in the frwiktionary but not in the simplewiktionary.
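The rule quoted from the Compare page amounts to a set difference. A minimal sketch, assuming each Wiktionary is reduced to a plain set of entry titles (the function name and the title lists are hypothetical, for illustration only, and are not the dashboard's actual code or data):

```python
def missing_from_source(source_titles, target_titles):
    """Return titles found in the target but not in the source, sorted."""
    return sorted(set(target_titles) - set(source_titles))


# Hypothetical title lists, for illustration only (not real dashboard data).
simplewiktionary = {"abdicable", "cat"}
frwiktionary = {"abbottonare", "abcédographie", "cat"}

# source = simplewiktionary, target = frwiktionary
print(missing_from_source(simplewiktionary, frwiktionary))
```

Swapping the two arguments reverses the direction of the comparison, which is why the choice of source and target matters when reading the result list.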

But "abbottonare" does not exist on simplewikt but do exist on frwikt.

Yes, that is what the instructions say we should expect to learn: entries present in the target (frwiktionary, in this case) but not in source (simplewiktionary in this case).

The same for "abcédographie" (simplewikt and frwikt).

Ibid.

@Pamputt Also, I have checked for: abotonaré. It does exist on the fr.wiktionary, and not on the simple.wiktionary. Which is exactly what the dashboard delivers when asked to compare for a source and a target Wiktionary: what is found in the target, but not found in the source. Hope this helps.

Once again: thanks for testing!

Actually, sorry for the mix-up, but I set frwikt as source and simplewikt as target. Just now, I get the following list: "aabi", "abasie", "abdicable", "abdiceerden", "abduceretur", "abdukcji", ...
From this list, only "abdicable" exists on simplewikt. None of the other words do ("aabi", "abasie", "abdiceerden", "abduceretur", "abdukcji"). Meanwhile, "abasie" and "abdiceerden" exist on frwikt (although frwikt is the source). And "aabi", "abduceretur" and "abdukcji" exist on neither simplewikt nor frwikt.

Hi @Pamputt

Of course they all exist.

Once again, there is a difference between (a) the existence of a page, and (b) the existence of the content of the page.

All of the pages that you refer to exist on simple.wiktionary, and their content is:

There is currently no text in this page. You can look for this page title in other pages, search the logs, or change this page.

If the page exists (with or without any useful content), the Wiktionary Cognate Dashboard (as well as the Cognate extension that the dashboard partially relies on) will pick it up.
The dashboard's update engine does not inspect the content of the pages.

That being said, I would agree with your observation that there is "a problem" with simple.wiktionary, or - if you allow me to rephrase the statement - they are doing things in a way that currently seems pretty confusing to us, but maybe they know why they're doing things in that way.

I still think that these pages do not exist. I do not know where this content comes from, but for sure it is not part of the page. For "aabi", for example, we can see this content here. Yet, from here, we clearly see that the page is a red link.
Moreover, we can see this text on all pages that do not exist. For example aaaa, aaaaa, ... all have the same content and exist on neither simplewikt nor frwikt. However, they are not listed by the Wiktionary Cognate Dashboard, although they are exactly the same case as "aabi". So if you are right, I would expect to get a lot of pages present in simplewikt and not in frwikt. However, this is not the case.
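The distinction debated above, a page that exists versus a title that only shows placeholder text, can be checked without relying on the rendered page. A minimal sketch, assuming the MediaWiki Action API is queried with `action=query&titles=...&format=json&formatversion=2`, in which case titles without a page carry a `missing` flag. The response below is hand-written sample data in that shape (the `pageid` is made up), and whether the dashboard's update engine consults the same information is not stated in this thread:

```python
import json

# Hand-written sample in the shape of an action=query response with
# formatversion=2. Illustrative data only, not real API output.
SAMPLE_RESPONSE = json.loads("""
{"query": {"pages": [
  {"title": "aabi", "ns": 0, "missing": true},
  {"title": "abdicable", "ns": 0, "pageid": 12345}
]}}
""")


def page_status(response):
    """Map each queried title to True if a page exists, False if it is missing."""
    return {
        page["title"]: not page.get("missing", False)
        for page in response["query"]["pages"]
    }


print(page_status(SAMPLE_RESPONSE))
```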

Lea_Lacroix_WMDE added a comment. Edited Sep 7 2018, 7:04 AM

@Pamputt Could it be a difference between pages that never existed, and pages that have been created in the past and then deleted? Since I don't have admin rights on simplewikt, I can't check the deletion history.

@Lea_Lacroix_WMDE it seems that aaa on simplewikt is one of these pages. We clearly see that it has been deleted in the past. This is not the case for other pages, such as "aabi".

GoranSMilovanovic added a comment. Edited Sep 7 2018, 10:49 AM

@Pamputt I think you need to take into consideration that the dashboard's update cycle takes a while (see: T203609#4563662).

Today, CET 12:45 approximately:

source: frwiktionary
target: simplewiktionary

Expected result: entries present in simple.wiktionary but not in fr.wiktionary.
Result: none of the following is present in the result set: aaa, aaaa, aaaaa, aabi.

The dashboard's update cycle:

  1. A user makes a change;
  2. The change needs to be propagated to (a) the Cognate extension database, and (b) the Page table of the respective Wiktionary's database;
  3. The dashboard's update engine (running every six hours and taking approx. two hours to complete) needs to register the change in the databases;
  4. The dashboard's front-end (what you are using) needs to update (it attempts to update every hour, but updates only when a complete update engine run is registered).

It is difficult, if not impossible, to estimate precisely when any given change will take effect on the dashboard. The reason is that many different systems (two MW databases, the dashboard's update engine script, and the dashboard's front-end sync procedure) run asynchronously (and necessarily so). My approximate calculations suggest that changes become visible in some eight hours, but bear in mind that this is a rough estimate only.
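The figures given in the steps above can be combined into rough bounds. A minimal sketch, assuming an engine run only sees changes already in the databases when it starts, that database propagation time is negligible, and that the front-end picks up a finished run at its next hourly check (the function and constant names are hypothetical, and this is not the dashboard's actual scheduling code):

```python
ENGINE_PERIOD_HOURS = 6.0    # update engine starts every six hours
ENGINE_RUN_HOURS = 2.0       # approximate duration of one engine run
FRONTEND_PERIOD_HOURS = 1.0  # front-end attempts an update every hour


def visibility_delay_hours(wait_for_engine, wait_for_frontend=0.0):
    """Hours from an edit until it shows on the dashboard, under the assumptions above.

    wait_for_engine:   hours between the edit and the next engine run start.
    wait_for_frontend: hours between the end of that run and the next front-end check.
    """
    return wait_for_engine + ENGINE_RUN_HOURS + wait_for_frontend


# Best case: the edit lands just before an engine run, and the front-end
# checks right as the run finishes.
best = visibility_delay_hours(0.0)

# Worst case: the edit lands just after an engine run started, and the
# front-end has just checked when the following run finishes.
worst = visibility_delay_hours(ENGINE_PERIOD_HOURS, FRONTEND_PERIOD_HOURS)

print(f"between {best} and {worst} hours")
```

Under these assumptions the rough estimate of eight hours falls inside the computed range; real delays may differ because propagation to the databases is ignored here.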

P.S. aabi is present neither in fr.wiktionary nor in simple.wiktionary on the Wiktionary Cognate Dashboard, and the situation on the Wiktionaries themselves is the same:

https://simple.wiktionary.org/wiki/aabi
https://fr.wiktionary.org/wiki/aabi

@GoranSMilovanovic there was indeed probably a mismatch with previous updates. Now (18:00 UTC), the first ten entries of the list are correct (present in simplewikt and not in frwikt). There are no "empty" entries anymore. Everything looks good.

Hello,

How many Wiktionaries have the entry "ашәахь"?

According to https://fr.wiktionary.org/wiki/%D0%B0%D1%88%D3%99%D0%B0%D1%85%D1%8C
I would answer: only three.

But the "I miss You" tool gives us 17, excluding the en one.

Lydia_Pintscher closed this task as Resolved. Jan 7 2019, 10:37 AM

I'm going to mark this as resolved since it seems the issue with simplewiktionary has been resolved. If there are other issues, please open specific new tickets <3