Provide statistics about Cognate on Wiktionary
Closed, Resolved · Public

Description

A few requests from users who would like to access statistics about the usage of Cognate on Wiktionaries:

  • most interlinked entries not having a page on your own wiki
  • matrix of the number of interwikis between each possible pair of Wiktionaries

This task has been achieved through the Wiktionary Cognate Dashboard.
New feature requests can be tracked with T202610.

Restricted Application added a subscriber: Aklapper. · May 29 2017, 7:58 AM
Addshore updated the task description. · Aug 31 2017, 10:47 AM
Pamputt added a subscriber: Pamputt. · Jan 7 2018, 5:30 PM
Noe added a subscriber: Noe. · Mar 8 2018, 3:38 PM

Hey @GoranSMilovanovic, here's a bit more context about these two requests.

Reminder about Wiktionaries: Wiktionaries describe all the words in all the languages. There are currently 173 language versions, each having their own community and set of pages, called entries. Wiktionaries describe words coming from their own languages as well as other languages. Example: en:pain.

Basics about Cognate: Cognate provides automatic links between two pages of different language versions of Wiktionary that have the same title (including a few normalization rules). So for example, fr:tree and en:tree. On these pages, the automatic interwikilinks can be found in the menu column. For more information about where and how exactly the information is stored, better ask @Addshore.

Request 1: most interlinked entries not having a page on your own wiki
The idea here is to give the editors of a language version some ideas on what new pages to create on their home wiki. So, if I'm editing French Wiktionary, I'd be interested in the words (whatever the language) that already have a page on many other Wiktionaries, but not the French one. Those are probably the most interesting/useful ones. That's why I want a list of the entries that already exist in a lot of languages, but not mine.
In terms of interface, this could start with selecting a language, and then having a list of 50 words, with the number of other Wiktionaries having an entry, but no entry in the selected language.
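
To make Request 1 concrete, here is a minimal Python sketch of the ranking it describes. It is not the dashboard's code: the function name `most_shared_missing`, the in-memory `titles_by_wiki` mapping and the sample titles are all assumptions made for illustration, and titles are compared as plain strings rather than through Cognate's normalization rules.

```python
from collections import Counter


def most_shared_missing(titles_by_wiki, home_wiki, top_n=50):
    """Rank titles absent from home_wiki by how many other wikis have them."""
    home_titles = titles_by_wiki.get(home_wiki, set())
    counts = Counter()
    for wiki, titles in titles_by_wiki.items():
        if wiki == home_wiki:
            continue
        for title in titles:
            if title not in home_titles:
                counts[title] += 1
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical sample data: one set of page titles per Wiktionary.
titles_by_wiki = {
    "frwiktionary": {"arbre", "pain"},
    "enwiktionary": {"tree", "pain", "water"},
    "dewiktionary": {"tree", "Wasser", "water"},
    "ruwiktionary": {"tree", "pain"},
}

missing = most_shared_missing(titles_by_wiki, "frwiktionary", top_n=3)
for title, n_wikis in missing:
    print(title, n_wikis)
```

In this toy data "pain" never appears in the result because frwiktionary already has it, while "tree" ranks first because three other Wiktionaries have it.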

Request 2: matrix of the number of interwikis between each possible pair of Wiktionaries
The goal here is to have an overview of the connections between the Wiktionaries, by displaying, for each pair of Wiktionaries, the number of pages that are connected with each other through Cognate.
The format here would probably be a very big table with the 173 language versions as rows and columns, showing the number of connections for each pair. Alternatively, one could select two languages in two fields and then have the number of connections displayed.
Note: these statistics probably don't have to be updated in real time. One update per month should be enough, that's the frequency of the other stats updates from French Wiktionary.
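
As a rough illustration of Request 2, the following Python sketch counts, for every pair of Wiktionaries, the titles they have in common. The function name `pairwise_link_counts` and the sample data are hypothetical, and shared raw titles stand in for Cognate links here; the real links are matched on normalized titles.

```python
from itertools import combinations


def pairwise_link_counts(titles_by_wiki):
    """Return {(wiki_a, wiki_b): number of titles present in both}."""
    counts = {}
    for a, b in combinations(sorted(titles_by_wiki), 2):
        shared = len(titles_by_wiki[a] & titles_by_wiki[b])
        # The relation is symmetric, so fill both cells of the matrix.
        counts[(a, b)] = shared
        counts[(b, a)] = shared
    return counts


# Hypothetical sample data: one set of page titles per Wiktionary.
titles_by_wiki = {
    "frwiktionary": {"arbre", "pain"},
    "enwiktionary": {"tree", "pain", "water"},
    "dewiktionary": {"tree", "Wasser", "water"},
    "ruwiktionary": {"tree", "pain"},
}

matrix = pairwise_link_counts(titles_by_wiki)
print(matrix[("enwiktionary", "ruwiktionary")])
```

With 173 language versions the full matrix has 173 × 172 off-diagonal cells, which is why a two-field selector may be easier to read than the whole table.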

If anything is still unclear or if you have questions, let me know!

@Lea_Lacroix_WMDE You can begin to test your Wiktionary Cognate Dashboard.

NOTES:

  • What is now present in the Dashboard should provide for your Request 2.
  • We can certainly make this a daily update - or even go for an hourly update, I would say. However, given that your Request 1 is a bit tricky in terms of data engineering (and that is why I've decided to solve the second problem first), let me finish the code that provides for it and then I will let you know of the update schedule. Fine?

Please let me know whether the current version of the Dashboard does everything that you would expect in respect to Request 2. Thank you!

GoranSMilovanovic added a comment. · Edited · Jul 8 2018, 12:13 AM

@Lea_Lacroix_WMDE

Your Wiktionary Cognate Dashboard is ready.

  • The new I miss you tab should provide for your Request 1: users can select a Wiktionary and learn what are the 1,000 most shared entries across other Wiktionaries that their project does not encompass yet.

Please review this dashboard and let me know if everything is Ok.
When you give me a go, I will contact Analytics-Engineering for a public data review process and then put the dashboard on regular daily updates.

Thank you!

@GoranSMilovanovic very nice work.
In addition to Lea's request, would it be possible to list all the articles in a given language that exist on another Wiktionary edition but not in the edition of this language? For example, it would be nice to have a list of all French words that exist in other Wiktionary editions, but not in the French Wiktionary.

Pamputt added a comment. · Edited · Jul 8 2018, 9:41 AM

@GoranSMilovanovic one comment about "I miss you". Would it be possible to remove the "Main_Page" and "main_Page" from the page title? They are present in most of the Wiktionary editions: fiwiktionary, frwiktionary, gnwiktionary and so on.

In addition for the frwiktionary list, why "1", "0", "-a" and "10" are listed when they already exist on the French Wiktionary?

GoranSMilovanovic added a comment. · Edited · Jul 8 2018, 10:09 AM

@Pamputt No worries, will do. The reason why Main_Page and main_page are present there at all is that, for technical reasons related to consistency checks, a part of the dashboard's update engine still runs across all Wiktionary page SQL tables and not the cognate table. It will be fixed in the blink of an eye.

Now, as of the following Q:

In addition for the frwiktionary list, why "1", "0", "-a" and "10" are listed when they already exist on the French Wiktionary?

the answer is: I don't know, and thank you for noticing this. I will have to inspect this and find out where the catch is.

@Pamputt the Main_Page and all its lower/upper case variants are removed.

In respect to your second question

... why "1", "0", "-a" and "10" are listed when they already exist on the French Wiktionary?

currently, these entries are not found in the page table of the frwiktionary database. That is the reason why the dashboard believes that you need to have these items on your Wiktionary. Now, there is little I can do about it: as a data scientist I work with the data sets that are generated on behalf of our developers and data engineers. I will switch the system back to work with the cognate database in the following days, but I do not know if the problem persists there or not. If it does reoccur, we will need to contact the people who generate these SQL databases and point their attention to this. Once again, thank you for your comments and your kind words.

MarcoSwart added a comment. · Edited · Jul 9 2018, 8:28 PM

@GoranSMilovanovic Thanks a lot! The "I miss you"-page is what I hoped for. Would it be possible to give the dashboard a format that could be transcluded in a Wiktionary page?

@GoranSMilovanovic very nice work.
In addition to Lea's request, would it be possible to list all the articles in a given language that exist on another Wiktionary edition but not in the edition of this language? For example, it would be nice to have a list of all French words that exist in other Wiktionary editions, but not in the French Wiktionary.

This would be a useful option too.

@MarcoSwart Hi. You are welcome!

Would it be possible to give the dashboard a format that could be transcluded in a Wiktionary page?

I am not really sure how to accomplish that. The dashboard is developed in R and RStudio Shiny. The latter is an R framework for reactive design that sits on top of JS Bootstrap. An <iframe> would do, I guess :) - Sounds more like a task for MediaWiki people.

This would be a useful option too.

What would be a useful option too? : ) Could you please explain in more detail what functionality would you like to have that goes beyond the I miss you page? Thank you!

@GoranSMilovanovic I was referring to the additional request by Pamputt. It could be realised as a kind of filter on the results of "I miss you" showing only links to pages that contain a word in the language of the selected wiktionary. But I don't know whether this information is accessible in the Cognate database, because the interwikilinks only need the page title, not the page content.

The My Wiktionary tab is interesting. Presently it only shows the absolute number of links. To me it would be useful if this was also expressed as a percentage of the number of "Good pages" in [http://wikistats.wmflabs.org/display.php?t=wt this table]: sometimes a high number is mainly the result of the target wiktionary having many pages, sometimes it really has to do with a strong overlap between projects, which would be indicated by differences in the percentages.

The "Hub tabs" are a little overcrowded. I suspect that most of the lines all are connected to the same top or bottom wiktionaries by size. Maybe it would be more informative if we could adjust for the size of the wiktionaries involved. What I forgot in my earlier remarks: wiktionaries are not just different in size, but they may also differ in the number of pages without any wikilink. On the other hand: although it is nice to know, I don't see a real practical use for this type of information anyway. Maybe we should just enjoy the pictures.

@MarcoSwart

Maybe it would be more informative if we could adjust for the size of the wiktionaries involved.

We can do that. By what parameter would you suggest to adjust the size of the nodes?

wiktionaries are not just different in size, but they may also differ in the number of pages without any wikilink.

That is exactly what the size of the nodes on Hubs and Anti-hubs represents in the present version of the dashboard (read carefully the description on the top of the page).

Thanks for your suggestions.

Finally the Links Dataset. I understand why the lines are duplicated. Unfortunately the search function operates on all columns, making it difficult to create useful lists.

  1. Assuming that the search-function is able to discern interpunction symbols, would it be possible to add a colon to all the language codes in the first column and a period to the codes in the second column (e.g. en: and en.)? It would still be possible to search on the present two or three letter codes, but by adding the colon or the period you could select "clean" lists.
  2. To expand on this proposal: some of the two letter codes are part of the three letter codes. So it is impossible to select 'an' without selecting 'ang' and 'zh_min_nan' too. Using a semicolon and comma for codes having over two letters would create the possibility to select 'an:' without selecting 'zh_min_nan;'.
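
The substring problem behind these two proposals can be reproduced in a few lines of Python. This is only a sketch of the idea as stated above: the helper `tag` and the short list of codes are made up for illustration, and the dashboard's own search may behave differently.

```python
codes = ["an", "ang", "en", "zh_min_nan"]

# A plain substring search for "an" also hits longer codes.
substring_hits = [c for c in codes if "an" in c]

# Proposal 1 alone (a colon after every code) is not enough,
# because "zh_min_nan:" still ends in "an:".
colon_only = [c + ":" for c in codes]
colon_hits = [c for c in colon_only if "an:" in c]


def tag(code):
    """Proposal 2: ':' after two-letter codes, ';' after longer ones."""
    return code + (":" if len(code) <= 2 else ";")


tagged = [tag(c) for c in codes]
tagged_hits = [c for c in tagged if "an:" in c]

print(substring_hits)
print(colon_hits)
print(tagged_hits)
```

Only the last search returns a single, clean match for the two-letter code.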

@GoranSMilovanovic, thanks for your replies. I now have a better understanding of the hub-tabs. After some pondering I feel the most useful parameter would be: the number of links divided by the total number of links for the target wiktionary. In connection with this approach, it would be better not to add a percentage of the total number of pages for the target wiktionary but present the percentage of the total number of links instead. So I amend my original suggestion made Jul 9, 11:03 PM. This would make the different presentations more consistent with one another. The number of pages without any wikilinks often has to do with specific choices made by a wiktionary, like putting inflected forms or different spellings or scripts on separate pages or not. These choices are not very relevant for interwikilinks and when comparing numbers of interwikilinks I expect more meaningful results when we ignore them.
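
One possible reading of this suggestion, sketched in Python: for each source/target pair, divide the number of links by the total number of links that point at the target. The function name `link_shares`, the short wiki labels and the counts are invented for illustration, and the thread does not confirm that this is the intended formula.

```python
def link_shares(link_counts):
    """Map (source, target) -> links / total links for that target."""
    totals = {}
    for (_source, target), n in link_counts.items():
        totals[target] = totals.get(target, 0) + n
    return {
        (source, target): n / totals[target]
        for (source, target), n in link_counts.items()
        if totals[target] > 0
    }


# Hypothetical link counts between three Wiktionaries.
link_counts = {
    ("fr", "en"): 300,
    ("de", "en"): 100,
    ("fr", "de"): 50,
}

shares = link_shares(link_counts)
print(shares[("fr", "en")])  # share of all links into "en" that come from "fr"
```

Expressed this way, a large absolute count into a large target no longer dominates: what matters is how much of the target's linking comes from each source.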

GoranSMilovanovic added a comment. · Edited · Jul 14 2018, 11:37 PM

@Addshore Please note: in the cognate_wiktionary database,

  • in the cognate_titles table, the cgti_normalized_key is not unique; after loading the table into R:

    length(cognateTitles$cgti_normalized_key)
    [1] 18372021

    but

    length(unique.integer64(cognateTitles$cgti_normalized_key))
    [1] 18366887

  • on Saturday, July 14, in the late evening, I found the following anomaly in the cognate_pages table: an entry entitled utställningsårens, which translates as "exhibition years" from Swedish (or at least Google Translate says so), had 14092 appearances; its raw_key is -4503413649379509.
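
The two lengths reported above differ by 18372021 - 18366887 = 5134, so 5134 rows share a normalized key with some other row. For readers who want to find which keys repeat rather than only how many, here is a small Python sketch of the same check on toy data; the function name `duplicated_keys` and the sample keys are made up, and it does not touch the real cognate_titles table.

```python
from collections import Counter


def duplicated_keys(keys):
    """Return {key: occurrences} for every key that appears more than once."""
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical stand-in for the cgti_normalized_key column.
keys = [101, 102, 102, 103, 103, 103]

dupes = duplicated_keys(keys)
surplus = len(keys) - len(set(keys))  # same arithmetic as the two lengths above

print(dupes)
print(surplus)
```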

GoranSMilovanovic added a comment. · Edited · Jul 17 2018, 10:50 PM

@Pamputt @MarcoSwart Please let me summarize the current developments here.

@Pamputt

would it be possible to list all the articles in a given language that exist on another Wiktionary edition but not in the edition of this language.

If you mean a list of entries not present in some particular Wiktionaries, that is already found on the My Wiktionary tab. If you mean the possibility to show what entries are present in, say, Russian Wiktionary, but not available in French, and so for all language pairs: that is not possible at the moment. Theoretically we can do it, however, the implementation of that feature would call for a much more resourceful Dashboard back-end, and we must be savvy in that respect.

for the frwiktionary list, why "1", "0", "-a" and "10" are listed when they already exist on the French Wiktionary?

In the end, I have learned that this is an anomaly related to frwiktionary only. Given that this is an isolated case and not a consequence of a systematic error in our data acquisition procedures, I will wait until I learn more from our data engineering before I address this issue.

@MarcoSwart

Finally the Links Dataset. I understand why the lines are duplicated. Unfortunately the search function operates on all columns, making it difficult to create useful lists.

The implemented solution is the following, and the changes are visible on the (1) My Wiktionary and (2) Links Dataset tabs:

  • All language editions now have wiktionary back in their names, so that en is now enwiktionary, as well as zh is zhwiktionary, etc; also, everything in the Target column begins with "> ", e.g. "> enwiktionary", "> zhwiktionary", to enable more control of the search functionality;
  • The table on the Links Dataset tab now has column specific filters implemented in the header - making it easy to produce a particular Wiktionary specific data set.
  • The full links dataset from the Links Dataset tab can be downloaded as .csv;
  • Any wiktionary-specific links dataset can be downloaded from the My Wiktionary tab.

The dataset generated on the I Miss You tab can also be downloaded now.

Finally, as of

After some pondering I feel the most useful parameter would be: the number of links divided by the total number of links for the target wiktionary. In connection with this approach, it would be better not to add a percentage of the total number of pages for the target wiktionary but present the percentage of the total number of links instead. So I amend my original suggestion made Jul 9, 11:03 PM. This would make the different presentations more consistent with one another. The number of pages without any wikilinks often has to do with specific choices made by a wiktionary, like putting inflected forms or different spellings or scripts on separate pages or not. These choices are not very relevant for interwikilinks and when comparing numbers of interwikilinks I expect more meaningful results when we ignore them.

I really need to read through this a few more times in order to see if I understand the nature of the desired feature at all. Stay in touch.

@Lea_Lacroix_WMDE Preparations to put this Dashboard on regular updates in production: T199851

@GoranSMilovanovic Hey Goran, sorry for not reviewing this before. This looks amazing!

I'd suggest to start working on a documentation page, so people who are not one of those who requested these features, can also understand what it is about. I'm currently at the Wikimania hackathon running the documentation corner, so I'd have time in the 2 next days to help you with that.

Also, is there a way that volunteers help you translating the interface of the board in their language? I expect some communities (like the French, but not only) to have the wish to get the board in their language.

@Lea_Lacroix_WMDE Thank you for your kind words, Lea!

I'd suggest to start working on a documentation page, so people who are not one of those who requested these features, can also understand what it is about.

No worries, I'm on it. Our data products receive two documentation pages, both on Wikitech: a "user-manual oriented" one, and a purely technical documentation. It shouldn't take me too much time to have these two produced for the Cognate Dashboard.

I'm currently at the Wikimania hackathon running the documentation corner, so I'd have time in the 2 next days to help you with that.

Thank you. I will feel free to ask for a review of the documentation pages once I have them completed.

Also, is there a way that volunteers help you translating the interface of the board in their language? I expect some communities (like the French, but not only) to have the wish to get the board in their language.

Any volunteer help will be highly appreciated in that respect.

@GoranSMilovanovic Great!
My question about translation was not the clearest, sorry. I meant: how can people suggest some translations for the interface? Is there a way that they can do it on their own, like on translatewiki for the Wikimedia projects, or they have to contact you and provide the content?

@Lea_Lacroix_WMDE They have to contact me. Even better, when I complete the Dashboard documentation, they can suggest the translations for the interface there.

@GoranSMilovanovic If you put the documentation on a wikipage, it will be easily translatable indeed :) Since it's not really Wikidata-related, I'd suggest Meta for this one.

Pamputt added a comment. · Edited · Jul 18 2018, 4:55 PM

@Pamputt

would it be possible to list all the articles in a given language that exist on another Wiktionary edition but not in the edition of this language.

If you mean a list of entries not present in some particular Wiktionaries, that is already found on the My Wiktionary tab. If you mean the possibility to show what entries are present in, say, Russian Wiktionary, but not available in French, and so for all language pairs: that is not possible at the moment. Theoretically we can do it, however, the implementation of that feature would call for a much more resourceful Dashboard back-end, and we must be savvy in that respect.

Let me explain with some examples.
Let us say that I am interested in the French Wiktionary. What I would like to know is all the French entries that are in the Russian Wiktionary and not yet present in the French Wiktionary. So for example, if AAA is a French word but there is not yet any article on the French Wiktionary for this word, I would like to have a list of the other Wiktionaries that have this word in French (I want to know for French in particular). So if AAA is present in the Russian Wiktionary for describing a word in German, I do not want to have this word in my list.
So basically, from the My Wiktionary tab, if I consider the pair frwikt and ruwikt, I would like to have the list of all the articles that already have an article on ruwikt, but not yet on frwikt (currently the My Wiktionary tab only gives me a number, not a list of all these articles). And once I have the list, I would like to be able to filter it to have only the French articles from ruwikt that do not have an article on frwikt.
So what I would like is probably closer to "I miss you" than to "My Wiktionary".
In "I miss you", it would mean to add a filter to select only the pages from ruwikt that have an entry in French (and so hide all the articles that are not described in French).
I hope it is clearer now, otherwise, ask me more details.

for the frwiktionary list, why "1", "0", "-a" and "10" are listed when they already exist on the French Wiktionary?

In the end, I have learned that this is an anomaly related to frwiktionary only. Given that this is an isolated case and not a consequence of a systematic error in our data acquisition procedures, I will wait until I learn more from our data engineering before I address this issue.

OK, let us see what the data engineers say. Thanks for the investigation.

@Lea_Lacroix_WMDE They have to contact me. Even better, when I complete the Dashboard documentation, they can suggest the translations for the interface there.

It is not very clear how I can propose translation for https://wdcm.wmflabs.org/Wiktionary_CognateDashboard/ here.
Am I supposed to copy all the strings that appear in the interface and propose a translation for each of them? Is there any place where we can see all the translatable strings?

@Pamputt in order to check the French words existing on other Wiktionaries, I feel like we'll run into the same issues as for T150841, the fact that the Wiktionaries don't structure their language titles in the same way.

@Pamputt @Lea_Lacroix_WMDE

The Dashboard now has a Compare tab. Users need to select a source and a target Wiktionary, and then click the Generate button to produce a data set. For large Wiktionaries this will take a while, but a progress bar is implemented in the lower right corner to provide user feedback. The output is a table with all entries found in the Target but not in the Source Wiktionary. @Pamputt This is what I can currently do in response to T166487#4434776. I think we would certainly face many problems in an attempt to additionally filter for only those entries that are given in the Source language from the Target Wiktionary - I am not even aware if we have a suitable data set for such filtering.
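
The direction of the comparison is easy to get backwards, so here is a minimal Python sketch of "found in the Target but not in the Source". The function name `compare` and the sample titles are hypothetical; the dashboard itself is written in R and its actual routine is not shown in this thread.

```python
def compare(source_titles, target_titles):
    """Return the sorted titles that the target has and the source lacks."""
    return sorted(set(target_titles) - set(source_titles))


# Hypothetical sample data.
frwiktionary = {"arbre", "pain", "tree"}
ruwiktionary = {"tree", "pain", "вода", "дом"}

# Source = frwiktionary, target = ruwiktionary:
# what could French editors pick up from the Russian project?
missing_in_source = compare(frwiktionary, ruwiktionary)

# Swapping the arguments answers the opposite question.
missing_in_target = compare(ruwiktionary, frwiktionary)

print(missing_in_source)
print(missing_in_target)
```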

Once in production (waiting for T199851), the Dashboard will need to download the files for comparison to the client side. The prototype that you can test makes use of these files locally, but in production that will not be possible. In effect, that means that the users of the Compare functionality with slow Internet connections will have to be really patient. I have already mentioned that comparisons of this sort lead to complications on the Dashboard back-end. However, I fully understand the importance of this functionality. The current implementation is the fastest one that we can have right now. For small to moderately sized Wiktionaries it delivers in the blink of an eye; for Wiktionaries with many millions of entries, well... computers need some time to load and compute the data.

Let me know what you think.

GoranSMilovanovic added a comment. · Edited · Jul 23 2018, 1:52 PM

@Pamputt @Lea_Lacroix_WMDE

Additionally, the comparison table generated on the Compare tab now has URL links towards the respective entries in the Target Wiktionary that are missing from the Source Wiktionary. I thought this would help the work of those users who would use the Dashboard to pick up the missing entries from other Wiktionaries.
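
When building such links, titles made of punctuation (for example "?") need percent-encoding, otherwise the browser treats part of the title as a query string. The Python sketch below shows one way to build an entry URL from a database name and a title; the helper `entry_url` is hypothetical, it assumes the usual pattern of database name = language code + "wiktionary" with underscores becoming hyphens in the host name, and it is not taken from the dashboard's source.

```python
from urllib.parse import quote

SUFFIX = "wiktionary"


def entry_url(wiki_db, title):
    """Build an entry URL such as https://fr.wiktionary.org/wiki/-a."""
    if not wiki_db.endswith(SUFFIX):
        raise ValueError(f"not a Wiktionary database name: {wiki_db!r}")
    # Assumption: zh_min_nanwiktionary lives at zh-min-nan.wiktionary.org.
    lang = wiki_db[: -len(SUFFIX)].replace("_", "-")
    # Spaces become underscores in page URLs; quote() escapes the rest.
    path = quote(title.replace(" ", "_"))
    return f"https://{lang}.wiktionary.org/wiki/{path}"


print(entry_url("cawiktionary", "??"))
print(entry_url("frwiktionary", "-a"))
print(entry_url("zh_min_nanwiktionary", "tree"))
```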

@GoranSMilovanovic I tried to test the Compare tab with frwiktionary as a source with some troubles. For example, with frwiktionary as source and aawiktionary as target, I get the following error message:

An error has occurred. Check your logs or contact the app author for clarification.

Comparing frwiktionary and cawiktionary, I get as first result the word "??". This article exists on cawiktionary but the link given by Compare redirects on the main page of cawiktionary. Could you have a look on what is going wrong?

Pamputt added a comment. · Edited · Jul 23 2018, 6:16 PM

For comparisons of big Wiktionaries, I do not manage to generate the list. For example, if I select frwiktionary as a source and enwiktionary as a target, I systematically get a message on the bottom left saying "Disconnected from the server. Reload". Sometimes I get a popup with the following message (during the "Compring Wiktionaries" step):

DataTables warning: table id=DataTables_Table_8 - Ajax error. For more information about this error, please see http://datatables.net/tn/7

Is it expected?

If I compare frwiktionary (source) and dewiktionary (target), the first link is "-". There is a problem because frwiktionary also has an entry for "-". Could you check?

@Pamputt Thanks for testing!

For example, with frwiktionary as source and aawiktionary as target, I get the following error message: An error has occurred. Check your logs or contact the app author for clarification.

Indeed, because it turns out that aa.wiktionary.org has been closed and is empty. However, the corresponding database is not removed yet, so I will have to remove it "manually" from all future updates.

Comparing frwiktionary and cawiktionary, I get as first result the word "??". This article exists on cawiktionary but the link given by Compare redirects on the main page of cawiktionary. Could you have a look on what is going wrong?

Fixed.

For comparisons of big Wiktionaries, I do not manage to generate the list. For example, if I select frwiktionary as a source and enwiktionary as a target, I systematically get a message on the bottom left saying "Disconnected from the server. Reload". Sometimes I get a popup with the following message (during the "Compring Wiktionaries" step): DataTables warning: table id=DataTables_Table_8 - Ajax error. For more information about this error, please see http://datatables.net/tn/7 Is it expected?

Well, I would say the most appropriate answer would be: it is not unexpected :). First of all: never mind the exact content of these error messages - they all indicate one and the same thing. Namely: the comparisons of large Wiktionaries take a lot of time, and while that happens, it is possible that the Shiny Server - the technology we use to host this dashboard - simply kicks the user out of the session because of inactivity. I will see what I can do to extend the Shiny Server's patience in that respect. However, it will not always happen even under the current settings (it happened only once to me, while I have performed tens of tests with large Wiktionaries). So please, until I let you know here that I have managed (or not) to change the respective server parameters, and if you need to work with this Dashboard immediately: if you encounter these error messages again, just reload the Dashboard and try again. In the meantime, I will do my best to eliminate the cause of the problem.

If I compare frwiktionary (source) and dewiktionary (target), the first link is "-". There is a problem because frwiktionary also has an entry for "-". Could you check?

Yes, there is a problem. I will need some time to find out exactly why this happens.

Again: thank you very much for testing.

GoranSMilovanovic added a comment. · Edited · Jul 23 2018, 7:40 PM

@Pamputt Could you please test again in respect to the following:

For comparisons of big Wiktionaries, I do not manage to generate the list. For example, if I select frwiktionary as a source and enwiktionary as a target, I systematically get a message on the bottom left saying "Disconnected from the server. Reload". Sometimes I get a popup with the following message (during the "Compring Wiktionaries" step): DataTables warning: table id=DataTables_Table_8 - Ajax error.

With the open (free) version of the Shiny Server we cannot change the session timeout parameter. However, I've managed to find a trick that should do. The session timeout is now extended to ridiculous ten minutes (no comparison will ever take so much time) and the problem should not resurface.

NOTE. In addition, the DataTables warning can be just ignored - I encountered the problem (only once), and after closing the error message pop-up the Dashboard has successfully completed the comparison and rendered the results.

With the open (free) version of the Shiny Server we cannot change the session timeout parameter. However, I've managed to find a trick that should do. The session timeout is now extended to ridiculous ten minutes (no comparison will ever take so much time) and the problem should not resurface.

Actually, I still experience the disconnection. I have no idea where the problem may come from. I just see that I see the disconnection when I come back to the tab where Dashboard runs. Yet, I cannot reproduce every time so not sure that is related.

Pamputt added a comment. · Edited · Jul 23 2018, 10:19 PM

If I compare frwiktionary (source) and dewiktionary (target), the first link is "-". There is a problem because frwiktionary also has an entry for "-". Could you check?

I have a similar problem with frwiktionary (source) and eowiktionary (target) with the word "?". It exists on eowiktionary and also on frwiktionary.

The same with 0 (eo) and 0 (fr) (and also with "1")
Also with -a (eo) and -a (fr)

@Pamputt Don't worry, that problem is already fixed. You will not be able to see the changes before tomorrow because I still need to move some 1.2Gb of pre-processed data sets to the back-end of the Dashboard prototype that you are testing (do not ask me why, a technical thing; as I've mentioned, the production version will be downloading these data sets separately). I think that the fix will also cover the similar problem encountered previously on the I miss you tab.

Thank you very much for the tests that you have conducted!

@Pamputt Both the I miss you tab and the Compare tab are now fixed in respect to the reported problems. Please test when you find some time and let me know whether everything falls into place now. Thank you.

As of the disconnections from the server, as well as the fact that the Compare tab in general takes a lot of time to complete its operations for large Wiktionaries: I will see what I can do, but it will be a long run.
For now, given the obvious importance of this feature for Wiktionary editors and the complexity related to its implementation, the goal was set to simply make it work. I would suggest always downloading the comparison table once it is generated and then working with it locally from a spreadsheet software like Libre Calc or similar.

@Lea_Lacroix_WMDE Please review the Dashboard prototype which is (hopefully) in its final pre-release version now. I am on the documentation until then. Thanks.

Once again, as soon as the T199851 review gets done, the Dashboard goes on regular updates.

Hey @GoranSMilovanovic, I went through all the pages of the board, everything sounds fine. Before publishing officially, the next steps would be:

  • provide documentation on meta.wikimedia.org (if you want to share a draft with me beforehand, so I can help you, feel free to! Also if you need help with setting up the meta page and the translation tags.)
  • provide the content paragraphs of the board on another wikipage, so people can help you translating (same comment as above)

Thanks a lot for your work :)

@Pamputt @Lea_Lacroix_WMDE I have implemented a test version of the Compare tab which should be able to significantly (e.g. orders of magnitude faster) improve its efficiency for large Wiktionaries. The feature is not in production yet. I will ping here once it is integrated and online and ask for tests. Thank you.

@Lea_Lacroix_WMDE it is possible that everything for this Dashboard will be ready tomorrow, contrary to my previous prediction that the final touch will take the whole week.

@Lea_Lacroix_WMDE @Pamputt The implementation of the new Compare routine is underway. The Compare tab of the Wiktionary Cognate Dashboard will not be operational for several hours. You will be notified once the changes take effect.

No worries @GoranSMilovanovic, and thanks for letting us know :)

@Lea_Lacroix_WMDE Ok, this will take a bit longer than expected. A complicated trade-off in speed of comparison for small and large Wiktionaries is involved and I need to find the optimal solution. As soon as this is done (today?) I'm implementing the multilingual version (instructions), help links (documentation, etc) and then we can safely go online. Stay in touch.

@Lea_Lacroix_WMDE The new comparison operation is now implemented. However, and in spite of all the work that I've put into it, it provides only a negligible speed-up. It is a bit more robust (i.e. less error prone) than the previous one.

Next steps:

  • implementing multi-lingual instructions
  • documentation links on dashboard
  • announcement.

Given that the statistics/data engineering machinery is completed, I am closing this task as resolved.
What remains is to take care of T200197 which will take some time.

GoranSMilovanovic closed this task as Resolved.Aug 9 2018, 1:30 AM

@GoranSMilovanovic, I tested the "new" version of the Dashboard prototype. I compared enwikt (source) and frwikt (target) and all the words that are listed point to https://fr.wiktionary.org/wiki/ (and not to https://en.wiktionary.org/wiki/xx). In addition, I still get disconnected after 2 or 3 minutes.

Noe added a comment.Aug 9 2018, 9:02 AM

Hi @GoranSMilovanovic!
Great job here! I am keeping an eye on your project and I am very pleased to see it going out! As one of the writers of French Wiktionary Actualités, the monthly journal for lexico-lovers, I'll be pleased to write something about this tool in the August edition (to be out on September 1st). Or maybe Pamputt will write it :)
It will be here: https://fr.wiktionary.org/wiki/Wiktionnaire:Actualités/Brouillon
It is written in French first and then translated to English. At some point, I'll be happy to ping you to proof-read our article. Let us know the best timing for you to have a press announcement!

Cheers

Hey @Noe, the dashboard should be officially out and announced at that time, and I'd be happy to help you with writing or reviewing if needed :)

GoranSMilovanovic reopened this task as Open.Aug 9 2018, 11:44 AM

I have re-opened this because I can see this is where you've decided to have a chat on this Dashboard :)

@Pamputt

I compared enwikt (source) and frwikt (target) and all the words that are listed point to https://fr.wiktionary.org/wiki/ (and not to https://en.wiktionary.org/wiki/xx).

Well, yes: as the instructions (top of page) explain: The Dashboard will generate a table of all entries found in the Target, but not in the Source Wiktionary. If you want it the other way around, then source should be frwiktionary, and target should be enwiktionary.
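
The Source/Target semantics described above amount to a set difference. Here is a minimal Python sketch, assuming the entry titles of each Wiktionary are available as sets; the function name and the sample titles are hypothetical and this is not the Dashboard's actual code:

```python
def missing_in_source(source_titles, target_titles):
    """Return the entries found in the Target but not in the Source, sorted."""
    return sorted(set(target_titles) - set(source_titles))


# Hypothetical sample data, for illustration only
enwikt = {"tree", "pain", "water"}
frwikt = {"tree", "pain", "arbre", "eau"}

# Source = enwikt, Target = frwikt: entries that frwikt has and enwikt lacks
print(missing_in_source(enwikt, frwikt))

# Swapping the roles gives the list the other way around
print(missing_in_source(frwikt, enwikt))
```

The operation is not symmetric, which is why swapping Source and Target yields a different table.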

In addition, I still get disconnected after 2 or 3 minutes.

I understand. The dashboard is very heavy on the front-end. For example, when you compare two large Wiktionaries, say enwiktionary and frwiktionary, it needs to generate a table of several million rows. To avoid having your work interrupted, the Compare tab offers the option to download the table once it has been generated. It is strongly advised to generate large comparisons from the Dashboard, then download the .csv file and continue the work in any spreadsheet software (e.g. LibreOffice Calc, MS Excel, or anything similar; they all handle .csv files easily). After a disconnection, the only thing left for the user to do is reload the Dashboard. Given the technology currently used at the back-end, and after several (time-consuming!) experiments to find the optimal solution, I can say that what we have now is simply the best that we can have for now; one day, in the long run, a fast-serving layer or a different front-end technology might be implemented here to make the Dashboard more robust.
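
For those who prefer a script to a spreadsheet, the downloaded file can also be processed row by row, so a table of several million rows never has to sit in memory at once. A minimal Python sketch follows; the column name "entry" and the sample content are assumptions, since the actual layout of the downloaded .csv is not described here:

```python
import csv
import io


def filter_rows(csv_file, column, predicate):
    """Stream rows from an open CSV file, yielding those whose `column` passes `predicate`."""
    for row in csv.DictReader(csv_file):
        if predicate(row[column]):
            yield row


# Stand-in for a downloaded file; the "entry" header is a hypothetical column name
sample = io.StringIO("entry\narbre\neau\nmaison\n")

for row in filter_rows(sample, "entry", lambda title: title.startswith("e")):
    print(row["entry"])
```

With a real download, `sample` would be replaced by something like `open(path, newline="", encoding="utf-8")`.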

@Noe

At your service. When you have the article in English (unfortunately, I am not a French speaker; I have been re-learning Italian for some time, and French is my next goal), please share it with me. If you need any help or input from me, do not hesitate to contact me: goran.milovanovic_ext@wikimedia.de. As for the timing of the announcement, whatever is good for you is good for me. The Wiktionary Cognate Dashboard in its current version needs just one additional touch in relation to T200197 and @Lea_Lacroix_WMDE and I are taking care of it.

@GoranSMilovanovic thank you very much for the detailed explanations. Still, a few remarks:

  1. In the "Compare" tab, if I click on "Download (csv)" just before clicking on "Generate", I arrive on an error page saying that there is no such file. From an ergonomic point of view, I think this button should be disabled while the user has not yet clicked on the "Generate" button.

I compared enwikt (source) and frwikt (target) and all the words that are listed point to https://fr.wiktionary.org/wiki/ (and not to https://en.wiktionary.org/wiki/xx).

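What I meant is that when I select enwikt as source and frwikt as target, what I should get is a list of entries that are present in frwikt and not in enwikt. The first two entries are "xx" and "µ°C". Two problems:

  • All the links to frwikt are https://fr.wiktionary.org/wiki/ (and not https://fr.wiktionary.org/wiki/xx or https://fr.wiktionary.org/wiki/µ°C as expected)
  • "xx" is listed while it exists on enwikt and on frwikt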

I hope my message is clearer now (I mixed up frwikt and enwikt in my previous message).

  2. Concerning the disconnection, maybe you should add a warning message saying that it may happen and that the user should download the csv file to work on it.

@Pamputt As ever, thank you very much for all the tests that you are providing here. Let's see:

In the "Compare" tab, if I click on "Download (csv)" just before clicking on "Generate", I arrive on an error page saying that there is no such file. From an ergonomic point of view, I think this button should be disabled while the user has not yet clicked on the "Generate" button.

You are absolutely right and I will take care to provide some user feedback on this.

What I meant is that when I select enwikt as source and frwikt as target, what I should get is a list of entries that are present in frwikt and not in enwikt. The first two entries are "xx" and "µ°C". Two problems

All the links to frwikt are https://fr.wiktionary.org/wiki/ (and not https://fr.wiktionary.org/wiki/xx or https://fr.wiktionary.org/wiki/µ°C as expected)

Correct observation, and it needs to be fixed in the Dashboard. I wonder what I have done to break functionality that once worked fine here. It must be something that I did during the implementation of the new compare routine. I will check this out and correct the bug.
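
For reference, the expected link is simply the wiki's base URL followed by the percent-encoded entry title. A minimal Python sketch of that construction follows; the helper name is hypothetical and this is not the Dashboard's actual code. It rejects an empty title so that a link can never silently collapse to the bare /wiki/ path:

```python
from urllib.parse import quote


def entry_url(wiki_lang, title):
    """Build a link to a Wiktionary entry from a language code and a page title."""
    if not title:
        raise ValueError("entry title must not be empty")
    # MediaWiki page URLs use underscores for spaces; quote() percent-encodes
    # the remaining non-ASCII characters as UTF-8
    path = quote(title.replace(" ", "_"))
    return f"https://{wiki_lang}.wiktionary.org/wiki/{path}"


print(entry_url("fr", "xx"))
# MICRO SIGN, DEGREE SIGN, then "C", written with escapes to be explicit
print(entry_url("fr", "\u00b5\u00b0C"))
```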

"xx" is listed while it exists on enwikt and on frwikt

No, it does not exist on enwiktionary. Be careful: "××" is not the same as "xx". As for "××" and its existence on enwiktionary, please check out this page.
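
The two titles look alike on screen but consist of different Unicode characters, which is easy to check. A small Python sketch using only the standard library (the helper name is just for illustration):

```python
import unicodedata


def describe(text):
    """List each character with its code point and official Unicode name."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


latin = "xx"             # two ordinary Latin letters
times = "\u00d7\u00d7"   # two multiplication signs, displayed as "××"

print(latin == times)
for row in describe(latin) + describe(times):
    print(row)
```

Since page titles are matched by their characters and not by their appearance, the two strings refer to two different entries.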

Concerning the disconnection, maybe you should add a warning message saying that it may happen and that the user should download the csv file to work on it.

Yes. Will do.

Thanks a lot!

Pamputt added a comment.EditedAug 13 2018, 3:19 PM

I have another request for this dashboard. Would it be possible to add a new tab (or to add within an existing tab) a list of the pages that are the most linked across all the Wiktionaries? This is a kind of opposite of "I miss you". For example, we would have a list with "a" existing on 97 Wiktionaries; "and", present on 80 Wiktionaries, etc. The list would be sorted from the most linked entries to the least linked ones.

Creating such list may allowed to create list by language for the most popular words in every languages that then may sub-sorted in most popular words by language.

@Pamputt This dashboard needs to go officially live today. I will review this request and if it makes sense, some future version of the dashboard will encompass it. Not in the too distant future, to make myself clear.

Also, I do not really understand the following:

Creating such list may allowed to create list by language for the most popular words in every languages that then may sub-sorted in most popular words by language.

Please clarify the request. Thanks!

Wiktionary contributors might be interested to know which entries are the most popular so as to improve these pages first (add pronunciation, synonyms, etc). Since these entries exist on a large number of Wiktionaries, they might find a lot of information to add through all the Wiktionary editions.

This list might also be used by Lingua Libre in order to identify which "popular" words do not yet have any audio pronunciation.

What would be even more interesting would be to have a list sorted by language, but this is the same problem as the one mentioned by Lea previously.

@Pamputt Understood. Ok, what I can do (quickly, I guess) is to provide a table of the most popular Wiktionary entries, sorted by the number of Wiktionaries that have a page for them, and above some usage threshold (say, all entries used on more than 10, or 50, or so Wiktionaries - otherwise, we will be generating a table of some 18M rows on the client side...). I will be reporting back here as the development progresses in this respect.
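
The table described above could be computed roughly as follows. This is only a Python sketch under stated assumptions: the input is a hypothetical mapping from each Wiktionary to its entry titles, the function name and sample data are made up, and the threshold is a parameter because its value is still undecided:

```python
from collections import Counter


def popular_entries(titles_by_wiki, min_wikis):
    """Rank entry titles by the number of Wiktionaries that have a page for them,
    keeping only titles present on at least `min_wikis` projects."""
    counts = Counter()
    for titles in titles_by_wiki.values():
        counts.update(set(titles))  # each wiki counts at most once per title
    ranked = [(title, n) for title, n in counts.items() if n >= min_wikis]
    # Most widely covered first; ties are broken alphabetically for stable output
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical sample data, for illustration only
sample = {
    "enwiktionary": ["a", "and", "tree"],
    "frwiktionary": ["a", "and", "arbre"],
    "dewiktionary": ["a", "Baum"],
}
print(popular_entries(sample, min_wikis=2))
```

Applying the threshold before sorting and display is what keeps the table far smaller than the full set of some 18M entries.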

@Lea_Lacroix_WMDE Everything is ready to announce the Wiktionary Cognate Dashboard. Please also check T200197#4499861.

@Pamputt All bugs discussed in T166487#4492394 are taken care of. For T166487#4498814 - well, you will have to give me some time.

@Pamputt Thanks for your request. Feel free to create separate tickets for the new features that were not in the initial list. We won't be able to make them happen immediately. We will need to prioritize and estimate the amount of time that is needed for such new features.

In the meantime, the first version of the board is ready to be announced :) I'll take care of this today.

GoranSMilovanovic closed this task as Resolved.Aug 14 2018, 3:00 PM

Closing the task as resolved. Everyone interested is encouraged to open separate tickets for bugs and feature requests. The requests will be prioritized as @Lea_Lacroix_WMDE has explained.