Data set review for the Wiktionary Cognate Dashboard
Closed, Resolved · Public

Description

We need a review and approval before making the data sets that support the Wiktionary Cognate Dashboard available publicly from /srv/published-datasets on stat1005.

The data sets that need a review are currently found in: /home/goransm/RScripts/Wiktionary/Wiktionary_CognateDashboard/

NOTE: Only the .csv files described in the README.txt document (found in the same directory) will be made public and thus need to be reviewed.

NOTE: None of the files contain any private data; they hold only aggregate statistics and the results of statistical modeling of Wiktionary projects and their Cognate extension database.

Thank you.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 18 2018, 12:46 AM

Ping, Analytics: can anyone please take a quick look at this? It is really a simple public dataset review that should take only a moment to complete, and we need to put the machinery that will use these data online. Thanks a lot!

Milimetric triaged this task as Medium priority.
Milimetric added a project: Analytics-Kanban.
Milimetric moved this task from Incoming to Data Quality on the Analytics board.

I reviewed the files. They look ok. In general if you're just analyzing content, in this case what articles are available in different wiktionaries, then it doesn't need to be reviewed before publishing. Only if you start mixing in any data that's not otherwise public. But all of the data in your analysis could be obtained from public databases, right?

GoranSMilovanovic added a comment. · Edited · Aug 6 2018, 1:15 AM

@Milimetric Thank you very much.

> In general if you're just analyzing content, in this case what articles are available in different wiktionaries, then it doesn't need to be reviewed before publishing.

In most cases, I would say 99%, my work involves analyzing content in the sense described above, or something similar.

> Only if you start mixing in any data that's not otherwise public.

I guess that means things like user analytics, where some fields need to be anonymized or reported only in aggregate. I perform these types of analytics too, e.g. campaign evaluations for WMDE, but typically I do not need public datasets generated in production for such cases.

> But all of the data in your analysis could be obtained from public databases, right?

I'd say yes.

Thanks again, Dan.

GoranSMilovanovic closed this task as Resolved. · Aug 7 2018, 10:17 AM