
Provide Hatstall to fix syncing / updating of identities on the wikimedia.biterg.io production instance
Closed, Resolved · Public

Description

Upstream ticket: https://gitlab.com/Bitergia/c/Wikimedia/support/issues/18

I merged https://github.com/Bitergia/mediawiki-identities/commit/50ab30725ea9d6eb03487bb9eb1849965a4d8d1c on 2017-02-09 for T157569. That change added numerous additional organizations, plus enrolled nearly 500 accounts into those orgs.

Today on 2017-02-12, all four data sources on https://wikimedia.biterg.io/app/kibana#/dashboard/Data-Status state "Last Retrieval: February 10th, 2017".

However going to https://wikimedia.biterg.io/app/kibana#/dashboard/Git I see none of those additional organizations listed.

[How often] Is the data in https://github.com/Bitergia/mediawiki-identities/ deployed onto wikimedia.biterg.io?
How could I find out that some DB version has been deployed?

Related Objects

Event Timeline

  • There are also some orgs that we don't have in our DB, such as "Debian GNU/Linux". Looks like a configuration change that is a regression of T100189#1580919? Do you want to track reverting that in another task?
  • "Last Attracted Developers" on https://wikimedia.biterg.io/app/kibana#/dashboard/Git-Demographics also looks funny - it says I started contributing today and lists some authors without any "first commit date". Again, do you want to track that in another task?

Whatever is better for you. I will report it also internally so we can start tracking how to solve it :)

Whoops, our fault, yes. bots.wmflabs.org is now wm-bot.wmflabs.org (as per T159734).

Great! We will start the retrieval ASAP.

Thanks for pushing it!

  • There are also some orgs that we don't have in our DB, such as "Debian GNU/Linux". Looks like a configuration change that is a regression of T100189#1580919? Do you want to track reverting that in another task?

Split into T161308: Only display organizations defined in Wikimedia's DB (disable assuming orgs via hostnames in email addresses).

Split into T161309: Git's "Last Attracted Developers" lists established developers and developers without a First Commit Date.

I merged https://github.com/Bitergia/mediawiki-identities/commit/50ab30725ea9d6eb03487bb9eb1849965a4d8d1c on 2017-02-09 for T157569. That change added numerous additional organizations, plus enrolled nearly 500 accounts into those orgs.

We've re-generated the database and migrated to our latest software version, so the changes should all be applied now.

Today on 2017-02-12, all four data sources on https://wikimedia.biterg.io/app/kibana#/dashboard/Data-Status state "Last Retrieval: February 10th, 2017".

We are trying to keep the data as up to date as we can. We've struggled a bit due to a version migration and applying bugfixes, but I hope that from now on the data will be updated every day.

However going to https://wikimedia.biterg.io/app/kibana#/dashboard/Git I see none of those additional organizations listed.

They should be available now. I've checked a few of the identities in the file and the users are properly affiliated in the database.

[How often] Is the data in https://github.com/Bitergia/mediawiki-identities/ deployed onto wikimedia.biterg.io?

Currently this is not an automatic process.

How could I find out that some DB version has been deployed?

Actually, I think just by asking us. But it shouldn't be a big deal to update it every time a change is made in the repository!

I merged https://github.com/Bitergia/mediawiki-identities/commit/50ab30725ea9d6eb03487bb9eb1849965a4d8d1c on 2017-02-09 for T157569. That change added numerous additional organizations, plus enrolled nearly 500 accounts into those orgs.

We've re-generated the database and migrated to our latest software version, so the changes should all be applied now.

@Albertinisg: Hmm, I'm afraid not...? See user with the ID 3dfcaa955728028d150de81cf3e10a003b39bef2 and their organization in wikimedia-affiliations.json.
Now search for that user only: https://wikimedia.biterg.io:443/goto/eb5fd6a8cac70ce8bca3526d7c030365 will incorrectly display "Organization: Independent".
Now search for that org only: https://wikimedia.biterg.io:443/goto/d602418725c536c5ab50e1b6aedc89e7 will display no results.

Or, going to https://wikimedia.biterg.io:443/goto/063b6ccbfaa28126733d154f572712c3 (filter: Org=Independent) and looking at the list of "Submitters", I see Hashar in that list, but uuid 024c5043e1c9d13f31c4b1d69d3be47c86c03d0b in wikimedia-affiliations.json says "organization": "Wikimedia Foundation". Same for Ejegg (07c18c8e0323f2b846510b0c63aa5184ad1af1a8) or jforrester (0848fcd3d184007080330da369b363292812c126).

[How often] Is the data in https://github.com/Bitergia/mediawiki-identities/ deployed onto wikimedia.biterg.io?

Currently this is not an automatic process.

Are there plans to make it an automatic process, and are / should such plans be in a task?

I merged https://github.com/Bitergia/mediawiki-identities/commit/50ab30725ea9d6eb03487bb9eb1849965a4d8d1c on 2017-02-09 for T157569. That change added numerous additional organizations, plus enrolled nearly 500 accounts into those orgs.

We've re-generated the database and migrated to our latest software version, so the changes should all be applied now.

@Albertinisg: Hmm, I'm afraid not...? See user with the ID 3dfcaa955728028d150de81cf3e10a003b39bef2 and their organization in wikimedia-affiliations.json.
Now search for that user only: https://wikimedia.biterg.io:443/goto/eb5fd6a8cac70ce8bca3526d7c030365 will incorrectly display "Organization: Independent".
Now search for that org only: https://wikimedia.biterg.io:443/goto/d602418725c536c5ab50e1b6aedc89e7 will display no results.

Or, going to https://wikimedia.biterg.io:443/goto/063b6ccbfaa28126733d154f572712c3 (filter: Org=Independent) and looking at the list of "Submitters", I see Hashar in that list, but uuid 024c5043e1c9d13f31c4b1d69d3be47c86c03d0b in wikimedia-affiliations.json says "organization": "Wikimedia Foundation". Same for Ejegg (07c18c8e0323f2b846510b0c63aa5184ad1af1a8) or jforrester (0848fcd3d184007080330da369b363292812c126).

OK, I see the issue here. This should be fixed when we remove the organizations we loaded by mistake and fix the affiliations.

[How often] Is the data in https://github.com/Bitergia/mediawiki-identities/ deployed onto wikimedia.biterg.io?

Currently this is not an automatic process.

Are there plans to make it an automatic process, and are / should such plans be in a task?

Yes, but I can't tell when yet. For this purpose, and to avoid conflicts with the new database, I will stop the database updates from the legacy dashboard (the one pushing changes into GitHub).

According to Bitergia, T157898 and T161235 (and to some extent T157709, though half of that is T161308) are side effects of an older version, as the DB is not synced automatically. This should be fixed within the next 2-3 weeks.

@Aklapper the issue itself should be fixed now. However, as we are still not updating the database automatically, I think we can keep the issue open, or close it and continue in T161235: https://wikimedia.biterg.io shows 2017 contributors who are not listed in mediawiki-identities/wikimedia-affiliations.json, as we expect the same behavior there. What do you think?

Aklapper renamed this task from Updated data in mediawiki-identities DB not deployed onto wikimedia.biterg.io? to Automatically sync mediawiki-identities/wikimedia-affiliations.json DB dump file with the data available on wikimedia.biterg.io. May 5 2017, 2:24 PM

@Albertinisg: Yay, thanks. I'm going to do it the other way round: Edit the task summary here and merge T161235 into this task. :)

This might not be related at all, but...
the last update to the file https://github.com/Bitergia/mediawiki-identities/wikimedia-affiliations.json was on April 7th.
Taking that file and importing it into a fresh MariaDB database on Fedora 25 via sortinghat load wikimedia-affiliations.json, all UUIDs get changed, à la

+ 00000ba7f563234e5f239e912f2df1521695122e (old fa0f23d092d9f963b13b7393633ffc905daabd60) loaded

So now I need to manually find all the IDs I wanted to update, and first manually update the sortinghat commands I was about to run against the DB.
Meh. I hope that does not always happen now. I wonder what the reason is, and whether I should just go ahead and push my changes to the GitHub file (once I've updated the UUIDs in the list of sortinghat commands that change entries in the DB).
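
For reference, a minimal sketch of that local import, assuming the 0.x MySQL-backed sortinghat CLI; the database name and credentials are illustrative:

# create a fresh Sorting Hat database and load the GitHub dump
sortinghat -u root -p rootpw init shdb
sortinghat -u root -p rootpw -d shdb load wikimedia-affiliations.json
# export again to see the newly assigned UUIDs
sortinghat -u root -p rootpw -d shdb export --identities exported.json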

And (back to topic) generally speaking, I'm wondering what the plan forward is here, as the current situation makes it hard for me to update info about some community members on wikimedia.biterg.io (such as affiliation) in a timely manner, which slightly reduces the value of the info on wikimedia.biterg.io about recent/new contributors.

This might not be related at all, but...
the last update to the file https://github.com/Bitergia/mediawiki-identities/wikimedia-affiliations.json was on April 7th.
Taking that file and importing it into a fresh MariaDB database on Fedora 25 via sortinghat load wikimedia-affiliations.json, all UUIDs get changed, à la

+ 00000ba7f563234e5f239e912f2df1521695122e (old fa0f23d092d9f963b13b7393633ffc905daabd60) loaded

So now I need to manually find all the IDs I wanted to update, and first manually update the sortinghat commands I was about to run against the DB.
Meh. I hope that does not always happen now. I wonder what the reason is, and whether I should just go ahead and push my changes to the GitHub file (once I've updated the UUIDs in the list of sortinghat commands that change entries in the DB).

Hmmm, I'm afraid you are using the latest sortinghat version, which has differences from older versions: https://github.com/grimoirelab/sortinghat#compatibility-between-versions.

If so, you can do the following (sketched below):

  • Install an old Sortinghat version (for example 0.2.0) and import the GitHub file into a clean database
  • Perform the changes you need (for UUID compatibility)
  • Export the identities into a file and push it to the GitHub repository
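
A minimal sketch of those steps, assuming sortinghat 0.2.0 is pip-installable into a virtualenv; the database name, credentials, and uuid are placeholders:

# step 1: install the old version and import the GitHub file into a clean database
pip install sortinghat==0.2.0
sortinghat -u root -p rootpw init shdb_old
sortinghat -u root -p rootpw -d shdb_old load wikimedia-affiliations.json
# step 2: perform the changes you need (enroll is just one example of such a change)
sortinghat -u root -p rootpw -d shdb_old enroll <uuid> "Wikimedia Foundation"
# step 3: export the identities and push the file to the GitHub repository
sortinghat -u root -p rootpw -d shdb_old export --identities wikimedia-affiliations.json
git commit -am "Update affiliations" && git push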

And (back to topic) generally speaking, I'm wondering what the plan forward is here, as the current situation makes it hard for me to update info about some community members on wikimedia.biterg.io (such as affiliation) in a timely manner, which slightly reduces the value of the info on wikimedia.biterg.io about recent/new contributors.

This is something we have ready and were about to implement this week. But as you need to perform changes in the DB, I guess it's better if you push the latest changes you need into the GitHub repository. Then we will import the file into our database to sync (all the UUIDs will change) and start synchronizing the file with the GitHub repository, with the changes you applied and the database we have in production.

What do you think? Will that work for you? If you have any doubts or need any assistance, let me know!

I'm afraid you are using the latest sortinghat version

Oh, I missed compatibility-between-versions! Thanks a lot! It all makes sense now! I pulled sortinghat 0.2.0.
Going to push an updated JSON file with "old" UUIDs on European Friday (tomorrow) morning.
Feel free to sync to production afterwards && export current production into the JSON dump. That would be great. :)

Great!! I'll wait for that and keep you updated on the process.

Updated JSON file pushed to GitHub. (Note that I used sortinghat export --identities, but there is also one new org included: Wikimedia Sverige.)

Let me explain the procedure I will follow for this. The file you were editing comes from the legacy dashboard, and we are now using different sources. Here is the mapping:

wikimedia:irc --> supybot
wikimedia:its --> bugzilla
wikimedia:its_1 --> phabricator
wikimedia:mediawiki --> mediawiki
wikimedia:mls --> pipermail
wikimedia:scm --> git
wikimedia:scr --> gerrit

And the procedure will be:

  • Load the file you've uploaded into our production database
  • Export the identities with sortinghat into a file
  • Modify the sources in the file, replacing the old ones with the new ones (see the sketch after this list). This will lead to duplicated identities in the file, but as soon as we import it, sortinghat will keep the relations and not duplicate the identities.
  • Import the file into a clean database
  • Regenerate new indexes
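
A sketch of the source rewrite in step 3, assuming the exported JSON stores each identity's source as a plain "source" string, so a textual replacement is enough (filename illustrative):

# replace the legacy source names with the new ones in the exported dump
sed -i -e 's/"wikimedia:irc"/"supybot"/g' \
       -e 's/"wikimedia:its_1"/"phabricator"/g' \
       -e 's/"wikimedia:its"/"bugzilla"/g' \
       -e 's/"wikimedia:mediawiki"/"mediawiki"/g' \
       -e 's/"wikimedia:mls"/"pipermail"/g' \
       -e 's/"wikimedia:scm"/"git"/g' \
       -e 's/"wikimedia:scr"/"gerrit"/g' exported.json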

After that, with the clean database, we will export a clean file and start synchronizing it on GitHub. I just wanted to make you aware of the change in the sources, as I think it is an important one and I hope it is not inconvenient for you.

I also want to inform you that, even though the database will be up to date, the identities in the dashboard will not be updated automatically. This is a work in progress, and if you want, we can keep you updated on that in another ticket or here.

Let me explain the procedure I will follow for this. The file you were editing comes from the legacy dashboard, and we are now using different sources.
[...]
After that, with the clean database, we will export a clean file and start synchronizing it on GitHub. I just wanted to make you aware of the change in the sources, as I think it is an important one and I hope it is not inconvenient for you.

Sounds all great, looking forward to the updated version! Thanks for the heads-up, no inconvenience at all. :)

I also want to inform you that, even though the database will be up to date, the identities in the dashboard will not be updated automatically. This is a work in progress, and if you want, we can keep you updated on that in another ticket or here.

I'm not sure I fully understand. :) Is "the database" the (soon to be updated) JSON dump file in GitHub? Does "up to date" mean there will be regular updated DB dumps into that JSON file on GitHub, but if I push changes via sortinghat into that JSON dump file, they will not automatically be reflected in production on wikimedia.biterg.io? If so, I'd say that's what this very task is already about.

I'm not sure I fully understand. :) Is "the database" the (soon to be updated) JSON dump file in GitHub? Does "up to date" mean there will be regular updated DB dumps into that JSON file on GitHub, but if I push changes via sortinghat into that JSON dump file, they will not automatically be reflected in production on wikimedia.biterg.io? If so, I'd say that's what this very task is already about.

Yes, sorry for the misunderstanding. I mean the GitHub JSON file. There will be regularly updated DB dumps. If you push changes, those changes will be automatically pushed into our production MySQL sortinghat DB, but not reflected in the dashboard, as for that we need to update the indexes. This is something we are working on with high priority; we have the functionality ready but still need to integrate it properly. In the meantime, you can notify me whenever you push some changes, and I can update the indexes manually.

@Albertinisg: Awesome! Thanks a lot! I am going to notify you the next time I've pushed an updated JSON file (probably about once a month).

All right, so I've generated new indexes with the latest changes and the identities file I'm about to push to GitHub, and overall it looks good (just IRC is still on its way). The file now contains around 50k fewer duplicated identities :)

I'm now wondering what you would like me to do: add a new file to the GitHub repository, or override the one we are using?

Just as a reminder, the new file will be for the latest SH version, so if you are still using the old one, you should update it.

@Albertinisg: Nice! Please override the dump file in GitHub. (I'm afraid I'm the only consumer of it anyway, so there's not much cost in switching back to the latest SH version.)

@Aklapper GitHub DB dump file updated to the latest version by https://github.com/Bitergia/mediawiki-identities/commit/6596aca9e13c1c9e366418eb51af0ca06b7595c5

Next week I hope to start synchronizing the database every day as we did before. Don't forget to upgrade sortinghat and load the file into a new database :)
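
A sketch of that upgrade and re-import, with the database name and credentials again illustrative:

# upgrade to the latest Sorting Hat and load the new dump into a fresh database
pip install --upgrade sortinghat
sortinghat -u root -p rootpw init shdb_new
sortinghat -u root -p rootpw -d shdb_new load wikimedia-affiliations.json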

Next week I hope to start synchronizing the database every day as we did before.

Great, looking forward to that!

Don't forget to upgrade sortinghat and load the file into a new database :)

In progress... :)

The dump is being uploaded every day at midnight. Working in parallel on T168217: Find out (and fix) why we have a higher number of identity entries than before switching to new Bitergia DB scheme. I'll keep this open until the feature to automatically update the indexes and show the changes in the dashboard is ready.

Hmm, is it a side effect that Owlbot overwrites any changes that I push to the repository?
See https://github.com/Bitergia/mediawiki-identities/commits/master/wikimedia-affiliations.json :
On July 10 I pushed https://github.com/Bitergia/mediawiki-identities/commit/ca811ca666ec3d05f6a7d01dbeb6e9ef172d1887
After https://github.com/Bitergia/mediawiki-identities/commit/39bb9a8f34f9527daf2aa53b5e2c8f4d044338f8 by Owlbot on July 11, my changes are not included anymore. I noticed because the number of "unique identities" printed after running sortinghat import is back to where it was a day before...

There was an old method to sync databases that was not working properly. I'm deploying the new one at the same time data gets refreshed and updated hourly.

Data is updated using @Aklapper's SH database. Now it is time to enable the latest feature to include these changes without overwriting anything.

Status update:

As discussed in non-public https://gitlab.com/Bitergia/c/Wikimedia/support/issues/1 , the (current) workflow of me pushing changes as a JSON file, having the production DB load / merge them, and also having Owlbot add new data is doomed:

We are using in your use case the "load" feature of SH in a scenario which is different from which it was originally designed. [...] The command "load" does not work for a general sync between the databases [...]

The long-term proposal is to have Bitergia provide me with access to the SH database in production.
Note that I'd still love to have the regular DB dumps to allow me to locally prepare a bunch of sortinghat commands in a bash script, to then run that bash script in production.
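
For example, a hypothetical batch of that kind, prepared locally against the dump; the UUIDs, organization, database name, and credentials below are placeholders, not real entries:

#!/bin/bash
# batch of identity updates prepared locally, to be run against the production DB
set -e
SH="sortinghat -u root -p rootpw -d shdb"
$SH enroll 0123456789abcdef0123456789abcdef01234567 "Wikimedia Foundation"
$SH merge 89abcdef0123456789abcdef0123456789abcdef 0123456789abcdef0123456789abcdef01234567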

Note to myself: Once this has happened, document this lock-in on my private Continuity wiki page.

Aklapper lowered the priority of this task from High to Lowest. Dec 31 2017, 3:34 PM
Aklapper moved this task from Oct-Dec 2017 to Jan-Mar-2018 on the Developer-Advocacy board.
Aklapper raised the priority of this task from Lowest to High. Dec 31 2017, 3:38 PM

This task has high priority but it is not assigned to anyone. Is it committed to this quarter?

Aklapper moved this task from Backlog to March on the Developer-Advocacy (Jan-Mar-2018) board.
Aklapper added a subscriber: Acs.

AFAIK Albertinisg left. The plan is to provide early access to a beta version of Hatstall (a browser UI to update profiles), according to the conversation @Acs and I had at FOSDEM 2018. Hence assigning to @Acs. Crossing fingers for late March; it might end up in Q2/2018 though.

Aklapper renamed this task from Automatically sync mediawiki-identities/wikimedia-affiliations.json DB dump file with the data available on wikimedia.biterg.io to Provide Hatstall to fix syncing / updating of identities on the wikimedia.biterg.io production instance. Feb 7 2018, 2:39 PM

As of today, Bitergia allows us to access a beta version of Hatstall at https://identities.biterg.io/ (thanks a lot!).

I've updated our docs at https://www.mediawiki.org/w/index.php?title=Community_metrics&type=revision&diff=2757707&oldid=2754981 accordingly, plus my internal 'continuity' page about this access.