Username: bmansurov
Full name: Bahodir Mansurov
request to add existing admin group "researchers" to servers with role statistics::private (stat1005, stat1007)
(in addition to statistics::crunchers)
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | bmansurov | T171227 [Objective 9.1.4] API and GapFinder prototyping and testing
Resolved | | schana | T148129 Productization of Recommendation API
Resolved | | bmansurov | T201192 Build API to surface 'morelike' article recommendations for missing articles
Resolved | | bmansurov | T208622 Import recommendations into production database
Invalid | Request | None | T210757 Add existing group "researchers" to "hosts that can produce recommendation API dumps" (role statistics::private)
I suggested creating this access request.
Per IRC chat, adding some detail.
What is needed / requested here: add the existing admin group "researchers" to "hosts that can produce recommendation API dumps".
What I know is that stat1007 is an example of a host that fulfills these requirements, but stat1006 is not, and the reason given for that is "because of pyspark".
Given that stat1007 (and stat1005) have role(statistics::private), this should mean adding the researchers group to that role via Hiera, under role/common/ (we are not specifying host names here at all, just roles).
Why is that requested? Because the "researchers" group can then be used to limit access to puppetized MySQL credentials for the recommendation-api database. We already have a puppetized example of doing just this: we write a .my.cnf and then give "researchers" read access to it. So that's nothing new. The issue is just that we don't have that group on hosts that have pyspark installed and can be used to create the needed dumps.
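To illustrate the permission model described here (a credentials file readable by its owner and by one group, and by nobody else), a minimal Python sketch follows. It is not the Puppet code referred to above; the function name, the file layout, and the placeholder credentials are all assumptions made for the example.

```python
import os
import shutil
import stat
import tempfile


def write_group_readable_cnf(path, user, password, group=None):
    """Write a MySQL client config readable by the owner and one group only.

    Hypothetical helper for illustration; returns the resulting permission bits.
    """
    content = f"[client]\nuser={user}\npassword={password}\n"
    # Create the file owner-only first so the password is never world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    if group is not None:
        # Needs sufficient privileges and an existing group, e.g. "researchers".
        shutil.chown(path, group=group)
    # rw for owner, r for group, nothing for others.
    os.chmod(path, 0o640)
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cnf = os.path.join(tmp, ".my.cnf")
        # Placeholder credentials; no group change in this demo.
        print(oct(write_group_readable_cnf(cnf, "example_user", "example_password")))
```

With `group="researchers"` the file would additionally be handed to that group, which only works on a host where the group exists.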
Also, researchers is simply the right group. Access should be given to all members of the research team, and that is not really related to membership in the other groups, analytics-privatedata-users and statistics-privatedata-users. bmansurov only has shell access here via one of those other groups, which are actually for unrelated things.
There is likely already a large overlap between the users in researchers and in the "privatedata" groups, but they are not identical. So some users would get new access via this, but it seems the right thing to do. It just adjusts to the reality that researchers also have access to "statistics::private" servers, and it does so in a clean way that also scales with changes in the research team without requiring separate individual access requests in the future.
What it would mean: all members of "researchers":
members: [awight, bmansurov, catrope, dartar, mneisler, dduvall, esanders, ezachte, gilles, halfak, jforrester, jkatz, jmorgan, kaldari, nathante, leila, mattflaschen, milimetric, nettrom, bearloga, nuria, ori, otto, cooltey, tonina, mforns, jdlrobson, dr0ptp4kt, tgr, marktraceur, jhernandez, joal, daisy, mholloway-shell, ebernhardson, niedzielski, neilpquinn-wmf, tbayer, dbrant, maxsem, jminor, etonkovidova, sbisson, addshore, matmarex, elukey, nikerabbit, dstrine, jsamra, jdittrich, chelsyx, ovasileva, mtizzoni, panisson, paolotti, ciro, debt, fdans, mlitn, niharika29, goransm, pmiazga, dsaez, shiladsen, cicalese, mirrys, sharvaniharan, mmiller, amire80, rush, tieu, kharlan, gbirke, isaacj, jdl]
would get new shell access on stat1005/stat1007, unless they already have it anyway through one of:
members: [ezachte, milimetric, dartar, halfak, awight, dr0ptp4kt, nuria, leila, nettrom, mforns, bmansurov, tbayer, joal, imarlier, tjones, legoktm, dcausse, bearloga, atgomez, dstrine, marktraceur, mtizzoni, panisson, paolotti, ciro, melodykramer, fdans, shiladsen, esanders, risler, nathante, chelsyx, jdl]
or
members: [dartar, halfak, jdlrobson, jmorgan, bearloga, mattflaschen, mhurd, awight, jforrester, marktraceur, nuria, leila, gilles, dbrant, tgr, dr0ptp4kt, brion, bsitzmann, amire80, dduvall, nettrom, mforns, jkatz, ebernhardson, mlitn, tbayer, joal, kartik, nikerabbit, pcoombe, neilpquinn-wmf, maxsem, jminor, atgomez, dstrine, ladsgroup, ovasileva, shiladsen]
It might be surprising how large the research team actually is though.
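The question of who would actually gain new access is a set difference: members of "researchers" who are in neither of the two existing lists. A small Python sketch, using only a short excerpt of the names quoted above (the full lists would give a longer result); the function and variable names are made up for the example:

```python
def newly_granted(target_group, *existing_groups):
    """Return members of target_group that appear in none of existing_groups,
    keeping the order in which they are listed in target_group."""
    already = set().union(*existing_groups)
    return [member for member in target_group if member not in already]


# Short excerpts of the member lists quoted above, not the complete lists.
researchers_excerpt = ["awight", "bmansurov", "catrope", "dartar", "kaldari"]
first_list_excerpt = ["dartar", "awight", "bmansurov"]
second_list_excerpt = ["dartar", "awight"]

if __name__ == "__main__":
    # Prints ['catrope', 'kaldari'] for these excerpts.
    print(newly_granted(researchers_excerpt, first_list_excerpt, second_list_excerpt))
```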
The alternative is creating an entirely new group with a better name that is limited to fewer people (who?) and adding that group to these hosts.
The actual need is a "group of people that have access to the recommendation API MySQL db" that is larger than one person but smaller than "anyone with shell".
What are recommendation API dumps? If they are destined for a production API, we should probably find a better place for them to be produced than stats boxes. This process seems similar to what the search platform team does, and for that they use Hadoop, not stats boxes. Maybe I am missing context here. Could @bmansurov expand a bit on what "recommendation API dumps" are? Who consumes them? The API? External users?
@Nuria we are using Spark, Wikidata dumps in Hadoop, and some Hive tables to generate recommendations on stat1007. The results are saved as TSV files and are going to be imported into MySQL. The data will be exposed as a service. Users include ContentTranslation for now. Currently, we're trying to find the best way to import those TSV files into a production MySQL database.
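For readers unfamiliar with this kind of import, a rough Python sketch of loading a TSV file into a SQL table in batches follows. It is not the import process used for this project: the table and column names are hypothetical, and the standard-library sqlite3 module stands in for MySQL so the example runs without a third-party driver (MySQL drivers typically use `%s` placeholders and a cursor rather than `?` and the connection shortcut).

```python
import csv
import io
import sqlite3


def load_tsv(connection, table, columns, tsv_file, batch_size=1000):
    """Insert rows from a TSV file object into `table` in batches.

    `table` and `columns` are interpolated into the SQL text, so they must
    come from trusted code, never from user input. Returns the row count.
    """
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    total = 0
    batch = []
    for row in csv.reader(tsv_file, delimiter="\t"):
        batch.append(row)
        if len(batch) >= batch_size:
            connection.executemany(sql, batch)
            total += len(batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        connection.executemany(sql, batch)
        total += len(batch)
    connection.commit()
    return total


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, invented for this example.
    conn.execute(
        "CREATE TABLE recommendation (source_id TEXT, target_wiki TEXT, score REAL)"
    )
    sample = io.StringIO("Q1\tuzwiki\t0.9\nQ2\tuzwiki\t0.4\n")
    print(load_tsv(conn, "recommendation", ["source_id", "target_wiki", "score"], sample))
```

Batching keeps memory use bounded for large files and avoids issuing one statement per row.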
@bmansurov We do not recommend generating these on stats boxes; stats boxes are reserved for data munching by regular users, not for a production service.
That is what Hadoop is for: we have 50 hosts in Hadoop, so we have 1) more availability and 2) more resiliency. Let's work together to move these jobs to oozie/spark in Hadoop, and we can help you load the data into your storage. This use case is similar to data loading for any of our APIs.
@Dzahn let's please hold off on any changes. Stats boxes are meant for individual data munching by users; once we are producing data for a production service, we should move those jobs to Hadoop and load the data into the service asynchronously.
@Nuria understood! thank you for your prompt comments and don't worry, this wouldn't have moved without approval but was started to get this discussion going and to describe the current technical blocker of puppetizing that .my.cnf file on stat1007.
@Nuria I also have this pending gerrit change that I will put on hold: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476098/ It was meant to git clone the software that @bmansurov needs to create the dumps into a new /srv/research dir on stats::private hosts. Of course that could easily be applied on other hosts; I just wouldn't know which ones would be the right ones.
@Dzahn I see, let's abandon that change. The stats machines' capacity is to be used by individuals, not production services. We have 50 nodes in Hadoop that are much better suited for this purpose. They provide more computation power but also more resiliency.
Ok Nuria! Makes sense. I abandoned the change and am closing this access request as invalid/declined.