
Host OKAPI HTML dumps on public-facing labstore servers
Open, High, Public

Description

Let's work out the details (now updated with answers):

  • How many dump runs do we keep?
    • 3 for now
  • Do we need to stop keeping so many of some other type of dump?
    • Not at the moment
  • What credentials do we need to retrieve files from AWS?
    • Fixed text strings which have been provided to us and will be stored in the private puppet repo
  • We won't be rsyncing because we only want one of the many daily runs that OKAPI will have available; does this mean a custom script?
    • Yes and it's done-ish
  • Do we want to just proxy for the given files instead? This could incur AWS costs, and would mean being clever about only serving requests for certain runs.
    • No we don't.

What about the future? There will be other datasets; what will we do about space for those? That will be discussed in a future task if needed.

TODOs remaining:

  • Add cleanup of tmp files to downloader script
  • Puppetize the enterprise_html/run and tmp dirs so the script does not need to create them
  • Add downloader script to puppet
  • Add Enterprise credentials to puppet
  • Run downloader script via systemd timer instead of manually -- IN PROGRESS
  • Run rsync from web server to nfs server (i.e. from one labstore box to the other) of Enterprise dumps, via systemd timer
  • Add cleanup of older downloads so we keep only a specified number of runs
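The last TODO above (pruning older downloads) can be sketched roughly as follows. This is a minimal sketch, not the actual cleanup code: the flat `runs/` layout and date-named run subdirectories are assumptions.

```python
#!/usr/bin/env python3
"""Prune old Enterprise HTML dump runs, keeping only the newest N.

Sketch only: assumes each run lives in a subdirectory of base_dir whose
name sorts chronologically (e.g. YYYYMMDD).
"""
import shutil
from pathlib import Path


def prune_runs(base_dir, keep=3):
    """Remove run directories beyond the newest `keep`; return the names removed."""
    runs = sorted(
        (d for d in Path(base_dir).iterdir() if d.is_dir()),
        key=lambda d: d.name,  # date-style names sort chronologically
        reverse=True,
    )
    removed = []
    for old in runs[keep:]:
        shutil.rmtree(old)
        removed.append(old.name)
    return removed
```

A job like this would run after each successful download, on both the web server and the NFS box.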

I'll add other things if they come up, hopefully they won't.

Event Timeline


Hey @ArielGlenn https://api.enterprise.wikimedia.com/v1/docs/index.html is the new hostname. Sorry, I thought I sent you an email with additional info. Should have posted it here initially.

Thanks, I've got things working now. The hostname was wrong :-) And I forgot to save this comment, written on the same day you answered my question!

I'll be doing some live testing on the labstore web server on Monday. A script that is not perfect but is OK for testing is ready to go; just in case it saturates disk or bandwidth, we don't want any pages over the weekend. WMCS folks have it on their radar.

@Protsack.stephan Just a note that when I retrieved the full list of projects and sizes a couple days ago, the size listed for alswiki was

{"name":"Wikipedia","identifier":"alswiki","url":"https://als.wikipedia.org","version":"bf26de5468eb5139d3c34b2cf58f1d2b","date_modified":"2021-10-12T00:14:26.239013126Z","size":{"value":7.89,"unit_text":"MB"}},

and now today's list seems ok. Any idea what's going on there? It's not that I use these sizes for anything but that seems concerning. (Size listed today is 272.5 MB)
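A sanity check like the one that caught this could be scripted against two successive project listings. This is a hypothetical sketch, not part of the downloader: the entry shape is taken from the JSON sample above, and `flag_size_jumps` is an illustrative helper name.

```python
# Unit multipliers for the "size" field as it appears in the listing above.
UNITS = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}


def size_mb(entry):
    """Return a project's listed size normalized to megabytes."""
    size = entry["size"]
    return size["value"] * UNITS[size["unit_text"]]


def flag_size_jumps(old_listing, new_listing, factor=10):
    """Yield (identifier, old_mb, new_mb) for projects whose listed size
    changed by more than `factor` in either direction between listings."""
    old = {e["identifier"]: size_mb(e) for e in old_listing}
    for entry in new_listing:
        ident = entry["identifier"]
        if ident in old and old[ident] > 0:
            new_mb = size_mb(entry)
            ratio = new_mb / old[ident]
            if ratio > factor or ratio < 1 / factor:
                yield ident, old[ident], new_mb
```

With the alswiki figures quoted above (7.89 MB then 272.5 MB), the ratio is about 34x, so it would be flagged.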

We were missing some data on a couple of projects, so I ran a couple of scripts to fix that, and this project was one of them. So no worries; we don't envision this kind of rapid size growth in the future.

Ah no, I knew the project hadn't grown, because I checked the XML/SQL dumps. But I was concerned that the data was incorrect or missing. Thanks.

Script suitable for testing with minimal monitoring now at https://github.com/apergos/okapi-downloader and I expect to do at least a partial test run later today.

Change 731768 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] index page and directory for Wikimedia Enterprise HTML dumps

https://gerrit.wikimedia.org/r/731768

I am starting a test run of the script wm_enterprise_downloader.py in a screen session created by the user 'ariel' on labstore1006.wikimedia.org; the script is running as the dumpsgen user. You can see what it is doing by looking at /var/lib/dumpsgen/html_downloader.log on the same host.

Let's see how it goes!

Namespace 6 has completed, and namespace 14 is underway.

Amazing :), excited to see how it does!

The script just completed, for a runtime of less than 15 hours, not bad at all. Now to do the integrity checks.

Nice, thanks for keeping us posted.

Files look great. I'll make them available soon, without a public mailing or announcement though, that's y'all's department :-)

Integrity checks took 30 minutes total to complete; I'll likely fold that check directly into the download script.
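The check (per a later comment, really just an md5 hash comparison) could be folded into the download path with something like the sketch below. This is illustrative only; the function name and the source of the expected digest are assumptions.

```python
import hashlib


def md5_matches(path, expected_md5, chunk_size=1 << 20):
    """Stream a downloaded dump file in 1 MB chunks and compare its md5
    digest against the expected hex digest (e.g. from the API metadata)."""
    digest = hashlib.md5()
    with open(path, "rb") as infile:
        for chunk in iter(lambda: infile.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5
```

Streaming in chunks keeps memory flat even for multi-GB dump files, so the check can run right after each download completes.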

Change 731768 merged by ArielGlenn:

[operations/puppet@production] index page and directory for Wikimedia Enterprise HTML dumps

https://gerrit.wikimedia.org/r/731768

Thank you very much everyone!

@ArielGlenn - can you please add ''enterprise'' as a listing on https://dumps.wikimedia.org/other/ ? Thank you

I was planning to add it as soon as you're ready to go public about this access. I'd follow it up with mail to the usual places (wikitech-l, the research mailing list, xmldatadumps-l).

Change 731981 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] add Enterprise HTML dumps to the other dumps listing

https://gerrit.wikimedia.org/r/731981

Change 731981 merged by ArielGlenn:

[operations/puppet@production] add Enterprise HTML dumps to the other dumps listing

https://gerrit.wikimedia.org/r/731981

After a discussion with Enterprise folks via chat, I've gone ahead and updated the 'other dumps' page and will send the usual emails shortly. Yay!

Mail sent to wiki-research-l, xmldatadumps-l, wikitech-l.

TODOs from my end:

  • integrate the integrity check (really just an md5 hash check) into the downloading script
  • make sure that when we get a 404 it's handled properly
  • think about whether to split the download into three jobs (one per namespace), or do the large wikis on a separate day: anything to split up the run so that each portion can start once the day's dumps are available from Enterprise for download, and so that downloads complete before the next round starts being generated
  • automate via systemd timer

Requests for your end:

  • bz2 or other better compression; gz is very inefficient, and we should not assume that all our users have high-bandwidth connections

I did a one-off rsync of the enterprise_html/runs/ directory from labstore1006 to labstore1007. Command (from labstore1007 as root):

rsync -av --bwlimit=160000 labstore1006.wikimedia.org::data/xmldatadumps/public/other/enterprise_html/runs /srv/dumps/xmldatadumps/public/other/enterprise_html/

This took about 80 minutes.
We'll want to make that a regular job that runs once the dumps are downloaded.
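Such a regular pull job could be expressed as a systemd service/timer pair along these lines. Everything here is an illustrative sketch, not the actual puppetized units: the unit names and the schedule are made up, and the rsync arguments are just reused from the one-off command above.

```ini
# Hypothetical /etc/systemd/system/wme-html-rsync.service
[Unit]
Description=Pull Wikimedia Enterprise HTML dumps from the web server

[Service]
Type=oneshot
ExecStart=/usr/bin/rsync -a --bwlimit=160000 \
    labstore1006.wikimedia.org::data/xmldatadumps/public/other/enterprise_html/runs \
    /srv/dumps/xmldatadumps/public/other/enterprise_html/

# Hypothetical /etc/systemd/system/wme-html-rsync.timer
[Unit]
Description=Schedule the Enterprise HTML dumps pull rsync

[Timer]
# Illustrative dates: a couple of days after each twice-monthly download run
OnCalendar=*-*-03,22 04:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

In practice the timer would need to fire only after the download run has finished, so the real schedule (or a condition on the download's completion) matters more than the exact calendar spec shown here.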

We had a chat internally @ WMCS about this, and I think the summary is:

  • we are ready to support this WME work, we are happy to help and coordinate with @ArielGlenn and others over this.
  • current WMCS team members don't have a lot of domain/operational expertise over the dumps service in general.
  • because of this, we may request additional details and clarification when joint work is being planned (sorry in advance!)

That being said, ping any of us over IRC when the next step/operation is scheduled!

I would say a smaller subtask with smaller scope should improve acknowledgement/intake/response speed on our side

My personal feeling is that this could fit very well into our WMCS clinic duty list of tasks: make sure operations on dumps are supported and we are in sync with the other folks, or something like that.

Change 734613 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] puppetize directories for Wikimedia Enterprise dumps

https://gerrit.wikimedia.org/r/734613

Change 734613 merged by ArielGlenn:

[operations/puppet@production] puppetize directories for Wikimedia Enterprise dumps

https://gerrit.wikimedia.org/r/734613

Change 734622 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] add the Wikimedia Enterprise content downloader script

https://gerrit.wikimedia.org/r/734622

Hey WMCS folks, I have a patch (see above) that simply adds the downloader script to puppet on the labstore that is the web server only. At least I think only the web server. No one should feel obligated to review the script itself, though I am happy if someone does or if you want to ask me questions about how it works. But would someone be willing to check that I'm incorporating it into the rest of the setup the way you folks prefer? Thanks in advance!

Starting the 1st of the month run manually from a screen session belonging to ariel, on labstore1006:

python3 ./wm_enterprise_downloader.py --verbose >/var/lib/dumpsgen/html_downloader.log  2>&1

Downloads done, now running manual rsync pulling the files to labstore1007.

The rsync has completed and the dumps are now available. Note that for two runs we have 1.3T of storage used. Let's factor that into our decision of how many runs to keep.

Change 734622 merged by ArielGlenn:

[operations/puppet@production] add the Wikimedia Enterprise content downloader script

https://gerrit.wikimedia.org/r/734622

Change 736441 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] fix up name of enterprise dumps downloader script

https://gerrit.wikimedia.org/r/736441

Change 736441 merged by ArielGlenn:

[operations/puppet@production] fix up name of enterprise dumps downloader script

https://gerrit.wikimedia.org/r/736441

Change 736461 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] add credentials file for downloading enterprise html dumps

https://gerrit.wikimedia.org/r/736461

The above patch adds the credentials needed for downloading, in a format the downloader script understands. The corresponding file has been added to the private repo (but don't trust me; someone please double-check it!), but I don't know whether I need to add "fake" creds for labs/private or not. Can some WMCS folk chime in? Thanks in advance!

Change 736527 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[labs/private@master] add fake enterprise api dumps downloader credentials

https://gerrit.wikimedia.org/r/736527

Change 736527 merged by ArielGlenn:

[labs/private@master] add fake enterprise api dumps downloader credentials

https://gerrit.wikimedia.org/r/736527

Change 737637 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[labs/private@master] Move enterprise dump download creds to more canonical dir

https://gerrit.wikimedia.org/r/737637

Change 737637 merged by ArielGlenn:

[labs/private@master] Move enterprise dump download creds to more canonical dir

https://gerrit.wikimedia.org/r/737637

Change 736461 merged by ArielGlenn:

[operations/puppet@production] add credentials file for downloading enterprise html dumps

https://gerrit.wikimedia.org/r/736461

Change 737854 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] [WIP] add enterprise html dumps downloader settings file and systemd timer

https://gerrit.wikimedia.org/r/737854

Doing a manual run, started it a bit later than the 20th of the month because of the weekend intervening. Running as dumpsgen in a screen session owned by ariel on labstore1006.

The download is complete. Starting the rsync run now in a screen session from the user ariel on labstore1007, running rsync as root.

Change 740632 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] fix up arg processing for Enterprise downloader script

https://gerrit.wikimedia.org/r/740632

Change 740632 merged by ArielGlenn:

[operations/puppet@production] fix up arg processing for Enterprise HTML dumps downloader script

https://gerrit.wikimedia.org/r/740632

Starting the Dec 1 manual download from a screen session owned by ariel on labstore1006.wikimedia.org.

Starting rsync now of downloaded December HTML dumps, from labstore1007 in a screen session owned by ariel.

The rsync is complete and dumps should be available.

Hey @ArielGlenn, we are about to change our authentication to JWT, which will require updates to the downloader script. Do we plan to do any more downloads by the end of this year?

Yes, we do. They go twice a month, every month, on the same dates as the regular SQL/XML dumps. When do you plan to make this update? What will it mean for my little Python script? Bear in mind that I generally need time to fold work like this into my schedule.

We're trying to push it through really quickly, so we are planning to release it this week. Really sorry for the late notice, but I think we can work this out painlessly so we don't need to rush anything.
I don't want to break the download, so I think I can whitelist the IP of the server that makes the download, so the current code will keep working until you transition to the new auth.
Do you have a static IP that I can whitelist?

If we do IP whitelisting, your script can remain unchanged (at least for now).
But in the future you'll basically need to log in with creds that we'll provide to obtain refresh (valid 90 days) and access (valid 1 day) tokens, then add an Authorization: Bearer <access_token> header to your requests; when the access token expires, you'll be able to use the refresh token to get a new one, until that token expires too. I'll share more details about the actual endpoint and everything else with you when it's ready.
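The token lifecycle described above (a ~1-day access token renewed via a ~90-day refresh token) mostly amounts to client-side bookkeeping. Here is a hedged sketch of that bookkeeping; the class and the injected `login`/`refresh` callables are hypothetical, standing in for whatever endpoints WME actually exposes.

```python
import time


class TokenStore:
    """Client-side bookkeeping for the JWT lifecycle described above.
    The actual HTTP calls are injected, since the real WME endpoints
    are not documented here."""

    ACCESS_TTL = 24 * 3600        # access token lifetime, seconds (assumed)
    REFRESH_TTL = 90 * 24 * 3600  # refresh token lifetime, seconds (assumed)

    def __init__(self, login, refresh, clock=time.time):
        self._login = login      # () -> (access_token, refresh_token)
        self._refresh = refresh  # (refresh_token) -> access_token
        self._clock = clock
        self._access = self._refresh_tok = None
        self._access_at = self._refresh_at = 0.0

    def auth_header(self):
        """Return the Authorization header, logging in or refreshing as needed."""
        now = self._clock()
        if self._refresh_tok is None or now - self._refresh_at >= self.REFRESH_TTL:
            self._access, self._refresh_tok = self._login()
            self._access_at = self._refresh_at = now
        elif now - self._access_at >= self.ACCESS_TTL:
            self._access = self._refresh(self._refresh_tok)
            self._access_at = now
        return {"Authorization": f"Bearer {self._access}"}
```

For a twice-monthly batch job, a full login at the start of each run would also be enough; the refresh path mainly matters if a single run outlives the access token.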

Manual download for the 20th started, running as dumpsgen user from a screen session on labstore1006, logged in as ariel.

rsync of downloaded HTML dumps on labstore1007 started in screen session from user ariel, running as root.

Manual rsync completed earlier so this set of downloaded files should now be available both to cloud instances and for web download.

@ArielGlenn Have you had any auth related issues during this run? Just want to be sure that our whitelisting worked. Thanks.

No. If there had been issues I would have seen them right away, i.e. yesterday when starting the run. Thanks for checking though.

Change 749875 had a related patch set uploaded (by RhinosF1; author: RhinosF1):

[operations/puppet@production] Update static html dump to mention enterprise

https://gerrit.wikimedia.org/r/749875

Change 749875 merged by ArielGlenn:

[operations/puppet@production] Update static html dump index.html to mention Wikimedia Enterprise HTML dumps

https://gerrit.wikimedia.org/r/749875

<snip>

If we do IP whitelisting, your script can remain unchanged (at least for now).
But in the future you'll basically need to log in with creds that we'll provide to obtain refresh (valid 90 days) and access (valid 1 day) tokens, then add an Authorization: Bearer <access_token> header to your requests; when the access token expires, you'll be able to use the refresh token to get a new one, until that token expires too. I'll share more details about the actual endpoint and everything else with you when it's ready.

Hey Stephan,

I looked in the available docs on mediawiki.org and also at the GitHub repo https://github.com/wikimedia/OKAPI/tree/master/service but did not find any information about WME's use of JWTs. Can you point me at something that documents how the username and temp password are to be used to get the refresh and access tokens? What algorithm is used, and what components are in the payload? Thanks.

Hey Ariel,

We have not had time to update mediawiki.org yet, so I've sent you an invitation with our credentials. If you follow the steps described in the email, you should be able to reset the temporary password through our dashboard; there you'll find the Authentication API Reference section, where you'll see all the endpoints needed to exchange your credentials for JWT tokens.

The refresh token is active for 90 days, the id and access tokens for 24 hours.
After you finish the development, we can issue creds to another email (I remember you mentioned it in another ticket).
Let me know if you have any questions after reading through the docs, I'm more than happy to help.

I have tried logging in on the page https://dashboard.enterprise.wikimedia.com/ and at https://dashboard-dv.enterprisewikimedia.com/login with both the credentials just sent and the ones from December 15th, and none of them work; I get the error "Username or Password is incorrect". I have tried the "forgot password" link on both dashboards for both usernames, hoping that one of them was set up as linked to my Wikimedia email address, but this appears not to be so; I get the notification "Confirmation code has been sent via email." but no email arrives. (Nothing in spam either.) Can someone have a look? Thanks.

That's strange, looking into it, will let you know what I find.

Revoked all the previous access and sent a new invitation to your email address; before doing that, I double-checked that everything works by sending an invitation to my own email address. You should've received an email with your username and temporary password. By following the link in the email, on first login you will be asked to change your temporary password, and then you can use your permanent credentials to exchange them for JWT tokens. LMK if that does not work.

The new credentials worked for the link in the email. I'll test out the JWT access and let you know. Thanks!

I have now done successful downloads with the retrieved auth tokens. It looks like there is no expectation of a shared secret between server and client for the acquisition of these tokens. Is there a plan to move to that in the future?

Change 755345 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] update wme html dumps downloader to use JWT auth tokens

https://gerrit.wikimedia.org/r/755345

I have now done successful downloads with the retrieved auth tokens. It looks like there is no expectation of a shared secret between server and client for the acquisition of these tokens. Is there a plan to move to that in the future?

It's not something we have planned for the near future, but it's definitely possible. We'll inform you in advance if we decide to do so.

The latest download and rsync are complete, so the HTML dumps from the 20th should now be publicly available.

Change 755345 merged by ArielGlenn:

[operations/puppet@production] update wme html dumps downloader to use JWT auth tokens

https://gerrit.wikimedia.org/r/755345

Change 737854 merged by ArielGlenn:

[operations/puppet@production] [WIP] add enterprise html dumps downloader settings and credentials files

https://gerrit.wikimedia.org/r/737854

Gah, stared and stared at the changes and the puppet compiler output, and forgot to remove the WIP from the commit message. Oh well.

Change 755979 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] add systemd timer for Enterprise HTML dumps download

https://gerrit.wikimedia.org/r/755979

@aborrero would you have time to look at the above patch sometime this week? The goal is that downloads always happen on the web server, and rsyncs are pulled from there to the server(s) that are not the web server. In the case that one server is handling all functions (let's say for maintenance), no rsync would run. But once the host in maintenance mode came back online, an rsync would have to be run manually to bring things back into sync.

@aborrero would you have time to look at the above patch sometime this week? The goal is that downloads always happen on the web server, and rsyncs are pulled from there to the server(s) that are not the web server. In the case that one server is handling all functions (let's say for maintenance), no rsync would run. But once the host in maintenance mode came back online, an rsync would have to be run manually to bring things back into sync.

done!

Change 755979 merged by ArielGlenn:

[operations/puppet@production] add systemd timer for Enterprise HTML dumps download and rsync

https://gerrit.wikimedia.org/r/755979

We keep two runs a month, and I'd like to keep 3 months' worth of runs. How are we on space for that? I think we agreed that this is OK, but I'll check with WMCS folks just to make sure, as it's been a little while and new boxes will soon be on order.
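For the space question, a quick back-of-the-envelope check using the ~1.3T-for-two-runs figure reported earlier in this task (all figures rough):

```python
# Back-of-the-envelope storage estimate for the proposed retention policy,
# based on the ~1.3 TB for two runs reported earlier in this task.
tb_per_run = 1.3 / 2     # ~0.65 TB per run
runs_retained = 2 * 3    # two runs a month, three months kept
total_tb = tb_per_run * runs_retained
print(f"approx. {total_tb:.1f} TB for {runs_retained} retained runs")
```

So six retained runs come to roughly 3.9 TB, before any per-run size growth.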

Change 756596 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] clean up older enterprise html dumps, keep the last 6 runs

https://gerrit.wikimedia.org/r/756596

Change 756596 merged by ArielGlenn:

[operations/puppet@production] clean up older enterprise html dumps, keep the last 6 runs

https://gerrit.wikimedia.org/r/756596