
Archive /home/ezachte data on stat1007
Closed, ResolvedPublic

Description

/home/ezachte on stat1007 is 687G. Most of that is in

128G	./wikistats_backup
557G	./wikistats_data

Assuming this data isn't updated or rsynced to stats.wikimedia.org anymore, can we either delete or archive it in HDFS?

Event Timeline

Ottomata renamed this task from Archive /home/ezacthe data on stat1007 to Archive /home/ezachte data on stat1007.Nov 14 2019, 6:25 PM
fdans triaged this task as Medium priority.Nov 14 2019, 6:33 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.
fdans subscribed.

Let's archive this in HDFS

I have just reapplied for server access with John Bond
I was supposed to add the new public key myself at https://phabricator.wikimedia.org/T215790, but I can't even view that ticket as Erik_Zachte (ezachte).
Once I'm back online I will review the folders mentioned here, and comment.

This took a while, as I was totally focused on OpenStreetMap this summer (doing field surveys). :-)

@Erik_Zachte Hi! Gentle ping to see if you have time to review the files during the next days :)

@elukey Hi! I'll get to this in coming days. Thanks for your patience.

So I first looked into the cron processes that are still enabled under /home/ezachte. There are two.

One is running fine (compressing page view counts into daily/monthly zips for 3rd parties).

The other one runs fine up to the rsync step, which fails, so its output hasn't been published since March 26.
See e.g. https://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm

I copied a small part of the bash file to /home/ezachte/wikistats/dammit.lt/bash/test_rsync.sh

Sun Dec 15 12:29:53 UTC 2019
+ cd /home/ezachte/wikistats/dumps/perl
+ rsync -av -ipv4 /home/ezachte/wikistats_data/dumps/out/out_wp/EN/TablesPageViewsMonthly.htm /home/ezachte/wikistats_data/dumps/out/out_wp/EN/TablesPageViewsMonthlyAllProjects.htm /home/ezachte/wikistats_data/dumps/out/out_wp/EN/TablesPageViewsMonthlyAllProjectsOriginal.htm /home/ezachte/wikistats_data/dumps/out/out_wp/EN/TablesPageViewsMonthlyCombined.htm /home/ezachte/wikistats_data/dumps/out/out_wp/EN/TablesPageViewsMonthlyMobile.htm /home/ezachte/wikistats_data/dumps/out/out_wp/EN/TablesPageViewsMonthlyOriginal.htm /home/ezachte/wikistats_data/dumps/out/out_wp/EN/TablesPageViewsMonthlyOriginalCombined.htm /home/ezachte/wikistats_data/dumps/out/out_wp/EN/TablesPageViewsMonthlyOriginalMobile.htm thorium.eqiad.wmnet::stats.wikimedia.org/htdocsEN
opening tcp connection to thorium.eqiad.wmnet port 873
sending daemon args: --server -vvlogDtpre.iLsfxC "--log-format=%i" . stats.wikimedia.org/htdocs
EN (5 args)
@Error: Unknown module 'stats.wikimedia.org'
rsync error: error starting client-server protocol (code 5) at main.c(1666) [sender=3.1.2]
+ exit

Any suggestions?

Read '../' as '/home/ezachte/'

high level folders > 1GB:
A 270G ../wikistats_data/dammit
B 203G ../wikistats_data/dumps
C 138G ../wikistats_backup/
D 120G ../wikistats_data/squids
E 2G ../wikistats_data/mediacounts

As I said before, one of the active cron jobs runs OK, but archiving its output (see A) into hdfs does not.

It is supposed to copy the huge bz2 files to hdfs and then update a local list of checked-in hdfs contents,
so that after inspection of that list old bz2 files can be removed manually, say once a year.

I have now made a partial version of that bash file, which only updates the local list of checked-in hdfs contents:

stat1007:~/wikistats/dammit.lt/bash/test_hdfs.sh
with line:
hdfs dfs -ls -R [remote path] > [local path]

The long list of errors starts with "No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]"
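
For reference, the intended flow of the full bash file is roughly the following (a minimal sketch; the HDFS destination and file names are placeholders, not the script's actual paths):

# copy a monthly bz2 aggregate into HDFS (placeholder paths)
hdfs dfs -put /home/ezachte/wikistats_data/dammit/pagecounts-YYYY-MM.bz2 /wmf/data/archive/pagecounts-ez/
# then refresh the local list of what has been checked into HDFS
hdfs dfs -ls -R /wmf/data/archive/pagecounts-ez/ > /home/ezachte/wikistats_data/dammit/hdfs_contents.txt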

Read '../' as '/home/ezachte/'

high level folders > 1GB:
A 270G ../wikistats_data/dammit
B 203G ../wikistats_data/dumps
C 138G ../wikistats_backup/
D 120G ../wikistats_data/squids
E 2G ../wikistats_data/mediacounts

About D: these are csv files which were collected daily (up until 2015) from 1:1000 sampled 'squids' page view/edit logs (the term 'squids' is now outdated). Their only potential use is for in-depth analysis of historic view/edit patterns. These are highly granular data (e.g. with view counts per 15 minutes per country) with many metrics (but unfortunately, in an earlier high-priority analysis, they proved inadequate for the question at hand).

These daily csv files have also been zipped into yearly zip files (see C), which take up 79 GB of the 128 GB of backup space.

Recommendation: archive, for 'you never know when data archeologists take notice (again)'. The yearly zip files in ../wikistats_backup should suffice. (A rough sketch of these steps follows after this list.)
1. Somehow verify the yearly zips in C are still valid and contain all data
2. If OK, delete daily csv files from D
3. Move yearly zips from D to hdfs
4. Delete yearly zips from D
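
A minimal sketch of these steps, assuming the yearly zips follow a squids_*.zip naming pattern and assuming an HDFS destination like /wmf/data/archive/user/ezachte/squids (both are placeholders):

# 1. verify each yearly zip is intact, using zip's built-in CRC test
for z in /home/ezachte/wikistats_backup/squids_*.zip; do
    unzip -t "$z" > /dev/null || echo "CORRUPT: $z"
done
# 3. copy the verified zips to HDFS (placeholder destination)
hdfs dfs -mkdir -p /wmf/data/archive/user/ezachte/squids
hdfs dfs -put /home/ezachte/wikistats_backup/squids_*.zip /wmf/data/archive/user/ezachte/squids/
# 2. and 4. only after the HDFS copies are confirmed, delete the local daily csv files and the yearly zips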

@Erik_Zachte we are enabling kerberos on the whole cluster, so there's another layer of authentication. The announcement went out on the Analytics list and it's happening today; that's why you'll get some nasty error messages that contain something like what you pasted above. To authenticate with kerberos, you'll need to ask Luca to enable it for your user, but I'm not sure you need that if we just want to archive stuff. Let me know, and I can point you in the right direction.
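
For reference, once a Kerberos principal exists for the user, the authentication step before running HDFS commands is roughly this (the realm name is an assumption):

kinit ezachte@WIKIMEDIA                         # obtain a ticket-granting ticket; prompts for the Kerberos password
klist                                           # confirm a valid TGT is now present
hdfs dfs -ls -R [remote path] > [local path]    # hdfs commands should now authenticate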

I think I should not be involved here.

It seems I added @Error inadvertently. Is that a bot? Or a playful nickname?

@Milimetric thanks for heads-up.

Above I mentioned two still-active data streams with issues. Here is the first; the second follows in a separate post.

The daily updates to the page view reports are produced normally, as they have been for 11 years. Yet they haven't been published since Feb 2019, as the rsync fails.
Incidentally, this report is mentioned in https://wikitech.wikimedia.org/wiki/Analytics/Wikistats/Deprecation_of_Wikistats_1 as 'should still work after the change'.
Sadly, in that sense it hasn't worked for 10 months already, even though only a tiny effort is required to fix it.

I don't see anything in Wikistats 2 that provides this overview of all wikis per project on one page.
The report is nerdy, for insiders mostly (esp. the table part), but still useful. I could tweak the script to skip the table part, or only show the latest 24 months.

Second issue:

We have pageviews per wiki per page in two forms, and again two versions of the latter:

- chart form: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&pages=Dog
- data files: 
   = hourly  https://dumps.wikimedia.org/other/pagecounts-raw/
   = monthly https://dumps.wikimedia.org/other/pagecounts-ez/merged/

The monthly aggregates are still updated. That's great :-)

background: I consider this the most important data stream from Wikistats 1. It is not about Wikimedia projects per se. It is about what the world at large learned about in our age. I see it as complementary to the Twitter Archive at Library of Congress. If only we had such a treasure trove for data archaeologists from e.g. WW II, it would be used by many scholars. Its importance will grow in coming decades, as the data age and ripen. BTW it was a community project that I took over, as it was better to keep it going on Wikimedia servers.

There is some redundancy in those data files, as the hourly and monthly files are both publicly available (albeit on the same server).
But who would want to download and archive 720 hourly dumps when there is an aggregate version, at less than one percent of the overall size, with no granularity lost?!

So the data gathering and publication are OK. But as for long-term preservation, I'm not so sure about that part, with only a single copy on dumps.wikimedia.org.
That's why I started to back up to hdfs, with its much better redundancy and failover. That hdfs backup part is broken now.

I'd rather see this cron job migrated away from my account, with someone else keeping an eye on it.
The perl part of the job is quite stable. The bash part could use a tweak to make it fully automatic
(check that all files have reached hdfs before deleting monthly aggregates from stat1007).
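
One possible shape for that tweak, as a rough sketch (the paths are placeholders, and a real version might compare checksums rather than file names):

# list what has already been archived in HDFS (placeholder path)
hdfs dfs -ls -R /wmf/data/archive/pagecounts-ez/ | awk '{print $NF}' | xargs -n1 basename | sort > /tmp/in_hdfs.txt
# list the local monthly aggregates (placeholder path)
ls /home/ezachte/wikistats_data/dammit/*.bz2 | xargs -n1 basename | sort > /tmp/local.txt
# delete a local file only if its name also appears in the HDFS listing
comm -12 /tmp/local.txt /tmp/in_hdfs.txt | while read -r f; do
    rm -v "/home/ezachte/wikistats_data/dammit/$f"
done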

Thanks

It seems I added @Error inadvertently. Is that a bot? Or a playful nickname?

You did it again. Please do not.
Don't excuse yourself. Just do not add me.

@Milimetric do you think that we could make progress on this as part of this and the next ops weeks?

Hi Erik, we took a closer look, here are our thoughts:

  1. T238243#5751379: the daily updates to pageview reports (rsync failing). The rsync is failing because we are no longer allowed to push updates. We are leaning towards turning these jobs off because they've been broken for almost a year and nobody has complained so far. We are working on providing more in-depth and big picture overall views into this data from Wikistats 2. But if you feel strongly that we should keep updating those reports in the meantime, we can work on that together.
  2. T238243#5751717: this should have been migrated to Hadoop with T192474 but I'm confused to see we declined that task. I will pick that back up and take care of the migration. That will mean that pagecounts-ez generation will be happening exclusively on Hadoop and the data and process maintained and backed up like all our other jobs. I agree with you this is very important. As part of this I'll also import and generate as much historical data as possible from the archives.

If you agree with all this, then cleanup for us is simple. We will delete any redundant pagecounts data from your directories and archive the rest to HDFS for safekeeping and future data archaeology enjoyment :) Let us know otherwise, we thought of some alternative ideas already.

I have reopened T192474: Migrate pagecounts-ez generation to hadoop. Let's triage it as part of our workflow, but let's not start working on it quite yet.

We are leaning towards turning these jobs off because they've been broken for almost a year and nobody has complained so far.

I also wondered if no-one cared about these reports. So yesterday I asked on Wikipedia Weekly. Some people still care.

https://www.facebook.com/groups/wikipediaweekly/2607209029326912/

We are leaning towards turning these jobs off because they've been broken for almost a year and nobody has complained so far

While the problem is rsync right now, these jobs are neither documented nor puppetized. It is very likely they will break again when we move boxes/users/permissions, so in the absence of a full rewrite I also vote for turning them off. I do not dispute that having all wikis on one page might be a report of interest for some of our users; if so, we should build this UI (from 2015 onwards) from the pageview API data in Wikistats 2.

@Nuria are you saying a fix that might take an hour, if not less, is not done because another issue might pop up in the future? It's not that you're committing for eternity to uphold Wikistats 1. Maintaining the perl scripts has never been expected; I've always been open about these being maintenance-unfriendly.

I'm totally expecting any future config change will break Wikistats 1. All server migrations have done this so far, in ways that wouldn't depend on puppetizing (e.g. a typo in an rsync command erased all timestamps for all migrated data). Other issues were fixed instantly when brought to attention.

As for a replacement, if you don't mind, I'm not holding my breath. What came of earlier promises, made years ago? Ref. https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_report. Don't throw away a working solution when a replacement is still vaporware.

There is another side to the argument that no-one complained about the reports not being updated. I used to be a user of these reports myself, to monitor our traffic. On several occasions I proactively found large anomalies and researched them (e.g. botnets). Has this health aspect of traffic monitoring been removed from WMF?

From the Facebook discussion on Wikipedia Weekly it seems to me some people have become apathetic about what to expect from user feedback. One quote: 'phabricator is where feedback goes to die, so i've lost interest.' That certainly doesn't do justice to the commitment of the Analytics Team. But it might be something to keep in mind when you question user feedback.

We are leaning towards turning these jobs off because they've been broken for almost a year and nobody has complained so far.

I also wondered if no-one cared about these reports. So yesterday I asked on Wikipedia Weekly. Some people still care.

https://www.facebook.com/groups/wikipediaweekly/2607209029326912/

I answered the facebook post; it is good to get feedback from the community. I'd love to see more people proactively reporting missing features in v2 via Phabricator; we'll see if things change in the future :)

@Nuria are you saying a fix that might take an hour, if not less, is not done because another issue might pop up in the future? It's not that you're committing for eternity to uphold Wikistats 1. Maintaining the perl scripts has never been expected; I've always been open about these being maintenance-unfriendly.

I have to disagree on this point, since a proper fix will not take one hour. We have profound respect for Wikistats v1 (for real, I consider it an amazing piece of work), but as you pointed out, maintaining the perl codebase alongside the new v2 one is challenging. The page that we are talking about currently needs:

  1. a cron running on a host, together with data to work on. Your home directory is close to ~600G, and for us it is difficult to figure out what part of it is needed and what is not (we are trying to figure that out with your help in this task :) The ideal solution would probably be to create a system user and use a more up-to-date systemd timer (a rough sketch follows after this list).
  2. an rsync configuration to move the HTML pages from one place to the other, which was disabled for security reasons (this is why the report is currently not up to date). A proper fix would then require us to find another way to get up-to-date reports (probably pulling from thorium? It would need some time to check/investigate).
  3. alerting to figure out when the report stops working, so we don't rely on people pinging us (if that doesn't happen, as in this case, we won't work on it for months).
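
For illustration only, the systemd timer mentioned in point 1 could look roughly like this (the unit names, system user and script path are hypothetical):

# /lib/systemd/system/wikistats1-reports.service (hypothetical)
[Unit]
Description=Generate Wikistats 1 pageview report pages

[Service]
Type=oneshot
# hypothetical system user instead of a personal account
User=wikistats
ExecStart=/usr/local/bin/generate-pageview-reports.sh

# /lib/systemd/system/wikistats1-reports.timer (hypothetical)
[Unit]
Description=Daily trigger for the Wikistats 1 pageview reports

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target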

Plus the engineering time to keep it working over the next months, since little issues can arise, etc. And as you pointed out on fb:

I might add that the reports need tweaking if continued. They have grown too large, with too much granularity. Monthly data per wiki since 2008 is a bit much for 200+ wikis.

I'm totally expecting any future config change will break Wikistats 1. All server migrations have done this so far, in ways that wouldn't depend on puppetizing (e.g. a typo in an rsync command erased all timestamps for all migrated data). Other issues were fixed instantly when brought to attention.

As for a replacement, if you don't mind, I'm not holding my breath. What came of earlier promises, made years ago? Ref. https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_report. Don't throw away a working solution when a replacement is still vaporware.

This is still our commitment; what we are trying to figure out now is what the community and the WMF really need, so we can focus our attention and time on it. This will allow us to concentrate on properly designed and working solutions, avoiding maintaining too many things :)

There is another side to the argument that no-one complained about the reports not being updated. I used to be a user of these reports myself, to monitor our traffic. On several occasions I proactively found large anomalies and researched them (e.g. botnets). Has this health aspect of traffic monitoring been removed from WMF?

I'd say no, since we constantly work on the quality of our data; please see things like T235486. The community also alerts us on the analytics@ mailing list: as you can read, in the last few days we had people reporting anomalies in pageviews, and we promptly reacted and fixed the problem that was brought up. Our current focus (thanks to Nuria's and Marcel's work) is to proactively analyze our data and have automated ways to report anomalies, to be able to spot issues earlier (ones that might be missed by visual review, for example).

From the Facebook discussion on Wikipedia Weekly it seems to me some people have become apathetic about what to expect from user feedback. One quote: 'phabricator is where feedback goes to die, so i've lost interest.' That certainly doesn't do justice to the commitment of the Analytics Team. But it might be something to keep in mind when you question user feedback.

I do agree, but we cannot keep up with all the channels where user feedback appears. I tried to do it on the facebook group today, but I wasn't aware of it before you mentioned it (and it is not scalable for us to read all the comments everywhere). Hopefully keeping up the good work will increase the community's trust in the Analytics team :)

Thanks for the feedback Erik, I really appreciate that you keep helping us!

@elukey thanks for continuing a constructive dialogue.

Fixing the cron job, the rsync rights, and monitoring updates is all in the domain of ops, right? Or should I say bash-related. I can see how I was over-optimistic about doing this in an hour, as, like you say, there are several issues to look into.

But tweaking the functionality and size of the reports, which requires perl maintenance, is something I wouldn't expect anyone but me to do.
I'm less into programming these days, and certainly not aiming for a major coding task (any new work would probably be for OpenStreetMap or my personal art projects).
But producing a leaner version of these page view reports I can still do. I'm thinking of cutting down the number of rows to present, and removing some derived metrics entirely, like the ranking position for each wiki, for each month.

BTW Wikipedia Weekly is one of the best places to learn what our community members find noteworthy. It used to be a lively podcast, initiated by Andrew Lih.
But weekly updates by a core team became a burden, so the new format is now a lively Facebook group. It's often more informal and a bit more lighthearted than e.g. the Wikipedia Signpost.
I wouldn't expect anyone to track stats-related issues there.

I like T235486.

Maybe I have rose-colored glasses on, but I read the facebook comments a couple of times and it seems to me we have plans to address all the concerns:

  • people like the huge table with data for all wikis over the whole period available, back to 2008. We are going to make this available in v2 via the new "big table" design. I consider this part crucial to a "production" release of wikistats v2, and I'm excited to get it done. Then Erik doesn't have to worry about maintaining the old job and we can manage everything as part of the same pipeline.
  • people want all available data to remain available. Totally agreed, we spend most of our resources on this.
  • people want total editor numbers across wikis. This, honestly, is a mistake we made in the original design of the ws v2 data pipeline. We are working on a fix, but again this is a crucial part of a production release.

So I don't think the comments disagree with our planning. The subtext is more like "we like Erik, who the hell are these new people and why are things changing". Which is valid, and if folks want to have a constructive conversation about that I'm happy to have it either here or on fb.

And yeah, some people said they had a hard time with phabricator. Working together is hard, and some folks don't have time or patience for change. That's just a fact of life; unless we can figure out how to clone Erik, with all his community connections and goodwill, we'll have to live with some people being disappointed. If people give us a chance, we're pretty nice too :)

I started a page on Wikistats 1 at https://meta.wikimedia.org/wiki/User:Erik_Zachte/Wikistats%201. Unlike earlier overview pages for Wikistats, this one is focused on where we stand with Wikistats 1 now, in light of the migration to Wikistats 2. What in Wikistats 1 still works? (several crucial data streams). What has been disabled? (some of it prematurely, I would say). What is not within the scope of earlier surveys, but would be a pity to lose altogether? (some of the visualizations). I give credit to new developments, but also make some critical remarks at the end.

@Milimetric you say "we're pretty nice too :)". I totally concur; that goes for all of you, but you in particular I regard as one of the nicest colleagues I have met at Wikimedia.

You also say "some folks don't have time or patience for change." Here I'd like to add my own perspective. The clock for Wikistats 2 started ticking in 2012 when we convened in Berlin and talked about hadoop, and D. said "give us half a year to replace Wikistats", or words to that effect. We now know that was extremely optimistic, given the small team and the complex tasks ahead. I believe it was 2015 before the first results were harvested from hadoop. Then we had surveys in 2015 and 2016 about what people would like to see salvaged, or replaced and improved, from Wikistats 1.

Around 2 years ago I asked @Nuria for a roadmap on Wikistats 2. There wasn't going to be such a document.

Not everyone reads the analytics mailing list frequently; you can see at https://lists.wikimedia.org/mailman/listinfo how many lists we have. So for reaching out to the community and keeping them in the loop, Wikipedia Weekly and the Signpost are useful channels. Wikipedia Weekly is the place where @Asaf announces that Turkish Wikipedia is back online, and I can add a Wikistats 2 chart to show it.

@Erik_Zachte Our roadmap for wikistats is on phabricator, the same as for the many other products we maintain; see the two columns for wikistats: https://phabricator.wikimedia.org/tag/analytics/

We re-triage and map the work quarterly, just like we do for all our products. It is helpful to look at the full picture to gauge the work being done. For example, wikistats reports are not the best way to monitor our traffic or detect anomalies; that work has been moved to automated anomaly reporting, which we are also using for data quality: T235486: Hive data quality alarms pipeline

Don't throw away a working solution when a replacement is still vaporware.

This undermines the efforts made over the last couple of years towards a solution that scales, is available on mobile and desktop, and provides a friendly UI for discovery. Wikistats2 has 2000 unique visitors daily and about 30,000 monthly. We get bug reports filed for it frequently. Now, has it fully replaced wikistats1? No, not yet. But it is a very real effort, by no means vaporware, used by thousands every month.

@Nuria, thanks for the link. I will look at more depth to the task lists later this weekend.

My comments on vaporware weren't about Wikistats 2 as a whole, but foremost about the report on page views per wiki per month, which we were discussing, and for which you said a rewrite could be considered. For that report I still think: don't throw away what works until the replacement has been completed. But I can see how the reference I made to the survey about dump-data-based reports took it wider, for which I apologize. Some of the reports in that survey have been replaced, some have not yet, some never will be, and that might be OK. I am impressed by the number of unique visitors to Wikistats 2 that you quote.

Milimetric moved this task from Paused to Next Up on the Analytics-Kanban board.

(Resetting inactive assignee account)

BTullis edited subscribers, added: odimitrijevic, BTullis; removed: fdans, ezachte.

It's a little over three years since the last update, so I'm revisiting this ticket and I'll try to reach consensus on what to do. We're still talking about 687 GB of data on stat1007 and whether or not to archive it to HDFS.

Apologies in advance if I misunderstand any of the technical details or context surrounding this data. First I'll just try to make sure that I have a clear understanding of how things are set up today; what's working and what isn't.

Current status

As per data.yaml, @Erik_Zachte still has an active shell account, although we don't have any records of recent logins to stat1007.

btullis@stat1007:~$ last ezachte -f /var/log/wtmp.1 
wtmp.1 begins Wed Jun  8 08:22:05 2022
Wikistats 1 & 2

These sites are both served by a single web server named: an-web1001.eqiad.wmnet

None of the data files comprising Wikistats 1 have been updated since 2019-11-12 as indicated by:

btullis@an-web1001:/srv/stats.wikimedia.org/htdocs$ sudo find . -type f -printf "%T+ %p\n" | sort|tail 
2019-04-09+04:00:35.3763458870 ./EN_Artificial/TablesPageViewsMonthlyMobile.htm
2019-04-09+04:00:39.6803634390 ./EN_Artificial/TablesPageViewsMonthlyOriginalMobile.htm
2019-04-09+04:00:43.6243795100 ./EN_Artificial/TablesPageViewsMonthlyCombined.htm
2019-04-09+04:00:47.5723955850 ./EN_Artificial/TablesPageViewsMonthlyOriginalCombined.htm
2019-04-09+04:00:47.6403958620 ./EN_Artificial/TablesPageViewsMonthlyAllProjects.htm
2019-04-09+04:00:47.7083961390 ./EN_Artificial/TablesPageViewsMonthlyAllProjectsOriginal.htm
2019-04-09+04:16:07.0119438100 ./archive/PageViewsPerDayAll.csv.zip
2019-11-12+15:44:03.7780239360 ./index_2019_11_12.html
2020-02-11+16:27:38.2366225780 ./index-v1.html.bak
2020-02-11+16:30:50.8473164870 ./index-v1.html
Dumps

The two machines named clouddumps100[1-2].wikimedia.org have the role dumps::distribution::server assigned.

This role includes the profile: profile::dumps::distribution::datasets::fetcher

...which instantiates the class: dumps::web::fetches::stat_dumps.

This class defines three jobs:

dumps::web::fetches::job { 'wikistats_1':
    source      => "${src}/wikistats_1",
    destination => "${miscdatasetsdir}/wikistats_1",
    delete      => false,
    minute      => '11',
    user        => $user,
}

dumps::web::fetches::job { 'pagecounts-ez':
    source      => "${src}/pagecounts-ez",
    destination => "${miscdatasetsdir}/pagecounts-ez",
    delete      => false,
    minute      => '21',
    user        => $user,
}

# Wiki Loves * (Monuments, Africa, Earth, etc.)
dumps::web::fetches::job { 'media-contestwinners':
    source      => "${src}/media/contest_winners",
    destination => "${miscdatasetsdir}/media/contest_winners",
    delete      => false,
    minute      => '31',
    user        => $user,
}

We can see the rsync commands that are run by these three scripts:

btullis@clouddumps1001:/srv/dumps$ cat /usr/local/bin/dump-fetch-*|grep stat1007
/usr/bin/rsync -rt  --chmod=go-w stat1007.eqiad.wmnet::srv/dumps/media/contest_winners/ /srv/dumps/xmldatadumps/public/other/media/contest_winners
/usr/bin/rsync -rt  --chmod=go-w stat1007.eqiad.wmnet::srv/dumps/pagecounts-ez/ /srv/dumps/xmldatadumps/public/other/pagecounts-ez
/usr/bin/rsync -rt  --chmod=go-w stat1007.eqiad.wmnet::srv/dumps/wikistats_1/ /srv/dumps/xmldatadumps/public/other/wikistats_1

The associated services are running on these hosts and completing successfully.

btullis@clouddumps1001:/srv/dumps$ systemctl status -n 2 dumps-fetch-wikistats_1.service dumps-fetch-pagecounts-ez.service dumps-fetch-media-contestwinners.service 
● dumps-fetch-wikistats_1.service - wikistats_1 rsync job
     Loaded: loaded (/lib/systemd/system/dumps-fetch-wikistats_1.service; static)
     Active: inactive (dead) since Wed 2023-03-29 20:11:01 UTC; 12min ago
TriggeredBy: ● dumps-fetch-wikistats_1.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 561657 ExecStart=/usr/local/bin/dump-fetch-wikistats_1.sh (code=exited, status=0/SUCCESS)
   Main PID: 561657 (code=exited, status=0/SUCCESS)
        CPU: 16ms

Mar 29 20:11:01 clouddumps1001 systemd[1]: dumps-fetch-wikistats_1.service: Succeeded.
Mar 29 20:11:01 clouddumps1001 systemd[1]: Finished wikistats_1 rsync job.

● dumps-fetch-pagecounts-ez.service - pagecounts-ez rsync job
     Loaded: loaded (/lib/systemd/system/dumps-fetch-pagecounts-ez.service; static)
     Active: inactive (dead) since Wed 2023-03-29 20:21:01 UTC; 2min 2s ago
TriggeredBy: ● dumps-fetch-pagecounts-ez.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 564997 ExecStart=/usr/local/bin/dump-fetch-pagecounts-ez.sh (code=exited, status=0/SUCCESS)
   Main PID: 564997 (code=exited, status=0/SUCCESS)
        CPU: 26ms

Mar 29 20:21:01 clouddumps1001 systemd[1]: dumps-fetch-pagecounts-ez.service: Succeeded.
Mar 29 20:21:01 clouddumps1001 systemd[1]: Finished pagecounts-ez rsync job.

● dumps-fetch-media-contestwinners.service - media-contestwinners rsync job
     Loaded: loaded (/lib/systemd/system/dumps-fetch-media-contestwinners.service; static)
     Active: inactive (dead) since Wed 2023-03-29 19:31:01 UTC; 52min ago
TriggeredBy: ● dumps-fetch-media-contestwinners.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 257253 ExecStart=/usr/local/bin/dump-fetch-media-contestwinners.sh (code=exited, status=0/SUCCESS)
   Main PID: 257253 (code=exited, status=0/SUCCESS)
        CPU: 14ms

Mar 29 19:31:01 clouddumps1001 systemd[1]: dumps-fetch-media-contestwinners.service: Succeeded.
Mar 29 19:31:01 clouddumps1001 systemd[1]: Finished media-contestwinners rsync job.

On the source end of these rsyncs, the most recent changes were in September 2020.

btullis@stat1007:/srv/dumps$ ls -l
total 12
drwxrwxr-x 3 ezachte wikidev 4096 Apr 14  2018 media
drwxrwxr-x 4 ezachte wikidev 4096 Apr 14  2018 pagecounts-ez
drwxrwxr-x 3 ezachte wikidev 4096 Jan  8  2019 wikistats_1
btullis@stat1007:/srv/dumps$ sudo find . -type f -printf "%T+ %p\n" | sort|tail
2020-09-18+04:23:51.0000000000 ./pagecounts-ez/merged/2020/2020-09/pagecounts-2020-09-17.bz2
2020-09-19+04:23:34.0000000000 ./pagecounts-ez/merged/2020/2020-09/pagecounts-2020-09-18.bz2
2020-09-20+04:21:33.0000000000 ./pagecounts-ez/merged/2020/2020-09/pagecounts-2020-09-19.bz2
2020-09-21+04:27:29.0000000000 ./pagecounts-ez/merged/2020/2020-09/pagecounts-2020-09-20.bz2
2020-09-23+04:29:55.0000000000 ./pagecounts-ez/merged/2020/2020-09/pagecounts-2020-09-21.bz2
2020-09-23+05:01:33.0000000000 ./pagecounts-ez/merged/2020/2020-09/pagecounts-2020-09-22.bz2
2020-09-24+04:23:50.0000000000 ./pagecounts-ez/merged/2020/2020-09/pagecounts-2020-09-23.bz2
2020-09-25+02:00:24.7681643140 ./pagecounts-ez/projectviews/projectviews-2020.tar
2020-09-25+03:42:03.8343742270 ./pagecounts-ez/projectviews/projectviews_csv.zip
2020-09-25+04:23:32.0000000000 ./pagecounts-ez/merged/2020/2020-09/pagecounts-2020-09-24.bz2

So far, so good. I've established to my own satisfaction that the following published datasets are still correctly synced from /:

What I haven't found, yet, is any further dependence on the files in /home/ezachte on stat1007.

There are no user crontabs or systemd timers on stat1007 that copy or manipulate data from /home/ezachte:

btullis@stat1007:~$ sudo crontab -u ezachte -l
no crontab for ezachte
btullis@stat1007:~$ sudo grep -R ezachte /etc/cron.*
btullis@stat1007:~$ sudo grep -R ezachte /lib/systemd/system
btullis@stat1007:~$

@Erik_Zachte - perhaps you would know best? Are you still using these files? Would you like them retained in local storage on stat1007, or would you prefer them to be archived to HDFS? Would you prefer us to decline this ticket, now that you have regained access to them through your volunteer account?

If anyone else has any suggestions, or feels that I have misunderstood something, please do feel free to let me know.

Gehel lowered the priority of this task from Medium to Low.Dec 6 2023, 2:03 PM

I'm currently creating a tarball of all of the contents of /home/ezachte on stat1007 and I will archive it to /wmf/data/archive/user/ezachte/ on HDFS when complete.
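
For reference, the tarball creation is roughly along these lines (the exact tar invocation is an assumption; the archive is excluded so it doesn't try to include itself):

cd /home/ezachte
tar --exclude='./ezachte-T238243.tar' -cf ezachte-T238243.tar .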

Copying an 807 GB tar file from stat1007 to /wmf/data/archive/user/ezachte on HDFS.

This has now completed.

btullis@stat1007:/home/ezachte$ hdfs dfs -put ezachte-T238243.tar /wmf/data/archive/user/ezachte
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
btullis@stat1007:/home/ezachte$

I'll now delete the user's home directories.

btullis@cumin1002:~$ sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hadoop::master::standby' 'rm -rf /home/ezachte'
19 hosts will be targeted:
an-coord[1001-1004].eqiad.wmnet,an-launcher1002.eqiad.wmnet,an-master[1003-1004].eqiad.wmnet,an-test-client1002.eqiad.wmnet,an-test-coord1001.eqiad.wmnet,an-test-master[1001-1002].eqiad.wmnet,stat[1004-1011].eqiad.wmnet
OK to proceed on 19 hosts? Enter the number of affected hosts to confirm or "q" to quit: 19
===== NO OUTPUT =====                                                                                                                                                                                              
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (19/19) [04:16<00:00, 13.51s/hosts]
FAIL |                                                                                                                                                                            |   0% (0/19) [04:16<?, ?hosts/s]
100.0% (19/19) success ratio (>= 100.0% threshold) for command: 'rm -rf /home/ezachte'.
100.0% (19/19) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
btullis@cumin1002:~$

Change #1029176 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Drop the deprecated dumps fetcher that pulls from stat1007

https://gerrit.wikimedia.org/r/1029176

Change #1029176 merged by Btullis:

[operations/puppet@production] Drop the deprecated dumps fetcher that pulls from stat1007

https://gerrit.wikimedia.org/r/1029176