[06:18:14] <inductiveload>	 nuria: is it possible to dig out specific metrics somehow?
[07:04:35] <elukey>	 btullis: o/ - morning! an-worker1106 showed as DOWN in icinga, I acked it so we'll be able to work on it when you are online :)
[07:12:09] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10MoritzMuehlenhoff)
[07:31:40] <btullis>	 elukey: thanks, I'm online now. 
[07:31:54] <elukey>	 good morning :)
[07:32:06] <elukey>	 nothing on fire it is just a host!
[07:32:19] <elukey>	 but if you haven't connected to the serial console etc.. yet we can do it
[07:33:34] <btullis>	 Perfect, thanks. In a Meet, or in IRC chat?
[07:35:36] <elukey>	 we can do it in here if you are ok!
[07:35:56] <elukey>	 i can explain what I usually do
[07:36:08] <elukey>	 do you have your pwstore repo ready with gpg key?
[07:36:09] <btullis>	 Yep, cool. I see the host in Icinga. Do we make a ticket for it?
[07:36:45] <elukey>	 We can, I usually check quickly to see if anything is really broken or if it is transient (like if a reboot fixes)
[07:36:50] <btullis>	 Yes, I have pws ready I think.
[07:37:57] <elukey>	 perfect
[07:38:11] <elukey>	 so in there there are two files that contains passwords that we need
[07:38:24] <elukey>	 1) 'management' -> this is the pass to connect to the serial console
[07:38:40] <elukey>	 2) root_password -> this is to login as 'root' when you have a tty basically
[07:39:09] <elukey>	 I usually just gpg --decrypt management when I need the pass
[07:39:14] <elukey>	 but pws works as well
[07:40:26] <elukey>	 when I have the management pass what I usually do is:
[07:40:35] <elukey>	 1) connect to cumin1001.eqiad.wmnet and tmux/screen
[07:40:58] <elukey>	 2) do an ssh like: ssh root@an-worker1106.mgmt.eqiad.wmnet
[07:41:18] <elukey>	 that should lead to a password prompt, that requires the management password
[07:41:45] <btullis>	 OK, one sec. Trying to recall gpg passphrase under pressure now. :-)
[07:42:02] <elukey>	 once in, you are either in the Dell DRAC or in the HP iLO (IIRC the naming)
[07:42:21] <elukey>	 ahahhaha nono please the contrary of under pressure, nothing is really on fire
[07:45:30] <btullis>	 Phew! Got it. No, it would just be the embarrassment of it.
[07:47:17] <btullis>	 OK, I've got a `racadm>>` prompt.
[07:47:20] <elukey>	 nice!
[07:47:51] <elukey>	 so we have on wikitech a lot of infos about what commands you can write, but the most useful ones for DELL that I use are
[07:48:05] <elukey>	 2) racadm getsel (to get various events etc.. like PSU failure, CPU broken, etc..)
[07:48:11] <elukey>	 err 1) sorry :D
[07:48:15] <elukey>	 2) console com2
[07:48:48] <elukey>	 the latter brings you to the serial console
[07:49:03] <elukey>	 and ctrl+\ should exit it
[07:50:02] <mforns>	 hi team! looking into webrequest alerts
[07:51:01] <btullis>	 Cool. Nothing relevant in the log.  The latest is 09/15/2020. Will start a console now.
[07:52:15] <elukey>	 hola mforns 
[07:52:17] <btullis>	 Kernel spinning. Showing soft CPU lockups. Scrolling a bit quickly to get any more at the moment.
[07:52:28] <mforns>	 :)
[07:52:42] <elukey>	 btullis: yeah exactly, the soft lockup sometimes happens for reasons that I never really understood
[07:53:06] <elukey>	 it is interesting to see the side effects on hadoop metrics too
[07:53:15] <elukey>	 https://grafana-rw.wikimedia.org/d/000000585/hadoop?orgId=1 -> Namenode panel (basically the HDFS master)
[07:53:36] <elukey>	 you will find a "Under replicate blocks" graph
[07:53:40] <elukey>	 *replicated
[07:54:28] <elukey>	 each HDFS block is replicated three times, so after a worker goes down for 10 mins (IIRC) the Namenode doesn't get health checks anymore and flags it as down
[07:54:43] <elukey>	 and asks to the other replicas to stream data to a new (live) one
[07:55:10] <elukey>	 this stops immediately when the node down gets back to life
[07:55:30] <elukey>	 in this case, the usual fix for me is 'racadm serveraction powercycle'
[07:55:34] <btullis>	 Yes I see.
[07:55:48] <elukey>	 I usually !log it on #operations for visibility
[07:56:14] <elukey>	 once the host gets back up, you'll see the replicated blocks metric recovering etc..
[07:56:44] <elukey>	 that's it :)
[07:58:15] <btullis>	 Gotcha, thanks. Will the racadm powercycle do an ACPI graceful shutdown if it can, or is it simply off->on ? I've been more used to Supermicro IPMI and `ipmitool` recently.
[07:58:53] <btullis>	 Also, do we monitor under-replicated blocks in Icinga or anything?
[08:03:15] <btullis>	 Ah, seems my Icinga privileges need updating, because I couldn't add a comment there. I thought I'd done that.
[08:06:42] <elukey>	 btullis: yes it is a graceful shutdown, there is a more brutal restart but I don't recall the syntax
[08:07:14] <elukey>	 for the under-replicated blocks, we haven't in the past but we added a lot of monitors for Namenode metrics (you can find those in puppet)
[08:07:35] <elukey>	 all of them have a related grafana panel + https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts
[08:07:53] <elukey>	 so it is sufficient to click on the links provided by the alarm to know what's going on
[08:09:31] <elukey>	 we have alarms for the more problematic corrupt/missing blocks
[08:09:56] <elukey>	 in theory we can add an alarm for under-replicated blocks, but in my opinion it should be dynamic based on the total number of blocks
[08:10:05] <elukey>	 something like 30/40% or similar
[08:10:31] <elukey>	 but the only use case that I have in mind to trigger the alert would be a ton of workers down
[08:10:42] <elukey>	 and we'd notice it from a lot of other alarms
[08:10:52] <elukey>	 (this is probably why we never added one)
[08:11:36] <btullis>	 OK, fab. Thanks. It's booted cleanly. Under-replicated blocks has dropped immediately to zero, as you said it would. All looks OK. 
[08:13:13] <btullis>	 So in this case no phab ticket required, but just mental note that this known problem has happened once again and required intervention. If it hadn't behaved as expected with a simple reboot, then I would create a ticket for triage. Something like that?
[08:17:47] <elukey>	 yes exactly
[08:18:02] <elukey>	 say a DIMM broken, cpu broken, etc..
[08:18:18] <elukey>	 we have automation (SRE automation I mean) that creates tasks for broken disks
[08:18:27] <elukey>	 but not for the rest (it is a little more difficult)
[08:21:30] <elukey>	 re: ipmitool, I forgot to add a note that we can use it as well
[08:21:41] <elukey>	 there should be documentation on wikitech
[08:27:49] <btullis>	 re: the automation. Is this a cookbook, or something else? I can't seem to find such a script in the cookbooks, nor in the netbox scripts. Wondering if there is another place for sre automation scripts that I should know about.
[08:28:48] <elukey>	 btullis: ah yes it should be in puppet, it is a a check that fires when nagios detects a broken disk
[08:29:19] <elukey>	 https://phabricator.wikimedia.org/T285643 is an example
[08:30:27] <btullis>	 Ah nice. A fully automated automation. :-) Will check that out. Thanks again for all the info.
[08:32:15] <elukey>	 np!
[08:32:32] <mforns>	 !log restarted webrequest bundle (messed up a coord when trying to rerun some failed hours)
[08:32:35] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:49:27] <joal>	 mforns: Hi! Thanks a lot for checking and rerunning - I think I have found the problem leading to more data-loss now that we use gobblin :)
[08:49:41] <joal>	 Adding you as a reviewer mforns 
[09:07:03] <addshore>	 this probably sound evil, but is it possible to run PHP code in hadoop cluster?
[09:07:24] <joal>	 addshore: I'll fake not having seen that message :D
[09:07:36] <addshore>	 xD
[09:07:44] <addshore>	 hear me out! :P https://phabricator.wikimedia.org/T94019
[09:08:14] <joal>	 I definitely hear you - this would a hell lot of sense
[09:08:21] <addshore>	 I was generally thinking that if hadoop had up to date external JSON representations of Wikidata entites, 1) we could generate JSON dumps from there and 2) we could use the same json -> rdf mapping code to make the rdf fumps
[09:08:30] <addshore>	 while keeping the mapping code in Wikibase in PHP
[09:09:26] <joal>	 I'm not sure about how we can get "up to date external JSON representations" - I assume this means getting the wikidata content, right?
[09:09:34] <addshore>	 yeah
[09:09:51] <addshore>	 I was thinking, some event post edit -> kafka -> hadoop? but not sure about that part either?
[09:09:56] <joal>	 ok - we're not yet ready on that front, btu we have a plan (but no resource to put it in practice)
[09:10:08] <addshore>	 okay
[09:10:37] <joal>	 the plan would be: edit--> kafka --> streaming job --> WMF-API(get content) --> store on HDFS
[09:11:11] <addshore>	 aaah okay, rather than having the content in the stream
[09:11:16] <joal>	 And there is one issue: we have incomplete edits in kafka currently, and a task force issue (we don't have the bandwisth)
[09:11:36] <addshore>	 aah, this is the "reliable events" ticket right?
[09:11:41] <joal>	 correct sir
[09:11:50] <addshore>	 it always comes back to that one! :P
[09:12:25] <addshore>	 So https://phabricator.wikimedia.org/T120242 ?
[09:12:35] <joal>	 The WDQS-updater relies on the edits stream, so we could do it (they already do almost exactly what I describe), but IMO it's worth waiting the solution to reliable events (or more reliable should I sa)
[09:12:50] <addshore>	 okay *goes to write a comment*
[09:13:53] <joal>	 this is the ticket yes - I think we could even not do what the ticket advertise as possible solutions as long as the missing events in streams is low enough (currentl at ~1%, far too high)
[09:15:03] <joal>	 And then about running PHP on hadoop - I have not yet seen it done, but I don't see why it would be feasible - The java (or python) haddop worker starts an external PHP process, feeds it the content and gets the result back, then finalizes its process
[09:15:13] <joal>	 not great but doable
[09:15:34] <joal>	 It'll also require having the PHP libs etc on hadoop workers (not the case now)
[09:15:55] <addshore>	 but also with the text diagram you mentioned above, if the content comes from the MW api anyway, then the conversion happens in the app, rather than in haddop land
[09:16:07] <addshore>	 with #worksforme
[09:16:08] <addshore>	 *which
[09:16:28] <joal>	 hm, which app?
[09:16:48] <addshore>	 well, mediawiki / wikibase
[09:17:09] <joal>	 Ah - We can ask the wikibase-API to give us json is that rght?
[09:17:29] <joal>	 Maybe even RDF?
[09:17:53] <addshore>	 json,. rdf, ttl, jsonld, etc
[09:17:59] <joal>	 right
[09:18:13] <joal>	 So a dedicated wikibase job extracting info in streaming could do
[09:18:33] <joal>	 I think that's what the WDQS-updater does
[09:18:37] <addshore>	 indeed, we could even provide multiple formats via a single call for optimizations sake 
[09:18:47] <addshore>	 yes, the streaming updater goes to wikibase and gets the TLL afaik
[09:18:50] <joal>	 yup, call once, save multiple
[09:18:51] <addshore>	 *ttl
[09:26:34] <wikibugs>	 10Analytics-EventLogging, 10Analytics-Radar, 10Discovery, 10Wikimedia-production-error: '.event.pageViewId' should be string, '.event.subTest' should be string, '.event.searchSessionId' should be string - https://phabricator.wikimedia.org/T286814 (10Aklapper) Please don't add project tags as subscribers bu...
[09:26:45] <wikibugs>	 10Analytics-Radar, 10WMDE-Templates-FocusArea, 10WMDE-TechWish-Sprint-2021-07-07: Backfill metrics for TemplateWizard and VisualEditor - https://phabricator.wikimedia.org/T274988 (10WMDE-Fisch) 05Open→03Resolved
[09:45:07] <elukey>	 joal: sorryyyy I was in a meeting, added a comment to you code review for gobblin
[09:48:28] <joal>	 np elukey :) thank you!
[09:52:52] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10aborrero)
[09:54:27] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10aborrero)
[10:08:18] <elukey>	 joal: deployed
[10:08:21] <joal>	 \o/
[10:08:43] <joal>	 thanks a lot elukey - I'm gonna double check next hour
[11:17:44] <joal>	 elukey: all good from camus - many thanks!
[11:17:50] <joal>	 s/camus/gobblin
[11:59:57] <elukey>	 goooood
[13:22:57] <fab>	  Hello. We need to make about ~450GB of data (currently on hdfs, non-PII) available for download for a competition. Can we use the dataset release instructions for this purpose? https://wikitech.wikimedia.org/wiki/Analytics/Web_publication
[13:35:31] <elukey>	 fab: o/ 450GB before replication right?
[13:35:40] <elukey>	 (asking to double check)
[13:37:55] <fab>	 yes without replication. generating ~1gb files, so there would be about 450 files that would end up in the public html directory.
[13:42:02] <elukey>	 would it be for a limited amount of time?
[13:42:07] <elukey>	 I mean not months
[13:42:36] <elukey>	 we currently have space of thorium (where we serve files from) but the host will be decommed and replaced with another one with less capacity
[13:47:17] <fab>	 It is for a limited amount of time, I will check the planned timeline - there are still some open questions on where the different type of datasets will be hosted
[13:49:24] <elukey>	 I am a little worried about network bandwidth for such a big file, if multiple requests come in at the same time thorium will surely go under a little stress easily
[13:49:44] <elukey>	 (plus downloads will become a lot slow)
[13:50:58] <elukey>	 ideally if the dataset is public we could serve it from something like commons/swift/s3, not sure if it is a valid use case or people did it before
[13:59:50] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10ArielGlenn)
[14:03:05] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10MoritzMuehlenhoff)
[14:09:33] <wikibugs>	 10Analytics, 10Event-Platform, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Automate event stream ingestion into HDFS for streams that don't use EventGate - https://phabricator.wikimedia.org/T273901 (10Zbyszko) @Ottomata can this be closed?
[14:19:32] <wikibugs>	 10Analytics, 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10Zbyszko) 05Open→03Resolved a:03Zbyszko Strategy was developed and is being implemented.
[14:35:12] <razzi>	 Hi team, going to start disabling jobs on an-launcher for the hadoop master debian upgrade. In about 30 minutes when it's time to reimage I'll hop in the batcave so btullis and anybody else who wants to can see the commands
[14:36:55] <elukey>	 razzi: good morning! +1 for the cluster drain, but in ~25 mins there will be network maintenance in row D so we need to wait for it to be finished
[14:37:34] <joal>	 question elukey and razzi: wouldn't it be cool to have the cluster drained before the row D maintenance?
[14:38:00] <elukey>	 joal: IIUC this is what Razzi is going to do now
[14:38:22] <joal>	 Great - was not sure if the idea was to postpone draining after the row D
[14:38:48] <elukey>	 nono what I suggested was to wait for the reimage
[14:39:00] <joal>	 perfect thanks :)
[14:40:37] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (10jbond) I'm curious why the intention is to configure this using a analytics-presto.eqiad.wmnet  CNAME instead of  a analytics-pre...
[14:40:50] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 4 host(s) and their services with reason: eqiad row D maintenance ` cp[1...
[14:42:21] <btullis>	 razzi: I'm aware of: https://phabricator.wikimedia.org/T278423#7190372 but out of interest, how is the cluster drained?
[14:42:39] <wikibugs>	 10Analytics-EventLogging, 10Analytics-Radar, 10Discovery, 10Wikimedia-production-error: '.event.pageViewId' should be string, '.event.subTest' should be string, '.event.searchSessionId' should be string - https://phabricator.wikimedia.org/T286814 (10mforns) Uou, yea, my bad. Thanks for the heads up!
[14:43:05] <elukey>	 btullis: first two steps basically
[14:43:11] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10Vgutierrez)
[14:43:44] <elukey>	 we stop all the systemd timers, including the ones importing data, and eventually most of the recurrent jobs will halt (including the oozie ones, since no new data will be available)
[14:44:02] <elukey>	 some other jobs will be left running (say user-create ones etc..)
[14:45:19] <razzi>	 Thanks for answering btullis 's question elukey, still getting my day started but now I'm here!
[14:45:43] <btullis>	 Oh I see. 3rd bullet point. "Wait 30 minutes for applications to gracefully exit". Gotcha, I missed that.
[14:45:58] <razzi>	 Going to start by disabling puppet on an-launcher. Already announced maintenance here and in product-analytics slack channel
[14:46:30] <razzi>	 !log razzi@an-launcher1002:~$ sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster'
[14:46:33] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:47:06] <razzi>	 Now going to stop timers, excuse the !log spam
[14:47:19] <razzi>	 Well actually I don't need to !log every line this time
[14:47:35] <razzi>	 !log Disable jobs on an-launcher1002 (see https://phabricator.wikimedia.org/T278423#7190372)
[14:47:39] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:48:27] <razzi>	 Hmm I accidentally did:
[14:48:28] <razzi>	 razzi@an-launcher1002:~$ sudo systemctl stop drop_event
[14:48:28] <razzi>	 Warning: Stopping drop_event.service, but it can still be activated by:
[14:48:28] <razzi>	   drop_event.timer
[14:48:34] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 1 host(s) and their services with reason: eqiad row D maintenance ` dns1...
[14:48:59] <elukey>	 razzi: you'd need to stop the .timer unit, not the .service ones
[14:49:02] <razzi>	 So then I ran sudo systemctl stop drop_event.timer
[14:49:02] <razzi>	 I hope I didn't mess anything up by stopping the service rather than the timer, but should b efine
[14:49:03] <razzi>	 yep
[14:49:17] <razzi>	 Now we wait :)
[14:49:34] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10Vgutierrez)
[14:49:51] <elukey>	 razzi: gobblin timers are still up :)
[14:50:17] <elukey>	 also best to stop analytics-reportupdater-logs-rsync.timer too
[14:51:12] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 1 host(s) and their services with reason: eqiad row D maintenance ` lvs1...
[14:51:14] <razzi>	 oh welcome to the party gobblin!
[14:51:37] <razzi>	 !log sudo systemctl stop analytics-reportupdater-logs-rsync.timer
[14:51:39] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:52:42] <razzi>	 !log sudo systemctl stop 'gobblin-*.timer'
[14:52:44] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:54:21] <razzi>	 Here is the list of running apps, need this to get to 0: https://yarn.wikimedia.org/cluster/apps/RUNNING
[14:54:25] <razzi>	 Currently 18
[14:55:30] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney)
[14:56:33] <elukey>	 network maintenance for row D is about to start
[14:59:28] <razzi>	 Alright, one more thing we should do before network maintenance: disable yarn queues https://gerrit.wikimedia.org/r/c/operations/puppet/+/705698
[14:59:39] <razzi>	 Then we can chill out for a bit
[15:00:13] <razzi>	 elukey: do you mind reviewing that patch?
[15:05:17] <razzi>	 It's pretty low risk, I've done a patch like that before, I'm going to self +2
[15:05:32] <topranks>	 hey all
[15:05:42] <topranks>	 I see you are doing things there - thanks!
[15:05:49] <topranks>	 let me know when we are good to go :)
[15:06:08] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney)
[15:06:15] <razzi>	 AFAIK we're good to go now, yarn queues are disabled
[15:06:20] <razzi>	 elukey: any opinion?
[15:06:46] <btullis>	 topranks: Agree. good to go.
[15:07:11] <elukey>	 razzi: sorry I was afk
[15:07:21] <elukey>	 +1
[15:07:24] <fab>	 Thanks elukey, that doesn't sound like a great solution for our use case. Can you elaborate on the bandwidth and the expected problems? Is the https://dumps.wikimedia.org/ hosted on the same servers, or could we possibly use that? A large part of the data are commons image bytes, which we eventually want to offer for bulk download on our servers as well.
[15:07:25] <razzi>	 cool cool
[15:07:27] <topranks>	 Ok great - thanks :)
[15:07:39] <topranks>	 Updates in #wikimedia-sre, hopefully be uneventful
[15:08:21] <razzi>	 !log sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
[15:08:22] <elukey>	 fab: the files rsynced from stat100x end up on a single node, called "thorium", that has a 1G NIC
[15:08:24] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:08:28] <razzi>	 Still had to apply puppet changes, all set now
[15:08:41] <razzi>	 thanks for checking in topranks !
[15:09:09] <elukey>	 fab: the host is not designed to serve a lot of big files and scale accordingly, this is why I was mentioning the bw issue
[15:10:09] <elukey>	 razzi: all the applications running are user-related ones, so they will likely stay up unless you kill them
[15:10:50] <elukey>	 and mjolnir jobs may take a long time to complete sigh
[15:10:57] <elukey>	 see https://yarn.wikimedia.org/cluster/app/application_1623774792907_176079
[15:11:27] <razzi>	 elukey: should we kill them now or wait for row D maintenance to end, or does it not really matter?
[15:11:36] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney)
[15:11:43] <fab>	 elukey agreed, that is not a good solution for this use case.  are you aware of any infrastructure that is suitable for this at wmf, maybe dumps.wikimedia.org?
[15:11:56] <elukey>	 razzi: it doesn't really matter, you can proceed with killing the jupyter notebooks in theory
[15:12:55] <razzi>	 Ok cool so just keep making progress so when maintenance is done we can reimage
[15:13:08] <razzi>	 Safe progress :)
[15:14:01] <elukey>	 fab: not sure what it would be best, do you have a tight deadline for this? I can ask if Swift may be used, or something similar.. 
[15:14:12] <elukey>	 fab: best is probably to open a task
[15:15:53] <elukey>	 razzi: I'd also follow up in the #search channel about mjolnir
[15:16:31] <razzi>	 sg re: mjolnir
[15:16:45] <razzi>	 looking at the running applications, I don't see any jupyter ones...
[15:17:10] <razzi>	 unless wmfdata-yarn-regular is the generic name for that sort of thing
[15:17:32] <elukey>	 exactly yes, in theory it should be a notebook
[15:18:54] <razzi>	 Is there a way to know for sure? I guess I could stop the jupyter processes themselves
[15:19:09] <razzi>	 but I'd rather not, stick to the plan and just touch yarn jobs themselves
[15:20:00] <elukey>	 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter-SWAP#The_easiest_way:_wmfdata
[15:20:59] <elukey>	 I am pretty sure those are notebooks
[15:21:34] <elukey>	 anyway, those are probably not actively running anything
[15:21:44] <btullis>	 fab: elukey: This might be a silly idea, but would bittorrent be useful? I can see that we *unofficially* seed some dumps from tools.wmflabs.org here: https://meta.wikimedia.org/wiki/Data_dump_torrents
[15:21:46] <elukey>	 the main problem are the search jobs
[15:22:42] <elukey>	 btullis: it looks a promising way, never heard/done it before!
[15:24:35] * joal likes the bittorrent idea!
[15:25:53] <dcausse>	 mjolnir* jobs are OK to be killed as they might still run for long
[15:25:56] <razzi>	 elukey: checked in with dcausse in search and we're good to stop mjolnir
[15:25:57] <razzi>	 yep yep
[15:26:16] <razzi>	 to your knowledge elukey can we kill everything and proceed?
[15:26:19] <dcausse>	 going to stop flink
[15:26:43] <elukey>	 dcausse: <3
[15:26:45] <razzi>	 I'm in the batcave now for anybody who wants to hang out
[15:27:15] <topranks>	 We're done with our change, negligible impact - "almost too quiet" - but looks like it went in ok.
[15:27:53] <joal>	 fab: I'm assuming that dumps.wikimedia.org servers would be the easiest, but it also is limited in term of bandwidth - users can only download 2 files at a time
[15:28:40] <elukey>	 razzi: let's make sure that the non-notebook jobs are not running, the rest should be killable (but I'll defer the final word to joal)
[15:28:48] <joal>	 fab: if the use case is to allow a lot (define a lot?) of people to download this amount of data fast (define fast), maybe a different approach is needed
[15:29:22] <dcausse>	 flink is stopped, everything remaining on analytics-search can be killed
[15:31:37] <razzi>	 joal: can you comment on if any jobs in https://yarn.wikimedia.org/cluster/apps/RUNNING need to stay running? Feel free to join me and btullis in the batcave to discuss :)
[15:31:49] <joal>	 joining!
[15:37:00] <razzi>	 !log kill yarn applications: for jobId in $(yarn application -list | awk 'NR > 2 { print $1 }'); do yarn application -kill $jobId; done
[15:37:03] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:39:12] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (10BTullis) @jbond - I like the DNS Discovery idea in principle, but that page seems to me to suggest that it is more geared up for...
[15:42:07] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (10BTullis) I like the look of the new PKI methods. I'll try that and tag you for code review @jbond. Thanks.
[15:42:29] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (10jbond) @BTullis yes the page is definitely written with multi site in mind but AFAIK it works fine with just one site.   Either w...
[15:43:05] <addshore>	 o/ any idea if we need to wait for https://phabricator.wikimedia.org/T286655 to be done before we could merge the refered to schemas?
[15:44:18] <razzi>	 All but 2 jobs have been killed; analytics-search ones are strangely not going away when we sudo -u analytics kerberos-run-command analytics yarn application -kill them
[15:45:52] <razzi>	 joal was able to kill them as a member of analytics-admin group :)
[15:47:08] <razzi>	 apparently "analytics" is not not a member of "analytics-admins" !?
[15:47:13] <razzi>	 We can circle back on this later
[15:47:49] <elukey>	 razzi: analytics is not a superuser
[15:47:51] <elukey>	 hdfs is
[15:48:04] <elukey>	 analytics is a member of analytics-privatedata-users
[15:48:11] <razzi>	 tried running the command as hdfs too, same issue
[15:48:15] <elukey>	 (and it shouldn't be a super user)
[15:48:42] <elukey>	 have you tried with 'yarn' ?
[15:48:46] <elukey>	 sudo -u yarn etc.. ?
[15:49:01] <elukey>	 it is probably due to the capacity scheduler's ACLs
[15:49:40] <razzi>	 joal tried with yarn, and I guess there's no yarn keytab so it failed
[15:49:54] <razzi>	 but yeah he showed me the xml and I guess we should make a patch for it
[15:49:59] <razzi>	 the config xml, where acls are
[15:50:14] <elukey>	 the yarn keytab is where yarn runs, an-master and workers
[15:50:24] <elukey>	 so kerberos-run-command for yarn needs to be execute on those nodes
[15:50:50] <razzi>	 gotcha
[15:51:08] <razzi>	 going to proceed with enabling safe mode, yarn is already empty
[15:52:15] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) All works complete, no signs of any issues really, I had no ping loss on 16 pings towards 2 hosts connected off each member switch.  Very h...
[15:52:35] <razzi>	 !log sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
[15:52:37] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:52:43] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (10elukey) The discovery records may be a good path forward, this task is following what we did for analytics-hive.eqiad.wmnet. We h...
[15:52:49] <wikibugs>	 (03PS3) 10MewOphaswongse: Add a link: Update schema to support edit mode toggle [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/704402 (https://phabricator.wikimedia.org/T278115)
[15:54:27] <elukey>	 razzi: how did it go?
[15:54:46] <elukey>	 from the logs I didn't see explosions
[15:55:04] <razzi>	 oh !! I forgot to checkpoint :)
[15:55:21] <elukey>	 also, 1002 is active right now
[15:55:31] <elukey>	 did we ever fallback to 1001 ?
[15:55:46] <razzi>	 no, I never switched back
[15:55:49] <razzi>	 probably should have
[15:56:09] <elukey>	 let's saveNamespace
[15:56:30] <elukey>	 and then copy the data from 1002 to the backup host
[15:56:50] <razzi>	 yep yep
[15:57:02] <razzi>	 !log sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
[15:57:04] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:59:20] <razzi>	 Save namespace successful for an-master1001.eqiad.wmnet/10.64.5.26:8020
[15:59:22] <razzi>	 \o/
[15:59:30] <razzi>	 still running though
[15:59:34] <razzi>	 don't want to celebrate too soon...
[15:59:50] <razzi>	 ok! Save namespace successful for an-master1002.eqiad.wmnet/10.64.21.110:8020
[16:00:55] <elukey>	 good
[16:01:01] <elukey>	 let's backup :)
[16:03:14] <razzi>	 yep!
[16:03:15] <razzi>	 !log root@an-master1002:/srv/hadoop/name# tar -czf /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current
[16:03:17] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:04:09] <mforns>	 fdans: standup?
[16:07:59] <razzi>	 tar is done
[16:10:53] <razzi>	 !log razzi@cumin1001:~$ sudo transfer.py an-master1002.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz stat1004.eqiad.wmnet:/home/razzi/hdfs-namenode-fsimage
[16:10:55] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:11:09] <razzi>	 I had the wrong host in the comment but I fixed it before running (should be 1002)
[16:11:33] <razzi>	 transfer should take a few minutes, will post when it's done
[16:14:52] <razzi>	 ok transfer is done
[16:15:12] <razzi>	 going to stop hadoop processes on an-master1001
[16:15:22] <razzi>	 (after standup, which is just about done)
[16:18:57] <razzi>	 !log sudo systemctl stop hadoop-hdfs-namenode
[16:19:01] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:19:03] <razzi>	 on an-master1001
[16:19:31] <razzi>	 !log razzi@an-master1001:~$ sudo systemctl stop hadoop-yarn-resourcemanager
[16:19:33] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:19:49] <razzi>	 !log razzi@an-master1001:~$ sudo systemctl stop hadoop-hdfs-zkfc
[16:19:51] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:23:43] <razzi>	 I think I forgot to disable puppet, so the namenode process started again; re-disabled puppet and stopped it again
[16:23:56] <razzi>	 !log sudo systemctl stop hadoop-hdfs-namenode on an-master1001
[16:23:58] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:25:01] <razzi>	 !log sudo systemctl stop hadoop-yarn-resourcemanager on an-master1001 again
[16:25:05] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:25:30] <razzi>	 !log sudo systemctl stop hadoop-hdfs-zkfc.service on an-master1001 again
[16:25:33] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:25:57] <razzi>	 !log razzi@an-master1001:~$ sudo systemctl stop hadoop-mapreduce-historyserver
[16:25:59] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:26:06] <razzi>	 ok all the namenode processes are stopped again
[16:26:56] <icinga-wm>	 PROBLEM - Hadoop ResourceManager on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Resourcemanager_process
[16:27:13] <elukey>	 no downtime :)
[16:27:33] <razzi>	 ok, icinga alert is fine
[16:27:51] <razzi>	 ps auxf | egrep 'hdfs|yarn|hadoop' came up empty!
[16:27:51] <elukey>	 yes but others will come soon if you don't downtime
[16:27:57] <razzi>	 gotcha!
[16:28:06] <razzi>	 Should have put that in the steps
[16:28:14] <icinga-wm>	 PROBLEM - Hadoop HDFS Zookeeper failover controller on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.tools.DFSZKFailoverController https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_ZKFC_process
[16:28:33] <razzi>	 here come the alerts...
[16:29:22] <razzi>	 !log razzi@alert1001:~$ sudo icinga-downtime -h an-master1001 -d 7200 -r "an-master1001 debian upgrade"
[16:29:24] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:29:33] <joal>	 milimetric: an example of working command for unique_devices_daily job --> https://gist.github.com/jobar/239b25c3d8ca9cdf26d51536d4f0208c
[16:31:30] <razzi>	 Running uid script on an-master1001
[16:31:58] <razzi>	 !log sudo bash gid_script.bash on an-maseter1001
[16:32:00] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:33:03] <milimetric>	 joal: you were talking about 0 as the _tid yesterday, but what's up with 13814000-1dd2-11b2-8080-808080808080?
[16:33:35] <joal>	 hehe milimetric :) `Uuid.fromStart(0) --> 13814000-1dd2-11b2-8080-808080808080`
[16:34:27] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:34:38] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:34:41] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:01] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:01] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1070 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:06] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:06] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1106 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:06] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:08] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1120 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:09] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1064 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:09] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1067 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:10] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:11] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1059 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1095 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:13] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:14] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:15] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1109 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:16] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1101 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:17] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:18] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:19] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1094 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:20] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:22] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1058 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:23] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1079 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:26] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1111 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:29] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:38] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:41] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1104 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:41] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:48] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1061 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:51] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:51] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:52] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1096 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:54] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:56] <mforns>	 fdans: I have a couple questions about adding the languages to Wikistats2
[16:35:58] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:59] <elukey>	 razzi: ---^
[16:36:01] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1107 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:36:01] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:36:01] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:36:02] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:36:03] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:36:04] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:36:05] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1130 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:36:06] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1098 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:36:11] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:36:11] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:36:14] <fab>	 elukey joal - after discussing we are now looking to use thorium for a transfer, i.e. there will be few (possibly only one) downloads per file - and after the transfer the data can be removed. Is this a workable solution from your perspective? 
[16:36:17] <fdans>	 mforns: tardis thing? i can go if you pass me the link
[16:36:24] <milimetric>	 https://meet.google.com/kti-iybt-ekv?pli=1&authuser=1
[16:36:27] <milimetric>	 https://meet.google.com/kti-iybt-ekv?pli=1
[16:36:31] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:36:32] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1120 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:36:33] <mforns>	 fdans: ok! https://meet.google.com/kti-iybt-ekv
[16:36:36] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1095 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:36:52] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:36:54] <elukey>	 razzi: ping ?
[16:37:15] <fab>	 btullis, thanks for the bittorrent recommendation, I do like that idea - for now we hope to avoid hosting the data ourselves.
[16:37:18] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:37:21] <razzi>	 hmm
[16:37:24] <joal>	 elukey: we're in the batcave talking about the issue
[16:37:38] <razzi>	 sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet comes back active, so I'm not sure what the problem is
[16:37:39] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1099 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:37:56] <elukey>	 razzi: have you checked after stopping the yarn resource manager on 1001 that 1002 ended up active?
[16:37:58] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:38:04] <joal>	 elukey: an-master1001 processes are down, an-master1002 should have picked - there is something bizarre :(
[16:38:11] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:38:38] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:38:39] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:39:01] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:39:06] <razzi>	 brb bathroom
[16:39:11] <razzi>	 there's a lot of noise but we're in safe mode
[16:39:19] <razzi>	 and backed up :)
[16:39:51] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1108 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:40:54] <elukey>	 joal: the node manager failed to contact an-master1002 for some reasons
[16:40:57] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:40:57] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1097 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:40:58] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1128 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:40:58] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1077 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:41:01] <joal>	 yup elukey 
[16:41:09] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1094 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:41:11] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1058 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:41:22] <joal>	 running  "yarn rmadmin -getServiceState an-master1002-eqiad-wmnet" failed
[16:41:50] <elukey>	 when?
[16:41:54] <joal>	 now
[16:41:57] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:42:31] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1091 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:42:36] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1088 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:42:38] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1059 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:42:38] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1092 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:43:17] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:43:18] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:43:18] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:43:18] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:43:19] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:43:58] <razzi>	 elukey: feel free to join us in the batcave if you'd like to listen to us troubleshoot
[16:44:03] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:44:11] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1097 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:44:29] <elukey>	 razzi: I communicated the issue to SRE, please do it next time since people were wondering why we had a storm of alerts
[16:44:38] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1111 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:45:06] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:45:42] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1110 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:45:42] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1062 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:45:43] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:45:54] <joal>	 elukey: problem is due to yarn trying to use HDFS while it is in safemode (for node-labels)
[16:45:55] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:46:16] <joal>	 elukey: I suggest moving out of safemode to let RM recover, then back again
[16:46:26] <joal>	 elukey: any counter opinion?
[16:46:48] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:46:55] <elukey>	 joal: +!
[16:46:56] <elukey>	 +1
[16:47:14] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:48:01] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:49:01] <elukey>	 joal: since alerts are spamming, let's either exit safe mode now or downtime hsots
[16:49:04] <elukey>	 *hosts
[16:49:13] <joal>	 ack elukey - trying to solve
[16:49:14] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1062 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:49:15] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:49:26] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:50:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1136 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:50:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:50:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1105 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:50:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1115 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:50:32] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:50:36] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1107 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:50:42] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1098 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:51:36] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1137 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:52:00] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:52:15] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:52:19] <razzi>	 !log starting hadoop processes on an-master1001 since they didn't failover cleanly
[16:52:21] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:52:42] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:52:48] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:52:52] <elukey>	 razzi: it is sufficient to just exit safemode
[16:53:15] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:53:15] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:53:42] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1105 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:54:05] <razzi>	 elukey: exiting safemode didn't work
[16:54:06] <joal>	 elukey: we're having issues exiting safe-mode
[16:54:11] <joal>	 elukey: shall we force?
[16:54:12] <razzi>	 razzi@an-master1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave
[16:54:12] <razzi>	 safemode: Call From an-master1002/10.64.21.110 to an-master1001.eqiad.wmnet:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
[16:55:04] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:55:05] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:55:11] <razzi>	 Ok, exiting safemode worked just now
[16:55:28] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:55:33] <icinga-wm>	 RECOVERY - Hadoop HDFS Zookeeper failover controller on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.tools.DFSZKFailoverController https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_ZKFC_process
[16:55:35] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:55:40] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1061 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:56:02] <wikibugs>	 10Analytics: Create aggregate alarms for Hadoop daemons running on worker nodes - https://phabricator.wikimedia.org/T287027 (10elukey)
[16:56:29] <elukey>	 razzi: ah ack I didn't know it
[16:56:34] <elukey>	 I created a follow up task --^
[16:56:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:57:43] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:58:01] <joal>	 elukey: new corner case with node-labels :S
[16:58:33] <elukey>	 joal: sneaky corner case, we can switch to files-on-host IIRC
[16:58:38] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:58:50] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1115 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:58:54] <joal>	 elukey: I think that would be great! will try to read about that
[16:59:10] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1065 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:59:56] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1084 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:59:56] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1101 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:59:57] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:00:04] <joal>	 elukey: are you ok with us proceeding on reimaging while not in safemode (due to the problem)?
[17:00:10] <elukey>	 +1
[17:00:12] <joal>	 ack
[17:00:15] <joal>	 thank you a lot :)
[17:00:26] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:00:52] <elukey>	 joal: one thing though - on 1001 all the daemons are up, let's wait for full bootstrap before shutting all down again
[17:01:05] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:01:08] <joal>	 ack elukey - will make sure this is the case
[17:01:12] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1096 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:01:38] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:02:06] <icinga-wm>	 RECOVERY - Hadoop ResourceManager on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Resourcemanager_process
[17:02:06] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1082 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:02:07] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:02:08] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:02:09] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1067 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:02:44] <elukey>	 btullis: if you want to run 'cumin 'A:hadoop-worker' 'run-puppet-agent' -b 10 to speed up recoveries you can do it
[17:03:06] <elukey>	 (forces a puppet run in batches of 10, that brings up the nodemanagers down)
[17:03:32] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:03:52] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:04:23] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1136 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:04:23] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1063 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:04:42] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1137 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:05:27] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:05:42] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:06:15] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1066 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:06:26] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:06:35] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:07:05] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:07:30] <joal>	 razzi: https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=46&orgId=1&from=now-3h&to=now
[17:08:28] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1099 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:08:44] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10Andrew)
[17:08:50] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:08:50] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:08:51] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:08:52] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:11:02] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1106 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:13:12] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1087 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:13:12] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:15:26] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1083 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:15:26] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:17:02] <razzi>	 !log stop all hadoop processes on an-master1001
[17:17:04] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:18:14] <razzi>	 ok good, all processes are stopped, I don't see any alerts exploding
[17:18:47] <razzi>	 razzi@an-master1002:~$ sudo -u yarn kerberos-run-command yarn yarn rmadmin -getServiceState an-master1002-eqiad-wmnet
[17:18:47] <razzi>	 active
[17:19:03] <elukey>	 check also hdfs just to be sure
[17:19:13] <joal>	 ack elukey 
[17:19:39] <elukey>	 but I don't see any java process on 1001 so we should be good :
[17:19:40] <elukey>	 :)
[17:19:52] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1079 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:19:53] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1064 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:21:34] <razzi>	 razzi@an-master1002:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet
[17:21:34] <razzi>	 active
[17:21:59] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1109 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:22:08] <razzi>	 Going to run the gid script again, just in case the messiness earlier caused it to fail somehow
[17:22:13] <razzi>	 then should be good to reimage
[17:22:35] <elukey>	 razzi: not sure if it works after changing 
[17:23:02] <elukey>	 if you ran the script before it should be fine
[17:23:11] <razzi>	 sure thing, I ran it again, and it caused the same output
[17:23:26] <razzi>	 next time I'll be more careful to check that such a script is explicitly idempotent, but I think we're ok this time
[17:23:50] <elukey>	 yes yes no problem, lemme check files
[17:23:53] <razzi>	 cool
[17:24:09] <razzi>	 I'm ready to kick off the reimage, the command will be: sudo -i wmf-auto-reimage-host -p T278423 an-master1001.eqiad.wmnet
[17:24:09] <stashbot>	 T278423: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423
[17:24:22] <razzi>	 this is the dangerous one, so I'll wait for elukey and joal to +1 before I go for it
[17:24:34] <elukey>	 elukey@an-master1001:~$ sudo find /srv/hadoop/name ! -user 903
[17:24:34] <elukey>	 elukey@an-master1001:~$ sudo find /srv/hadoop/name ! -group 903
[17:24:34] <elukey>	 elukey@an-master1001:~$ sudo find /srv/hadoop/name -user 903 | wc -l
[17:24:35] <joal>	 ack :)
[17:24:37] <elukey>	 364
[17:24:39] <elukey>	 elukey@an-master1001:~$ sudo find /srv/hadoop/name -group 903 | wc -l
[17:24:42] <elukey>	 364
[17:24:45] <elukey>	 looks good to me (uid/gid I mean)
[17:25:06] <joal>	 awesome elukey - +1 for reimage?
[17:25:14] <elukey>	 checking one thing
[17:25:17] <joal>	 sure
[17:25:43] <elukey>	 +1 should be ok
[17:25:59] <razzi>	 proceeding with reimage on cumin1001
[17:27:03] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review, 10User-razzi: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts: ` an-master1001.eqiad.wmnet ` The log...
[17:27:07] <razzi>	 !log razzi@cumin1001:~$ sudo -i wmf-auto-reimage-host -p T278423 an-master1001.eqiad.wmnet
[17:27:10] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:37:39] <elukey>	 razzi: all good??
[17:38:05] <razzi>	 elukey: I'm in the console for the reimage now, and it's asking me which disk to partition
[17:38:16] <razzi>	 elukey: I don't remember this step from last time, thought there were fewer prompts
[17:38:24] <razzi>	 wanna take a look? I'm still in the batcave
[17:39:50] <razzi>	 Partitioning method:
[17:39:50] <razzi>	 Guided - use entire disk
[17:39:50] <razzi>	 or
[17:39:50] <razzi>	 Guided - use entire disk and set up LVM
[17:42:17] <elukey>	 razzi: so it should be already pre-compiled, with partitions marked as "K" or "F"
[17:42:32] <razzi>	 hmm I'm not seeing that screen
[17:42:46] <razzi>	 https://usercontent.irccloud-cdn.com/file/1crxegLa/image.png
[17:43:00] <razzi>	 maybe I have to proceed through this menu first?
[17:43:11] <elukey>	 joining bc
[17:43:15] <razzi>	 thx!
[17:53:14] <elukey>	 (going afk for a bit but I'll check later!
[17:53:31] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10Bstorm)
[17:56:15] <razzi>	 Reimage is proceeding smoothly :)
[17:56:26] <razzi>	 Debian GNU/Linux 10 an-master1001 ttyS1
[18:05:27] <razzi>	 an-master1001 is a Hadoop Master (NameNode & ResourceManager) (analytics_cluster::hadoop::master)
[18:05:27] <razzi>	 The last Puppet run was at Tue Jul 20 17:53:30 UTC 2021 (11 minutes ago).
[18:05:27] <razzi>	 Last puppet commit: (44723368f1) RLazarus - scap: Drop never-used 'sqldump' tool
[18:05:27] <razzi>	 Debian GNU/Linux 10 auto-installed on Tue Jul 20 17:51:03 UTC 2021.
[18:05:27] <razzi>	 razzi@an-master1001:~$
[18:05:39] <razzi>	 \\\ \o/ ////
[18:05:50] <razzi>	 ok cool
[18:11:43] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review, 10User-razzi: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-master1001.eqiad.wmnet'] `  and were **ALL** successful.
[18:13:55] <razzi>	 Patch to re-enable yarn queues once we're ready: https://gerrit.wikimedia.org/r/c/operations/puppet/+/705732
[18:21:46] <razzi>	 !log re-enable yarn queues by merging puppet patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/705732
[18:21:48] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:22:25] <razzi>	 !log sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
[18:22:27] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:22:59] <razzi>	 Failover to NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 successful
[18:31:55] <razzi>	 !log razzi@an-master1002:~$ sudo systemctl stop hadoop-yarn-resourcemanager.service
[18:31:57] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:32:02] <razzi>	 to manually transition 1001 to active
[18:32:11] <elukey>	 back!
[18:32:23] <razzi>	 !log razzi@an-master1002:~$ sudo systemctl start hadoop-yarn-resourcemanager.service
[18:32:26] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:32:58] <razzi>	 razzi@an-master1002:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
[18:32:58] <razzi>	 active
[18:32:58] <razzi>	 razzi@an-master1002:~$ sudo -u yarn yarn rmadmin -getServiceState an-master1001-eqiad-wmnet
[18:32:58] <razzi>	 active
[18:33:02] <razzi>	 yayyy
[18:33:26] <razzi>	 going to enable and run puppet on an-launcher, and that's a wrap!
[18:33:29] <elukey>	 razzi: little nit - for yarn it is sufficient to restart the active one to trigger a failover
[18:34:24] <razzi>	 !log razzi@an-master1002:~$ sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
[18:34:24] <razzi>	 Oh ok, as opposed to stop / start huh
[18:34:25] <joal>	 elukey: he's learnt it the hard wa ;)
[18:34:25] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:35:18] <elukey>	 I am checking metrics and an-master1001, so far all good
[18:35:52] <elukey>	 gid/uid all good, partitions good
[18:37:37] <elukey>	 razzi: spark-shell returns me error, the root queue is still stopped
[18:37:42] <razzi>	 !log razzi@an-master1002:~$ sudo -i puppet agent --enable
[18:37:44] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:38:16] <razzi>	 Running puppet on an-master1001 to apply the change to the xml
[18:38:37] <elukey>	 ah ok perfect
[18:39:03] <elukey>	 the refreshQueues should be sufficient for the active one
[18:39:13] <elukey>	 I just noticed that you ran it on 1002, perfect
[18:39:47] <razzi>	 !log razzi@an-master1001:/var/log/hadoop-hdfs$ sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
[18:39:49] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:40:11] <razzi>	 ok I think we're ready to enable puppet on an-launcher
[18:40:21] <razzi>	 good work team!!!!!!
[18:40:37] <razzi>	 !log razzi@an-launcher1002:~$ sudo puppet agent --enable
[18:40:39] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:40:54] <elukey>	 spark shell works for me, +1
[18:41:05] <elukey>	 good job! Finall all on buster :)
[18:41:10] <elukey>	 just in time for bullseye :D
[18:41:19] <elukey>	 going to dinner, ttl!
[18:42:15] <joal>	 bye elukey - thanks a lot for the help :)
[18:42:21] <joal>	 Going to diner soon as well :)
[18:42:23] <razzi>	 bye elukey, great success! enjoy dinner
[18:42:35] <razzi>	 you too joal, thanks for all the support!
[18:43:03] <joal>	 Happy to help razzi - great work :)
[18:47:17] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review, 10User-razzi: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (10razzi) Follow up: change partman to remove -test config, no need to manually confirm the partitions every time since there was no complication
[18:49:21] <razzi>	 Taking a short computer break, but I'll have my phone so ping me if you need me
[18:50:08] <razzi>	 Metrics look healthy and no alarms, but I'll keep checking in periodically for the next hour. Let me know if anything looks amiss!
[18:51:12] <wikibugs>	 10Analytics, 10Analytics-Data-Quality, 10Datasets-Webstatscollector: Add alarms for high volume of views to pages with replacement characters - https://phabricator.wikimedia.org/T117945 (10Milimetric) p:05Low→03Medium
[18:53:06] <wikibugs>	 10Analytics, 10Pageviews-API: Pageviews API should allow specifying a country - https://phabricator.wikimedia.org/T245968 (10Milimetric) Please see my question from T245968#5912524, without a good reason we default to not providing data.
[18:58:14] <mforns>	 !log starting refinery deployment
[18:58:16] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:58:20] <wikibugs>	 10Analytics, 10Analytics-Kanban: Crunch and delete many old dumps logs - https://phabricator.wikimedia.org/T280678 (10Milimetric) Moving back to incoming to triage with Olja
[19:00:39] <mforns>	 joal: I don't see any further refinery changes merged or unmerged since last deploy on the 15th
[19:00:49] <joal>	 Ah?
[19:00:53] <joal>	 checking
[19:01:28] <joal>	 there have been a deploy on the 15th?
[19:01:48] <mforns>	 yes, Andrew did I think
[19:02:09] <joal>	 mforns: checking
[19:03:22] <joal>	 mforns: I confirm the code in on an-launcher1002 - All good :)
[19:03:48] <mforns>	 so no deploy needed right?
[19:04:09] <joal>	 if the only change was sqoop, no need :)
[19:04:16] <mforns>	 ok, thaaanks joal :]
[19:04:22] <joal>	 thank you mforns :)
[19:19:18] <joal>	 ok stopping for diner - first gobblin-webrequest job still running, so jobs not yet unlocked - should happend soon
[19:30:16] <icinga-wm>	 PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:41:58] <icinga-wm>	 PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:46:30] <icinga-wm>	 PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:02:27] <icinga-wm>	 RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:07:07] <icinga-wm>	 RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:15:05] <icinga-wm>	 RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:23:27] <joal>	 systems have recovered after data ingestion caught up - almost done
[20:30:12] <joal>	 !log rerun webrequest timed-out instances
[20:30:14] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log