[07:01:49] Good morning
[07:56:32] helloooo
[09:54:58] (CR) Joal: [V: +2 C: +2] "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/699757 (https://phabricator.wikimedia.org/T284885) (owner: Gerrit maintenance bot)
[09:55:55] (CR) Joal: [V: +2 C: +2] "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/698524 (https://phabricator.wikimedia.org/T284450) (owner: Gerrit maintenance bot)
[13:03:24] hewooo
[13:04:49] Hi ottomata
[13:43:44] Analytics, Event-Platform: jsonschema-tools should fail if new required field is added - https://phabricator.wikimedia.org/T263457 (Ottomata) https://github.com/wikimedia/jsonschema-tools/pull/30
[13:43:47] Analytics, Event-Platform: Schema compatibility check for changing event schemas fails when adding to the middle of an array - https://phabricator.wikimedia.org/T270470 (Ottomata) https://github.com/wikimedia/jsonschema-tools/pull/30
[13:43:59] Analytics, Analytics-Kanban, Event-Platform: jsonschema-tools should fail if new required field is added - https://phabricator.wikimedia.org/T263457 (Ottomata) a: Ottomata
[13:44:15] Analytics, Analytics-Kanban, Event-Platform: Schema compatibility check for changing event schemas fails when adding to the middle of an array - https://phabricator.wikimedia.org/T270470 (Ottomata) a: Ottomata
[14:23:58] Analytics, Analytics-EventLogging, Better Use Of Data, Event-Platform, and 4 others: KaiOS / Inuka Event Platform client - https://phabricator.wikimedia.org/T273219 (SBisson) @nshahquinn-wmf yes, I think we can close this.
[14:28:17] hio razzi elukey am here! :)
[14:28:47] Hi ottomata! Here as well
[14:29:56] o/
[14:33:39] Alright, starting to disable jobs on an-launcher1002
[14:34:55] There are many, not going to log all of them
[14:35:32] !log disable jobs that use hadoop on an-launcher1002 following https://phabricator.wikimedia.org/T278423#7094641
[14:35:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:35:52] check at the end that systemctl list-timers show the expected ones (so no extra that is still running etc..)
[14:35:58] *shows
[14:40:06] Looks good to me!
[14:43:56] I also stopped analytics-reportupdater-logs-rsync.timer just to be sure, the rest is good!
[14:57:36] Alright, I'm about to set the queue state to stopped via https://gerrit.wikimedia.org/r/c/operations/puppet/+/699943
[14:58:56] k!
[15:01:23] !log sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues to stop queues
[15:01:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:02:04] !log disable puppet on an-masters
[15:02:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:02:38] razzi: did you run puppet to update yarn-site.xml before refreshQueues?
[15:02:43] yep!
[15:03:13] because from yarn.w.o's scheduler the queue state is still RUNNING
[15:03:37] Hmmm
[15:04:15] trying a spark2-shell, it should fail in theory
[15:04:21] nope
[15:04:27] so the queues are not stopped :(
[15:04:42] Good catch, let me see why that might be
[15:04:56] oh
[15:05:06] I did puppet merge but forgot to run puppet agent
[15:05:11] :)
[15:05:16] that would explain it
[15:06:14] Where do you see scheduler state on yarn.w.o ?
[15:07:05] in the left menu' -> Scheduler -> queue-name ->
[15:07:34] "Application Queues" -> open dropdown -> etc..
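For reference, the queue-stop sequence being worked through above, condensed into a minimal sketch. The commands are the ones logged; the catch that bit here is that puppet-merge alone does not ship the new capacity-scheduler.xml to the masters, the agent has to run first:

    # on an-launcher1002: confirm only the expected timers remain scheduled
    systemctl list-timers
    # on each an-master host: apply the merged puppet change, then reload queue config
    sudo run-puppet-agent
    sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
    # verify on yarn.wikimedia.org: Scheduler -> queue -> "Application Queues"
    # should now show a queue state other than RUNNING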
[15:08:58] !log run puppet on an-masters to update capacity-scheduler.xml
[15:09:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:09:08] !log disable puppet on an-masters
[15:09:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:11:09] !log sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
[15:11:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:11:54] Ok this is good, I see Queue State: DRAINING
[15:14:03] good :)
[15:14:24] the final step is to run a spark2-shell --master yarn, it should fail (just as confirmation)
[15:17:47] Yep, it failed (and not just due to missing kinit :)
[15:18:18] gooood
[15:18:32] we can proceed
[15:18:41] Ok, now we have to kill remaining yarn applications
[15:19:36] we can even think to keep them alive
[15:19:45] we go in safe mode, and then we proceed from there
[15:20:23] if you go in safe mode while they are alive
[15:20:27] won't some of them fail?
[15:20:32] mmm there is one notebook that is taking a lot of the cluster
[15:21:20] ottomata: the queues are already stopped, in theory they may not be able to create containers now (but I am not super sure about it). With safe mode they'll get an error if trying to write anything to hdfs yes
[15:21:40] anyway, we can kill all, seems safer
[15:21:41] :)
[15:23:09] razzi: are you going to kill the running apps?
[15:23:19] Yep
[15:23:21] super
[15:23:23] let's do it
[15:25:48] !log kill running yarn applications via for loop
[15:25:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:26:20] Need a sudo
[15:27:07] sudo kerberos-run-command yarn yarn application -kill application_1620304990193_91010
[15:27:07] worked
[15:27:22] Cluster is empty, enabling safe mode
[15:27:47] !log sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
[15:27:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:28:09] Ok, the next step is the one that failed before: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
[15:28:33] I'm going to take a quick snack break then will run that on an-master1001
[15:29:20] ok
[15:32:30] Alright, I'm back, here goes
[15:33:02] Metrics at https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1
[15:33:16] command launched?
[15:33:36] !log sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
[15:33:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:35:22] Save namespace successful for an-master1001.eqiad.wmnet/10.64.5.26:8020
[15:35:23] razzi: I don't see horrors in the logs, did it work?
[15:35:25] \o/
[15:35:26] woooooooowwwwwww
[15:35:28] * elukey dances
[15:35:33] \o/ \o/
[15:36:37] Save namespace successful for an-master1002.eqiad.wmnet/10.64.21.110:8020
[15:36:37] as well
[15:36:40] ok cool!
[15:37:39] we can proceed with copying the fsimage over to a backup node
[15:37:50] and at this point we are free to reimage 1002
[15:38:11] cool
[15:38:36] !log backup /srv/hadoop/name/current to /home/razzi/hdfs-namenode-snapshot-buster-reimage-2021-06-15.tar.gz on an-master1001
[15:38:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:39:01] NICE!
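The "for loop" from the !log above was not itself captured; a plausible reconstruction built only from commands that do appear in the log (yarn application -list / -kill under kerberos-run-command), followed by the safe-mode and checkpoint steps exactly as logged:

    # kill every application still known to the ResourceManager
    for app in $(sudo kerberos-run-command yarn yarn application -list 2>/dev/null | awk '/^application_/ {print $1}'); do
        sudo kerberos-run-command yarn yarn application -kill "$app"
    done
    # with the cluster empty, freeze HDFS and checkpoint the namespace
    sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
    sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace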
[15:40:02] storage on `/` on an-master1001 is going to get pretty high while creating this backup
[15:40:22] razzi: yes let's create the backup under /srv
[15:40:26] there is plenty of space
[15:40:36] Ohhh I see now
[15:40:40] Should I cancel it?
[15:40:42] yep
[15:40:56] ok, removing the incomplete .tar.gz
[15:41:01] and clean up, so we don't risk any weird trouble in /
[15:41:10] yep yep
[15:42:32] Analytics, User-razzi: hdfs dfsadmin saveNamespace fails on an-master1001 - https://phabricator.wikimedia.org/T283733 (elukey) Open→Resolved a: elukey After the changes to heap size and service handler workers we were able to run saveNamespace successfully!
[15:42:54] !log tar -czf /srv/hadoop/backup/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current on an-master1001
[15:42:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:45:25] Ok, this step is going to take a little while, tar is writing ~400M / minute and /srv/hadoop/name/current is 9G
[15:45:53] So hang tight for about 15 more minutes
[15:45:53] Then I'll start the transfer, also slow
[15:46:09] In the meantime, as elukey said, we can reimage 1002
[15:46:49] razzi: nope :)
[15:47:03] we can start the reimage only when we have a safe backup
[15:47:10] not while we are taking a backup :D
[15:47:16] ok, I was thinking about that too :)
[15:47:57] 15 mins of waiting time is fine!
[15:48:08] so in the meantime, what are the next steps?
[15:48:16] (so we can prep)
[15:48:45] going to run uid script, and need to make sure no services are running that could cause it to fail
[15:49:00] on 1002
[15:49:43] Right. Looks like I left out steps to actually stop these services: hadoop, yarn, etc
[15:49:49] yep perfect
[15:51:53] Ok backup is done (compression is quite good, 9G turned into 2.6G)
[15:53:50] let's move them to a backup host
[15:55:17] * elukey bbiab
[15:55:19] !log sudo transfer.py an-master1001.eqiad.wmnet:/srv/hadoop/backup/hdfs-namenode-snapshot-buster-reimage-2021-06-15.tar.gz stat1004.eqiad.wmnet:/home/razzi/hdfs-namenode-fsimage
[15:55:21] \o/ ! successful saveNamespace - awesome work elukey and razzi :)
[15:55:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:55:30] * joal dances with elukey :)
[15:55:33] thanks joal ! Time to celebrate
[16:01:21] Wow, transfer.py is already done! So fast!!!
[16:01:58] razzi: just to be sure, do something like tar --list (IIRC it works) on the backup node to inspect the files
[16:02:21] you can also unpack and check size etc.., just so we are 100% sure
[16:02:24] then we can proceed
[16:03:45] script + wmf-reimage for 1002 (and then we have to join the mgmt console to verify and acknowledge the partition scheme)
[16:04:49] untarring, will confirm it looks right shortly
[16:05:03] yep, the partman stuff I'm not excited for
[16:09:07] everything looks right on backup on stat1004
[16:09:15] ack
[16:09:20] Time to stop hadoop services on an-master1002
[16:09:52] I'm going to downtime host on icinga
[16:10:48] +1
[16:11:07] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (ops-monitoring-bot) Icinga downtime set by razzi@cumin1001 for 60 days, 0:00:00 1 host(s) and their services with reason: Update operating system to bullsey...
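The backup, copy and verification steps in one place. Paths and commands are from the log; the working directory for tar is an assumption, since the logged archive captures the relative path "current" under /srv/hadoop/name:

    # on an-master1001: archive the namenode metadata under /srv, not /
    cd /srv/hadoop/name
    sudo tar -czf /srv/hadoop/backup/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current
    # copy the archive off-host before touching an-master1002
    sudo transfer.py an-master1001.eqiad.wmnet:/srv/hadoop/backup/hdfs-namenode-snapshot-buster-reimage-2021-06-15.tar.gz stat1004.eqiad.wmnet:/home/razzi/hdfs-namenode-fsimage
    # on stat1004: inspect the archive before declaring the backup safe
    tar --list -f hdfs-namenode-snapshot-buster-reimage-2021-06-15.tar.gz | head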
[16:11:30] !log downtime an-master1002
[16:11:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:12:08] Ok, I thought the -D was minutes, not days... but I'll be sure to end the downtime when maintenance is over
[16:14:43] !log sudo systemctl stop hadoop-* on an-master1001, then realize I meant to do this on an-master1002, so start hadoop-*
[16:14:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:15:26] I'm glad we're in maintenance mode, that would have been messy if there were running applications
[16:16:20] !log sudo systemctl stop 'hadoop-*' on an-master1002
[16:16:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:16:26] razzi: wait wait
[16:16:33] ?
[16:16:52] did you just stop hadoop* on 1001 no? we need to wait some mins for hdfs to recover
[16:16:56] please don't proceed with 1002
[16:17:14] did you check https://grafana-rw.wikimedia.org/d/000000585/hadoop?orgId=1 ?
[16:17:59] No, I didn't check, my bad
[16:18:38] ok so it seems that 1001 is active according to getServiceState, and 1002 is already down.. so let's wait a bit before proceeding
[16:18:45] for metrics to recover
[16:18:46] yep, just checked that, will wait
[16:19:04] I think that we may be out of safemode now, but it is not a problem
[16:20:42] razzi: mistakes can happen, don't worry, we are here to learn :)
[16:20:48] :)
[16:21:16] Thanks for making room for mistakes elukey, I'm definitely learning!!
[16:23:04] razzi: everybody makes mistakes! One thing that I learned over time is to double check all commands before executing every time, to avoid that feeling of trying to complete the work asap
[16:23:58] all right so metrics are basically recovered, and logs are good on 1001
[16:24:19] it is active as expected, etc..
[16:25:44] razzi: if you think it is ok to proceed (check logs metrics etc..) then +1 from my side
[16:25:53] remember to run the script on 1002 with tmux
[16:26:03] or screen, as you prefer
[16:27:10] team tmux!
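Given the mix-up above, a pre-flight check worth running before any systemctl stop on a master: confirm which NameNode is active. The getServiceState syntax is the one used later in this log; the an-master1001 variant is inferred by analogy:

    # expect "active" for 1001 and "standby" for 1002 before stopping 1002
    sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
    sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet
    # only then, on the standby host:
    sudo systemctl stop 'hadoop-*'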
[16:28:44] razzi: if you want I can check with you the mgmt console for partman later on (you drive I answer questions if you need anything :) [16:28:44] Metrics look stable, will run the uid script [16:29:02] yeah that'd be helpful [16:30:10] PROBLEM - Hadoop NodeManager on an-worker1101 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:12] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:16] PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:19] Hmm [16:30:24] PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:24] PROBLEM - Hadoop NodeManager on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:26] PROBLEM - Hadoop NodeManager on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:27] PROBLEM - Hadoop NodeManager on an-worker1105 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:27] PROBLEM - Hadoop NodeManager on analytics1077 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:28] PROBLEM - Hadoop NodeManager on an-worker1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:31] is yarn up? 
[16:30:32] PROBLEM - Hadoop NodeManager on an-worker1104 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:34] PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:36] PROBLEM - Hadoop NodeManager on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:37] PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:40] PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:40] PROBLEM - Hadoop NodeManager on an-worker1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:41] PROBLEM - Hadoop NodeManager on analytics1061 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:42] PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:44] PROBLEM - Hadoop NodeManager on analytics1070 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:44] PROBLEM - Hadoop NodeManager on an-worker1100 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:50] PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:50] PROBLEM - Hadoop NodeManager on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:54] PROBLEM - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:58] PROBLEM - Hadoop NodeManager on analytics1074 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:04] PROBLEM - Hadoop NodeManager on an-worker1130 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:06] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:07] PROBLEM - Hadoop NodeManager on an-worker1132 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:08] PROBLEM - Hadoop NodeManager on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:08] PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:12] PROBLEM - Hadoop NodeManager on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:12] PROBLEM - Hadoop NodeManager on an-worker1095 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:12] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:13] PROBLEM - Hadoop NodeManager on an-worker1098 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:14] PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:16] PROBLEM - Hadoop NodeManager on an-worker1094 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:17] PROBLEM - Hadoop NodeManager on an-worker1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:17] PROBLEM - Hadoop NodeManager on analytics1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:18] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:19] PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:20] PROBLEM - Hadoop NodeManager on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:21] PROBLEM - Hadoop NodeManager on an-worker1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:22] PROBLEM - Hadoop NodeManager on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:25] elukey: if you know what to do, I'm a bit lost at the moment. 
Tried to run uid script, didn't have sudo so it failed, might have caused some grave issue [16:31:26] PROBLEM - Hadoop NodeManager on analytics1059 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:27] PROBLEM - Hadoop NodeManager on analytics1062 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:28] PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:28] PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:34] PROBLEM - Hadoop NodeManager on an-worker1120 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:37] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:38] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:38] PROBLEM - Hadoop NodeManager on an-worker1097 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:45] razzi: where did you run the script, on 1002 ? 
[16:31:46] PROBLEM - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:48] PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:48] PROBLEM - Hadoop NodeManager on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:48] PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:50] (to confirm) [16:31:50] PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:52] elukey: yes [16:31:54] PROBLEM - Hadoop NodeManager on an-worker1128 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:56] PROBLEM - Hadoop NodeManager on analytics1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:04] PROBLEM - Hadoop NodeManager on an-worker1107 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:04] PROBLEM - Hadoop NodeManager on an-worker1137 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:07] PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:10] PROBLEM - Hadoop NodeManager on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:18] RECOVERY - Hadoop NodeManager on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:18] PROBLEM - Hadoop NodeManager on an-worker1111
is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:20] razzi: so first thing first, alert in #operations that we are on it [16:32:25] and in #sre [16:32:27] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:32] PROBLEM - Hadoop NodeManager on analytics1058 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:48] PROBLEM - Hadoop NodeManager on analytics1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:54] RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:33:02] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:33:06] !log restart hadoop-yarn-resourcemanager on an-master1001 [16:33:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:33:52] RECOVERY - Hadoop NodeManager on an-worker1137 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:34:07] RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:34:14] RECOVERY - Hadoop NodeManager on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:34:41] was rm stopped on 1001? [16:34:43] razzi: we should be ok, for some reason the Yarn RM on 1001 was not feeling good (yarn.w.o was also not responding) and all the node managers didn't like it [16:34:43] it looked like no? [16:34:49] yeah very strange... 
[16:35:14] I restarted it, I think that the stop 1001, start, stop 1002 may have get it into confusion state [16:35:24] RECOVERY - Hadoop NodeManager on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:48] hm [16:36:22] RECOVERY - Hadoop NodeManager on analytics1074 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:32] RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:37] RECOVERY - Hadoop NodeManager on an-worker1095 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:38] RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:40] RECOVERY - Hadoop NodeManager on analytics1066 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:44] RECOVERY - Hadoop NodeManager on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:54] RECOVERY - Hadoop NodeManager on an-worker1120 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:37:26] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:37:30] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:37:57] razzi: if you can also make a summary in #sre it would be good [16:38:02] RECOVERY - Hadoop NodeManager on an-worker1099 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:38:17] RECOVERY - Hadoop NodeManager on an-worker1132 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:38:27] RECOVERY - Hadoop NodeManager on an-worker1108 is OK: PROCS OK: 1 process with command name 
java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:38:31] oh i see accidental stop of 1001 [16:38:36] yeah [16:38:36] ok got it [16:38:52] elukey: so there was a delay between the accidental stop and the nodes alerting? [16:38:53] ottomata: we need to add some alarm aggregation for sure as follow up [16:38:57] RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:39:17] RECOVERY - Hadoop NodeManager on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:39:35] razzi: the node managers do some retries to contact the resource manager (any active one) and then they give up [16:40:09] razzi: the recoveries are probably puppet bring up the nodemanagers again (they basically stopped) [16:40:14] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:40:15] ("oh noes not RM, stopping) [16:40:43] razzi: the good thing to do now is to run via cumin (in batches, say 5/10 at the time) run-puppet-agent to A:hadoop-worker [16:40:54] so we'll bring the rest of the Node managers up [16:41:27] RECOVERY - Hadoop NodeManager on analytics1058 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:41:40] razzi: then let's sync about what is the status for 1002 [16:41:57] RECOVERY - Hadoop NodeManager on an-worker1094 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:42:08] we should probably retry the script, if it doesn't work it is not a big deal, we can fix things manually after the reimage [16:42:29] (we can jump on batcave if you want to brainbounce) [16:42:44] RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:42:54] RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:42:54] RECOVERY - Hadoop NodeManager on an-worker1091 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:02] RECOVERY - Hadoop NodeManager on an-worker1092 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:10] RECOVERY - Hadoop NodeManager on an-worker1088 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:50] RECOVERY - Hadoop NodeManager on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:54] RECOVERY - Hadoop NodeManager on analytics1059 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:44:04] RECOVERY - Hadoop NodeManager on an-worker1097 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:44:44] RECOVERY - Hadoop NodeManager on an-worker1111 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:45:28] razzi: ping :) [16:45:29] elukey: let's brain bounce [16:45:34] RECOVERY - Hadoop NodeManager on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:45:42] RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:46:07] RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:46:48] RECOVERY - Hadoop NodeManager on an-worker1100 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:47:37] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:00] RECOVERY - Hadoop NodeManager on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:14] RECOVERY - Hadoop NodeManager on analytics1062 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:32] RECOVERY - Hadoop NodeManager on an-worker1084 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:34] RECOVERY - Hadoop NodeManager on an-worker1083 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:48] RECOVERY - Hadoop NodeManager on an-worker1107 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:02] RECOVERY - Hadoop NodeManager on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:18] RECOVERY - Hadoop NodeManager on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:28] RECOVERY - Hadoop NodeManager on an-worker1087 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:47] RECOVERY - Hadoop NodeManager on an-worker1098 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:51:27] RECOVERY - Hadoop NodeManager on analytics1060 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:51:34] RECOVERY - Hadoop NodeManager on an-worker1101 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:51:50] RECOVERY - Hadoop NodeManager on an-worker1105 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:51:54] RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:06] RECOVERY - Hadoop NodeManager on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:20] RECOVERY - Hadoop NodeManager on analytics1063 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:27] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:40] RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:53:21] !log run uid script on an-master1002 [16:53:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:53:44] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:53:57] RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:54:14] RECOVERY - Hadoop NodeManager on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:54:42] RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:54:54] RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:55:12] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (ops-monitoring-bot) Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts: ` an-master1002.eqiad.wmnet ` The log can be found in...
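The batched puppet runs elukey suggested earlier for bringing the remaining NodeManagers back up, as a single cumin invocation. The A:hadoop-worker alias and the batch size of 5 are from the log; the exact flags are an assumption:

    # from a cumin host: run puppet on the hadoop workers, 5 at a time
    sudo cumin -b 5 'A:hadoop-worker' 'run-puppet-agent'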
[16:55:24] !log sudo -i wmf-auto-reimage-host -p T278423 an-master1002.eqiad.wmnet
[16:55:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:55:27] T278423: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423
[16:55:34] RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:55:36] RECOVERY - Hadoop NodeManager on analytics1061 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:56:20] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:57:18] RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:26:42] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-master1002.eqiad.wmnet'] ` and were **ALL** successful.
[17:29:25] razzi: elukey@an-master1001:~$ sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet
[17:29:28] standby
[17:29:31] Ok great, reimaging is good
[17:29:55] Time to end maintenance?
[17:30:48] yes I think we are good, just remember to set the yarn queues to running
[17:30:56] before re-enabling all
[17:31:57] also let's clean up the backup on stat1004, etc..
[17:38:19] Patch to re-enable queues: https://gerrit.wikimedia.org/r/c/operations/puppet/+/699955
[17:41:40] Re-enabling yarn queues
[17:45:01] !log sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
[17:45:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:45:34] Ok, queues are running, safe mode is off, going to enable timers on an-launcher
[17:45:51] perfect
[17:45:55] !log enable puppet on an-launcher
[17:45:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:46:26] !log remove hdfs namenode backup on stat1004
[17:46:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:47:23] going afk, o/
[17:49:18] elukey: if you're still around, noticed some alerts on icinga for an-master1002
[17:49:36] the host is still downtimed and standby, so no alert, but I'd like to figure out why they are there
[17:55:58] razzi: i'm here
[17:56:09] looking
[17:56:42] huh weird, razzi i just see the Check whether microcode mitigations for CPU vulnerabilities are applied
[17:56:43] one
[17:56:43] right?
[17:57:11] Ok, maybe the others fixed themselves, there was one about the clock
[17:58:57] dunno what is up with that microcode one
[17:59:00] would have to ask moritzm i think
[17:59:09] i think it can be ignored for now
[17:59:17] not a blocker to ending downtime
[17:59:30] ok, it's not going to alert I guess?
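Stepping back a few minutes, the end-of-maintenance sequence from 17:41 onwards, condensed. The safemode leave command is an assumption, since the log only notes "safe mode is off"; the rest mirrors the logged commands:

    # unfreeze HDFS
    sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave
    # with the queue-state patch merged and puppet run on the masters:
    sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
    # on an-launcher1002: re-enable puppet so it restores the timers
    sudo puppet agent --enable
    sudo run-puppet-agent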
[17:59:37] I've seen a microcode one before
[17:59:58] I guess the fix is to reboot
[18:00:11] but I don't want to take out the standby while the cluster is not in safe mode
[18:00:14] so I'll ignore for now
[18:02:03] which host is that?
[18:03:07] moritzm: an-master1002
[18:03:09] just got reimaged
[18:04:26] so the first puppet run installs the microcode package, but it only gets loaded with the next boot
[18:04:42] that's why the reimage script (among other reasons) does a reboot in the end
[18:04:50] maybe it got interrupted and this didn't happen here
[18:05:23] razzi: i think its ok to reboot the standby real quick while not in safe mode
[18:05:34] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (razzi) savenamespace worked :tada: and an-master1002 is running Buster! Next steps: - later this week, failover to 1002 and make sure it can operate as ac...
[18:05:34] i don't think we usually take down the whole cluster for e.g. JVM updates
[18:05:45] but no hurry on it i think
[18:06:21] yeah I'll hold off for now
[18:13:07] Ok maintenance is officially over, a success! I'm going to take a computer break but I'll be around if anything is going wrong
[18:21:14] razzi: NICE JOB!
[18:23:48] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:25:26] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:30:04] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:30:30] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:35:52] Analytics, Event-Platform: jsonschema-tools should allow skipping of repository tests for certain schemas. - https://phabricator.wikimedia.org/T285006 (Ottomata)
[18:35:59] Analytics, Event-Platform: jsonschema-tools should allow skipping of repository tests for certain schemas. - https://phabricator.wikimedia.org/T285006 (Ottomata) p: Triage→Medium
[18:38:39] Thanks razzi :)
[18:38:58] uop, looking into druid loading alerts...
[18:42:08] mforns: I think these errors come from the fact that the timers have restarted before the cluster was ready
[18:42:17] mforns: thanks for looking into it
[18:42:26] looking at logs
[18:43:34] journal logs don't give any info, the error was in Druid ingestion
[18:45:56] druid logs: "errorMsg": "java.io.FileNotFoundException: File /tmp/druid-indexing/wmf_netflow/2021-06-15T180036.552Z_5366f07bcfcd4e05908700b34404f2c2/segmentDescriptorInfo does not exist."
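The debugging pattern for these unit alerts, using the first failed loader as the example (unit name from the alerts above; the journalctl/systemctl invocations are the generic ones for timer-driven units on an-launcher1002):

    # see why the unit failed (here the journal only pointed at the Druid side)
    sudo journalctl -u eventlogging_to_druid_editattemptstep_hourly -n 50
    # re-run the failed loader; the icinga unit-status check recovers once it succeeds
    sudo systemctl restart eventlogging_to_druid_editattemptstep_hourly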
[18:48:34] restarting eventlogging_to_druid_editattemptstep_hourly, to see if it succeeds
[18:51:42] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:52:06] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:54:31] mforns: do you get github review requests?
[18:54:38] i have a couple of jsonschema-tools PRs i'm looking for review
[18:54:54] including one that will remove that extra step to comment out that test ignore rule for analytics legacy schemas when doing the migration
[18:56:12] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:57:50] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:58:34] ottomata: I think so, let me check
[18:59:12] ottomata: ah! github ones?
[18:59:27] yup
[18:59:34] ottomata: yes, I get them on my personal email
[18:59:39] just received one
[18:59:43] https://github.com/wikimedia/jsonschema-tools/pull/30
[18:59:46] https://github.com/wikimedia/jsonschema-tools/pull/31
[19:00:06] yea I see them
[19:00:27] will look
[19:01:38] ty!
[19:01:41] BTW, ottomata qq: I'm trying to access the Airflow API from within an Airflow DAG, but run into auth issues, not sure what to do there, is the API set up to be queried?
[19:02:16] oh! i think we did not enable it
[19:02:19] it could be enabled
[19:02:22] you trying to get fannncycyy
[19:02:23] hehe
[19:02:29] will look into that
[19:02:34] xD, no just trying options for Refine DAG
[19:02:53] for delayed data ingestion
[19:03:27] see if a task can be cleared from the API, so it is re-run
[19:06:26] mforns: are you trying on an-test-coord?
[19:06:34] ottomata: yes
[19:07:41] mforns: how can I test?
[19:12:14] nm got one
[19:12:14] curl -H 'Content-Type: application/json' -H 'Accept: application/json' http://localhost:8600/api/v1/config
[19:15:14] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:19:46] ottomata: do the api gateway and restbase stuff end up in webrequest?
[19:20:02] addshore: everything is in webrequest, just likely with less info
[19:20:25] ack
[19:21:02] this simple count query ends up being pretty fast in hadoop which is nice :)
[19:22:13] including things like wdqs, beacons / events, rest.php and api.php (so missing restbase and api gateway) it is 17.4k rps for some arbitrary hour. adding restbase thats ~20k rps
[19:22:22] *30k rps
[19:22:37] ok mforns you should be able to access api now
[19:32:21] ottomata: thanks a lot!!
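For the task clearing mforns was after, Airflow 2's stable REST API exposes a clearTaskInstances endpoint; a sketch against the test instance (host and port from the curl above; the DAG and task names are made-up placeholders):

    # dry_run=true only reports which task instances would be cleared
    curl -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' \
      http://localhost:8600/api/v1/dags/my_refine_dag/clearTaskInstances \
      -d '{"dry_run": true, "task_ids": ["my_refine_task"], "start_date": "2021-06-15T00:00:00Z", "end_date": "2021-06-15T23:59:59Z"}'
    # set dry_run to false to actually clear the tasks so the scheduler re-runs them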
[20:09:26] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:32:40] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-web1001 - https://phabricator.wikimedia.org/T281787 (Cmjohnson)
[20:33:16] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-web1001 - https://phabricator.wikimedia.org/T281787 (Cmjohnson) a: Cmjohnson→RobH @robh the onsite work for this server is completed
[22:22:08] Analytics, Analytics-EventLogging, Better Use Of Data, Event-Platform, and 4 others: KaiOS / Inuka Event Platform client - https://phabricator.wikimedia.org/T273219 (nshahquinn-wmf) Open→Resolved
[22:22:11] Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Team-Backlog, Epic: Event Platform Client Libraries - https://phabricator.wikimedia.org/T228175 (nshahquinn-wmf)
[22:22:13] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 5 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (nshahquinn-wmf)