[10:05:43] !log reboot an-worker1127 - hdfs datanode caused CPU stalls
[10:05:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:06:17] rebooted an-worker1127 folks, I think that the hdfs datanode was down since a couple of days, the blocks were replicated elsewhere
[10:15:41] RECOVERY - Check systemd state on an-worker1127 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:02] all up and running, just forced puppet
[10:17:33] RECOVERY - puppet last run on an-worker1127 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:23:50] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[10:24:53] there are around ~200 blocks --^
[10:25:10] it is probably the fact that an-worker1127 was in an unclean state, in the past the blocks fixed themselves
[10:25:13] I'll check later :)
[12:08:51] elukey@an-master1001:~$ sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks
[12:08:54] Connecting to namenode via https://an-master1001.eqiad.wmnet:50470/fsck?ugi=hdfs&listcorruptfileblocks=1&path=%2F
[12:08:57] The filesystem under path '/' has 0 CORRUPT files
[12:09:12] so the metric is a jmx stale value probably, hopefully it will clear soon :)
[12:13:50] (HdfsCorruptBlocks) resolved: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
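
Note: the check done at 12:08 (compare the firing alert against what fsck reports) could be scripted roughly like this. This is only a sketch: the function names `corrupt_file_count` and `alert_looks_stale` are made up for illustration, and the only input format assumed is the fsck summary line quoted in the log above. It does not run fsck itself, it just parses text that was already captured.

```python
import re
from typing import Optional

# Matches the summary line printed by `hdfs fsck / -list-corruptfileblocks`,
# e.g. "The filesystem under path '/' has 0 CORRUPT files" (as seen in the log).
CORRUPT_SUMMARY = re.compile(r"has (\d+) CORRUPT files")


def corrupt_file_count(fsck_output: str) -> Optional[int]:
    """Return the corrupt-file count from fsck output, or None if the
    summary line is not present in the text."""
    match = CORRUPT_SUMMARY.search(fsck_output)
    return int(match.group(1)) if match else None


def alert_looks_stale(alert_firing: bool, fsck_output: str) -> bool:
    """Hypothetical helper: True when the alert is firing but fsck reports
    zero corrupt files, i.e. the metric is probably a stale value."""
    return alert_firing and corrupt_file_count(fsck_output) == 0


# Sample taken from the 12:08:57 line of the log.
sample = "The filesystem under path '/' has 0 CORRUPT files"

if __name__ == "__main__":
    print(corrupt_file_count(sample))
    print(alert_looks_stale(True, sample))
```

If the summary line is missing (for example because the command failed), `corrupt_file_count` returns None and `alert_looks_stale` returns False, so a failed fsck is never read as "all clear".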