[06:59:27] good morning, I'm about to reboot krb1001 in ~ 5 minutes, there should be no impact on kerberized services, the only thing that will briefly be unavailable is the kadmin service which is used to create Kerberos principals or change passwords [07:07:42] (SystemdUnitFailed) firing: hadoop-hdfs-namenode.service Failed on an-master1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:08:30] PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:08:57] krb1001 is back up [07:09:16] PROBLEM - Hadoop Namenode - Primary on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [07:12:51] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [07:14:12] RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:14:47] btullis, stevemunene - was there any ops from you on an-master1001 this morning? [07:14:51] about this --^ [07:14:58] RECOVERY - Hadoop Namenode - Primary on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [07:15:11] hadoop-hdfs-namenode-an-master1001.log shows a "Timed out waiting 20000ms for a quorum of nodes to respond" [07:16:13] the timing correlates with the krb1001 reboot, but I haven't seen that error in any previous KDC reboots and it would also be surprising, given that the service can fall back to krb2002 [07:16:30] weird indeed moritzm [07:16:31] joal: checking [07:16:38] unless there is some bug which makes Java or the Namenode code not query the secondary KDC [07:17:42] (SystemdUnitFailed) resolved: hadoop-hdfs-namenode.service Failed on an-master1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:17:49] although actually that seems to be unrelated to the KDC reboot [07:18:48] there is not a lot of info in the logs, the 20s timeout to the quorum nodes is weird, 20 seconds of delay is a lot [07:19:22] elukey: this quorum, is it for kerb or journalnodes?
[07:19:52] hadoop.hdfs.qjournal.client.QuorumJournalManager [07:19:57] yeah journals [07:20:27] ack not related to kerberos then [07:20:27] at 07:05:15 it shows that it waited 6000ms [07:20:41] hm - journalnodes issues I dislike :( [07:20:52] but at the time krb1001 was already back up for approx 30s [07:21:05] just not fully recovered in Icinga [07:21:15] I think it may be related to kerberos, it is not explicitly stated but maybe it failed to authenticate the journal nodes for some reason [07:22:01] what about "(HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. ", is that auto-recovering along with the other services or does it need some followup? [07:22:06] my impression from recent issues is that the namenodes are really big now and they tend to have troubles when something changes [07:22:58] moritzm: "The filesystem under path '/' has 0 CORRUPT files", just checked, it is the (sadly) usual jmx staleness issue [07:23:19] joal: an-master1002 is the leader atm, I'd leave it like that until 1001 fully bootstraps and recovers [07:23:35] ack elukey - thanks for checking [07:24:59] (stepping afk, will check later if needed!) [07:24:59] ack, thanks. let's just keep an eye on it the next time we need to reboot the KDCs [07:47:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1002:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [07:52:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [08:00:02] 10Data-Platform-SRE: Bring Hadoop workers an-worker11[49-56] into service - https://phabricator.wikimedia.org/T343762 (10brouberol) [08:07:33] qq: how can I know in which rack a host I'm ssh-ing on is located? (ex: A/2) [08:08:42] brouberol: various ways, netbox is the authoritative source, but that information is also exported to puppet's hiera, that goes into puppetdb and is also exposed in the MOTD when you ssh [08:08:50] Bare Metal host on site eqiad and rack B6 [08:08:52] for example [08:09:19] gotcha, thanks [08:09:27] so we can also query hosts in a given rack or row via cumin [08:10:34] for reference, the rack ID is rendered into /etc/update-motd.d/50-netbox-location [08:11:17] for VMs it reports the cluster instead [08:11:31] Virtual Machine on Ganeti cluster eqiad and group A [08:16:13] 10Data-Platform-SRE, 10Patch-For-Review: Bring Hadoop workers an-worker11[49-56] into service - https://phabricator.wikimedia.org/T343762 (10brouberol) [08:17:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1002:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [08:22:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old.
- https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [08:22:51] (HdfsFSImageAge) resolved: The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [08:25:29] brouberol: Great work on https://gerrit.wikimedia.org/r/c/operations/puppet/+/956785 - A useful technique you can apply now is how to run the puppet compiler against this patch, simulating a puppet run. [08:25:53] Docs here: https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler [08:27:25] Good to know! Let me have a quick look as soon as I'm out of the Concur Personal Settings maze [08:27:30] Where you added your *real* kerberos principals to the private repo, you'll want to add dummy secrets to the labs/private repo here: https://gerrit.wikimedia.org/r/plugins/gitiles/labs/private/+/refs/heads/master/modules/secret/secrets/kerberos/keytabs/ [08:27:51] (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [08:30:00] joal: Should we go for the namenode fail-back at about 50 minutes past the hour? That's usually a quiet time, isn't it? [08:30:14] Works for me btullis [08:30:50] And we should raise a ticket to investigate what the kerberos issue was. [08:30:56] yup [08:37:51] (HdfsCorruptBlocks) resolved: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [08:40:15] btullis: done [08:41:10] 10Data-Platform-SRE, 10Patch-For-Review: Bring Hadoop workers an-worker11[49-56] into service - https://phabricator.wikimedia.org/T343762 (10brouberol) ` ~/wmf/private master ❯ mkdir -p modules/secret/secrets/kerberos/keytabs/an-worker11{49,50,51,52,53,54,55,56}.eqiad.wmnet/hadoop ~/wmf/private master ❯ touch... [08:41:28] looking at the puppet compiler step now [08:44:08] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Machine-Learning-Team, 10Wikimedia Enterprise, and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10achou) a:03achou [09:02:08] I usually tend to run something like `./utils/pcc 956785 an-master1001.eqiad.wmnet,an-worker1149.eqiad.wmnet` from my workstation. 
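A minimal sketch of that puppet-compiler step, run from the root of a local operations/puppet checkout (the change number and host list below are just the ones from this patch; the wrapper may also need Jenkins credentials configured, so treat the details as assumptions and check its --help / the Help:Puppet-compiler page first):

    # from the root of an operations/puppet checkout; 956785 is the Gerrit
    # change number, the second argument is a comma-separated list of hosts
    # to compile catalogs for
    ./utils/pcc 956785 an-master1001.eqiad.wmnet,an-worker1149.eqiad.wmnet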
[09:08:49] or if you use the Hosts header in the commit message you can also trigger it with a "check experimental" comment in Gerrit [09:09:17] the Hosts header accepts a cumin-like syntax, but usually "Hosts: auto" does the right thing [09:25:06] +1 on what volans said too :-) [09:28:33] semi-related: the puppet repo has a .gitmessage template, if you want to use it: git config commit.template .gitmessage [09:44:23] could I have a 2nd approval on https://gerrit.wikimedia.org/r/c/labs/private/+/956789/? I'm just adding dummy/empty secrets [09:45:11] volans: oh, that's nice to know, thanks [09:47:25] brouberol: You can feel free to +2 your own changes to labs/private without a review, unless you really want one. [09:48:59] 👍 done [09:50:51] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: eventutilities-python: Gitlab CI pipeline should use memory optimized runners. - https://phabricator.wikimedia.org/T346084 (10gmodena) Hey @Antoine_Quhen, Thanks for the pointers. > For Airflow dags, we are using trusted-runn... [09:53:44] joal: I'm looking to run that failback now. Bit later than planned, but should still be OK. [09:53:46] ? [09:54:49] !log btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet [09:54:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:55:11] https://www.irccloud.com/pastebin/rFKKzA1T/ [09:55:22] ack btullis [09:55:29] Failback looks successful. [10:00:52] 10Data-Platform-SRE: Investigate failure of Hadoop namenode coinciding with krb1001 reboot - https://phabricator.wikimedia.org/T346135 (10BTullis) [10:43:35] 10Data-Platform-SRE: Investigate failure of Hadoop namenode coinciding with krb1001 reboot - https://phabricator.wikimedia.org/T346135 (10MoritzMuehlenhoff) [10:43:52] JFTR, made an edit to the task above, the KDC standby is in codfw (krb2002) [10:44:14] ack, thanks moritzm [10:58:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [11:03:51] (HdfsFSImageAge) resolved: The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [11:16:14] 10Data-Platform-SRE, 10Patch-For-Review: Bring Hadoop workers an-worker11[49-56] into service - https://phabricator.wikimedia.org/T343762 (10brouberol) ` brouberol@puppetmaster1001:~$ sudo puppet-merge ... Brouberol: Register hadoop workers an-worker-1149->1156.eqiad.wmnet (1d71a9e090) Merge these changes? (ye... [11:18:12] brouberol: Another useful thing to know about is the Server Admin Log (SAL) https://wikitech.wikimedia.org/wiki/SAL [11:19:40] Most cookbooks automatically log to it, but some IRC channels are also hooked up, so if you start a message with !log it will end up in https://sal.toolforge.org/ [11:20:31] This channel logs to https://sal.toolforge.org/analytics and if you include a Phab ticket slug it will also cross-reference it in phabricator.
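Tying together the earlier Gerrit tips (the Hosts header and the .gitmessage template), a hypothetical commit message for a change like this one might look like the sketch below; the subject line and Bug footer are made up for illustration, and with the header in place a "check experimental" comment on the change triggers the compiler:

    # enable the repo's commit message template once per checkout
    git config commit.template .gitmessage

    # hypothetical resulting commit message:
    #   hadoop: register an-worker11[49-56] as HDFS workers
    #
    #   Hosts: auto
    #   Bug: T343762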
[11:21:25] !log demonstrated the use of SAL for T343762 [11:21:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:21:27] T343762: Bring Hadoop workers an-worker11[49-56] into service - https://phabricator.wikimedia.org/T343762 [11:21:28] oh, this is handy! [11:22:42] #wikimedia-operations is a general purpose place to which cookbooks usually log, but #wikimedia-analytics is the place where we tend to log more about our stuff manually. [11:23:07] ack [11:23:46] speaking of, I think https://phabricator.wikimedia.org/T343762 is done. I'll wait for puppet to run on the hosts until I close it [11:24:49] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Machine-Learning-Team, 10Wikimedia Enterprise, and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10lbowmaker) @AikoChou - we hav... [11:25:29] There's a comment here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/956785/comments/6e9c4652_d786fc21 I think we will need to make sure that the hadoop masters are updated and roll-restarted to see the new datanodes. [11:26:38] Then you can check on the balanced-ness to make sure that the datanodes get used: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Checking_balanced-ness_on_HDFS [11:26:42] (SystemdUnitFailed) firing: hadoop-hdfs-datanode.service Failed on an-worker1153:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:26:52] I could put these notes on the ticket, I suppose. [11:30:05] PROBLEM - Check systemd state on an-worker1153 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:42] (SystemdUnitFailed) firing: (2) hadoop-hdfs-datanode.service Failed on an-worker1151:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:35:37] PROBLEM - Check systemd state on an-worker1149 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:32] FYI, `hadoop-hdfs-datanode.service` wasn't running after a first manual puppet run, but was after the 2nd [11:36:42] (SystemdUnitFailed) firing: (3) hadoop-hdfs-datanode.service Failed on an-worker1149:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:36:43] RECOVERY - Check systemd state on an-worker1149 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:55] ^ my point exactly [11:38:47] Yeah, we could have put these hosts in downtime while we deployed them, I suppose. That would have prevented these alerts. Maybe there is still value in doing so, otherwise they will continue to arrive for the next 20 minutes or so. [11:39:22] You could use the `sre.hosts.downtime` cookbook. 
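A sketch of that downtime step, run from a cluster-management host; the duration and reason flags are assumptions, so verify them with the cookbook's --help before running:

    # on a cluster-management host (e.g. cumin1001); flags are assumptions,
    # check: sudo cookbook sre.hosts.downtime --help
    sudo cookbook sre.hosts.downtime --hours 4 \
      --reason "new hadoop workers still provisioning - T343762" \
      'an-worker11[49-56].eqiad.wmnet'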
[11:41:03] PROBLEM - Check systemd state on an-worker1151 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:42] (SystemdUnitFailed) resolved: hadoop-hdfs-datanode.service Failed on an-worker1149:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:42:10] I set a 4h downtime on an-worker115[0-6].eqiad.wmnet [11:42:43] 👍 [11:45:48] In the meantime, I can see an-worker1149 in the hadoop UI, so that's good. I suppose we'll have to trigger an hdfs rebalance once all workers have joined [11:46:39] Great! There is actually a scheduled rebalance on a systemd timer, I think. [11:46:58] perfect [11:47:06] So we don't need to do that rolling-restart of the namenode processes after all. [11:47:15] and I'm seeing a volume failure for an-worker1086.eqiad.wmnet. Is that something we're tracking? [11:48:15] Ah, it's a RAID controller battery I think. I'm due to make a task for it. [11:48:36] ref: https://wikimedia.slack.com/archives/C02291Z9YQY/p1694025260349459 [11:49:15] RECOVERY - Check systemd state on an-worker1153 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:16] 10Data-Platform-SRE: Bring Hadoop workers an-worker11[49-56] into service - https://phabricator.wikimedia.org/T343762 (10brouberol) It took 2 puppet runs to get `hadoop-hdfs-datanode.service` running on `an-worker1149`. I've set an icinga downtime on the remaining 7 hosts to avoid getting systemd service failure... [11:50:53] RECOVERY - Check systemd state on an-worker1151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:53] brouberol: I have added you as an owner to this list: https://lists.wikimedia.org/postorius/lists/data-engineering-alerts.lists.wikimedia.org You may see a subscription request or something. [11:56:29] PROBLEM - HDFS topology check on an-master1001 is CRITICAL: CRITICAL: There is at least one node in the default rack. https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_topology_check [11:57:15] brouberol ^ ah. It looks like we might have to do a namenode rolling restart after all. [12:05:56] or maybe I did forget one node in the hieradata? [12:07:14] https://www.irccloud.com/pastebin/7rOlnUka/ [12:07:49] I don't think you forgot it. Just checking now. [12:07:50] I was going to paste the same thing [12:07:56] let's RR after all nodes have joined? [12:08:29] Yep, I think that's right. [12:09:04] The net-topology script does the right thing: [12:09:09] https://www.irccloud.com/pastebin/SKuDDr9j/ [12:09:38] SO I think it must just run it once on startup, or something. [12:10:23] 10Data-Platform-SRE: Bring Hadoop workers an-worker11[49-56] into service - https://phabricator.wikimedia.org/T343762 (10brouberol) {F37710141} All 8 workers have registered into the cluster. We now need to rolling-restart the master nodes to assign them to a non-`default` rack. [12:10:49] it's funny that only 5/8 of the new workers were assigned to the default rack [12:11:06] for ex, an-worker1156.eqiad.wmnet was assigned to /eqiad/F/3 [12:12:40] er, yes that is odd. 
I was just wondering whether a `dfsadmin -refreshNodes` would work: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#dfsadmin [12:13:22] That works if you modify the list of excluded nodes (for decommissioning, maintenance etc.) and we recently automated that. [12:13:50] https://github.com/wikimedia/operations-puppet/blob/ba6cd490e047b1ca1893733908e1b7c42c23be83/modules/bigtop/manifests/hadoop/namenode.pp#L82 [12:14:48] Maybe it's worth trying this before doing a roll-restart, in case it works. [12:16:27] The command would be `sudo -u hdfs kerberos-run-command hdfs dfsadmin -refreshNodes` on an-master1001. [12:16:53] let me try that, thanks [12:17:30] (03PS5) 10Peter Fischer: Adapt schema to meet latest requirements. [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315) [12:17:34] hmm, ity' [12:17:56] *it's failing with `FileNotFoundError: [Errno 2] No such file or directory: 'dfsadmin': 'dfsadmin'`. Could there be a typo? I'll check out the --help [12:18:34] Oh yeah, there's probably an extra hdfs required in there, after the first hdfs. [12:19:28] it ran fine, but we still have the 5/8 nodes in the default rack [12:21:01] Yeah, looks like it's not possible: https://community.cloudera.com/t5/Support-Questions/How-to-refresh-HDFS-rack-topology-changes-without-restarting/m-p/291352 but still odd why some get the new topology anyway. [12:22:03] So feel free to run the `sre.hadoop.roll-restart-masters` cookbook. There are some 10 minute pauses in it, to allow the new processes to settle before proceeding. [12:24:01] ack. It's ongoing [12:34:46] still ongoing, but `sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology` does not report any default/rack nodes anymore [12:40:34] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Machine-Learning-Team, 10Wikimedia Enterprise, and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10achou) Hi @lbowmaker, thanks... [12:41:20] 10Data-Platform-SRE: Bring Hadoop workers an-worker11[49-56] into service - https://phabricator.wikimedia.org/T343762 (10brouberol) ` brouberol@cumin1001:~$ sudo cookbook sre.hadoop.roll-restart-masters analytics ` [12:42:31] btullis: the cookbook failed, with a socket timeout on an-master1001.eqiad.wmnet fyi [12:43:00] the step was `Run manual HDFS Namenode failover from an-master1002-eqiad-wmnet to an-master1001-eqiad-wmnet.` [12:44:52] Is there a way to restart a cookbook from the last step? [12:46:03] No, there is no way to restart a cookbook from the last step. The steps we will need are here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Manual_Failover [12:46:31] However, let's wait a while. We've seen this fail-back failure before, I'm afraid. [12:48:59] ack. Seems like we're getting "Connection refused" errors on an-master1001, although all hadoop systemd services are up & running [12:49:50] Give the namenode on an-master1001 time to settle - it's got about 90 million things to load into its heap. [12:50:08] and I'm seeing established connections onto tcp port 8040 on an-master1001 [12:50:18] so maybe "just" a slow boot? [12:52:30] Yeah, it has to catch up on all of the changes while it was out. [12:53:45] The `connection failed` messages have gone away at least.
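While waiting for the namenode to settle, a sketch for checking which one is currently active (service IDs are the ones used in the haadmin -failover command earlier, and -getServiceState is a standard hdfs haadmin subcommand; the wikitech Manual_Failover page above remains the authoritative runbook):

    # run on an-master1001; prints "active" or "standby" for each namenode
    sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
    sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -getServiceState an-master1002-eqiad-wmnet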
[12:53:49] https://www.irccloud.com/pastebin/Fxa5WieE/ [12:54:02] good [12:56:43] What are the next steps at that point? Should we re-run the whole cookbook, or are we satisfied with the fact that all nodes have proper racks? [12:57:13] RECOVERY - HDFS topology check on an-master1001 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_topology_check [13:02:21] I think it's just the failback operation that we need to do now. But let's wait until about 50 minutes past. That's a relatively quiet time on the cluster when fewer pipelines are active. Reduces the chance of another failure a little. [13:03:35] This was the first time that we saw this issue: T310293 - We haven't managed to find a great solution yet. [13:03:35] T310293: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 [13:07:40] * brouberol is taking a break [13:09:23] 10Data-Engineering, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Movement-Insights, and 2 others: Unique Devices seasonal trends on small projects - https://phabricator.wikimedia.org/T344381 (10kai.nissen) I tested on aawiki. Looks good to me. [13:32:03] * brouberol is back [13:36:55] 10Data-Platform-SRE, 10Patch-For-Review: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (10brouberol) [13:42:34] brouberol: I'm late, but a few minor comments on https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/955717 [13:44:30] ack. The patch was just merged, but I can send out a new review if necessary [13:44:51] have a look and see what you think! Nothing crucial at all. [13:47:27] I' [13:47:55] I've been wondering about the LVS service name resolution. Do you happen to know what convention we follow? [13:48:18] gehel: fwiw the way the batch classes are done allows you to call the cookbook for a whole cumin alias or a specific query (that matches a subset of hosts of the allowed aliases) [13:48:32] so potentially it could match hosts in different clusters, given how that's currently done [13:48:40] I've tried datahubsearch.eqiad.wmnet, datahubsearch.eqiad.wikimedia.org, and others, to no avail [13:48:52] if you want to force it to act on a specific cluster, additional checks to ensure that should be added [13:49:01] I don't think we have a strong enough convention to be able to automatically map a server to an LVS endpoint [13:49:24] ack [13:49:47] service.yaml in puppet and DNS repo [13:49:54] volans has a good point! So please ignore my comment! Or maybe add an inline comment clarifying what we're doing based on the discussion above [13:50:33] brouberol: all services behind an LVS have a dc-specific $name.svc.$dc.wmnet local VIP [13:51:13] then we have discovery records $name.discovery.wmnet that handle the active/active and active/standby services, that usually point to the svc records but might point to specific hosts in some cases [13:51:14] `datahubsearch.svc.eqiad.wmnet has address 10.2.2.71` nice [13:59:53] Oh, we missed the slot to do the failback. I forgot and got distracted. [14:07:46] ah, cr*p, so did I [14:08:07] should we schedule it for tomorrow next time? [14:08:10] *same time [14:08:46] Can we do it in 40 minutes instead? [14:08:55] sure thing [14:10:40] Everything works ok in this failed over state, but there is this niggly thing that makes logs a bit noisier until it's failed back.
T338137 [14:10:41] T338137: spark3 in yarn master mode exhibits warnings when the HDFS namenodes are in the failed over state - https://phabricator.wikimedia.org/T338137 [14:11:02] If the failback doesn't work this time, then we will leave it until tomorrow. [14:12:41] phuedx: milimetric: joal: I see that there is an email from Airflow about potential webrequest data loss. Anything we can do to help? I hope it wasn't caused by us. [14:19:33] 10Data-Platform-SRE: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (10brouberol) 05Open→03Resolved [14:49:17] brouberol: Ready to go. Do you want to look at it together, or are you happy to run it yourself? [14:49:43] I'm happy to pair! [14:50:29] https://meet.google.com/mcv-ddgh-bis [14:56:06] btullis: I'm going through the runbook for dealing with data loss alarms (https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Dealing_with_data_loss_alarms) right now. I'll let you know the outcome :) [14:58:21] phuedx: Great, thanks. I'm going into a meeting right now, but let me know if I can help in any way. [14:59:10] !log successfully failed back the HDFS namenode services to an-master1001 [14:59:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:59:27] 10Data-Platform-SRE: Bring Hadoop workers an-worker11[49-56] into service - https://phabricator.wikimedia.org/T343762 (10brouberol) 05Open→03Resolved All workers are provisioned, registered in Hadoop and have their proper rack assignation. The system should rebalance HDFS blocks automatically. [15:03:27] btullis: that was a warning so it's probably good, sam and I can run the false positive checker [15:05:22] milimetric: OK, thanks. Good stuff. [15:19:11] milimetric: I'm not quite sure how to read the output of the checker. All rows have alarm_is_false_positive set to true [15:21:44] I believe that this change to the refinery deployment is ready to go, if there is an analytics refinery deployment happening today: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/950195 [15:22:23] 10Data-Platform-SRE: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (10Gehel) 05Resolved→03Open [15:25:56] 10Data-Platform-SRE: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (10brouberol) Needs to be tested in the wild before calling it done [15:30:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1002:10080 is too old. 
- https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [15:35:08] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 2): Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (10Gehel) [15:37:03] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, 10Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (10Gehel) a:05gmodena→03BTullis [15:37:55] 10Data-Engineering, 10Data-Platform-SRE: Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895 (10Gehel) a:03lbowmaker [16:00:51] (HdfsFSImageAge) resolved: The HDFS FSImage on analytics-hadoop:an-master1002:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [16:04:46] phuedx: sweet, then yeah, it just means the data loss was a false alert, that missing sequence numbers were found in the hours before and after. I have huge trouble with that script and its output, but if you want to see me fumble through it, I'm happy to oblige [16:44:47] (03CR) 10Joal: [C: 03+1] "Indeed!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/950195 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis) [16:53:37] 10Data-Platform-SRE: Implement depool (source only) and keep-downtime options on data-transfer cookbook - https://phabricator.wikimedia.org/T340793 (10bking) [16:53:39] 10Data-Platform-SRE, 10Release-Engineering-Team, 10Scap: "scap deploy"'s config-deploy should check for broken symlinks - https://phabricator.wikimedia.org/T342162 (10bking) [17:41:27] 10Data-Engineering, 10Data Pipelines (Sprint 14), 10Data Products (Sprint 00), 10Google-Chrome-User-Agent-Deprecation, 10Product-Analytics (Kanban): [SPIKE] Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (10mforns) After a bit of investigation, I th... [17:57:18] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: [BUG] eventutilites-python: fix type checking CI job - https://phabricator.wikimedia.org/T346085 (10lbowmaker) [17:57:20] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: eventutilities-python: Gitlab CI pipeline should use memory optimized runners. - https://phabricator.wikimedia.org/T346084 (10lbowmaker) [18:00:59] 10Data-Engineering, 10Machine-Learning-Team, 10Wikimedia Enterprise, 10Data Engineering and Event Platform Team (Sprint 2), and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10lbowmaker) [18:22:24] (03PS1) 10Sharvaniharan: Minor change to stream name [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/956937 [18:31:31] (03CR) 10Sharvaniharan: "Hi Mikhail, tagged you here since you have context to this change. Please merge if everything looks good." 
[schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/956937 (owner: 10Sharvaniharan) [18:51:53] (03CR) 10Bearloga: [C: 03+2] Minor change to stream name (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/956937 (owner: 10Sharvaniharan) [18:52:08] (03CR) 10Bearloga: Minor change to stream name [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/956937 (owner: 10Sharvaniharan) [19:33:51] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [20:09:17] 10Data-Engineering, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Movement-Insights, and 3 others: Unique Devices seasonal trends on small projects - https://phabricator.wikimedia.org/T344381 (10AKanji-WMF) [20:17:19] 10Data-Platform-SRE: Troubleshoot rdf-streaming-updater/dse-k8s cluster - https://phabricator.wikimedia.org/T346048 (10bking) Patch above fixed the firewall rules, and we were able to get the flink-app to restore from savepoint. Closing this, but work continues in T345957 . [20:46:51] 10Data-Platform-SRE: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10bking) a:03bking [21:01:51] 10Data-Platform-SRE, 10Discovery-Search: Consider migrating search-loader into Kubernetes - https://phabricator.wikimedia.org/T346189 (10bking) [21:03:09] 10Data-Platform-SRE, 10Discovery-Search: Consider migrating search-loader into Kubernetes - https://phabricator.wikimedia.org/T346189 (10bking) [21:08:57] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Rename usages of whitelist to allowlist in query service rdf repo - https://phabricator.wikimedia.org/T344284 (10EBernhardson) This should be ready for deployment now. The rdf package will ne... [21:13:28] 10Data-Platform-SRE: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10bking) Phab can only do parent/child task relationships, but I wanted to call out T346189 , as we're going to discuss migrating the search-loader service to Kubernetes. [21:14:04] 10Data-Platform-SRE, 10SRE, 10ops-codfw: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (10Jhancock.wm) 05Open→03Resolved @bking @RKemper All disks are present [21:17:44] 10Data-Platform-SRE, 10Discovery-Search (Current work): Restore dse-k8s' rdf-streaming-updater from savepoint/improve bootstrapping process - https://phabricator.wikimedia.org/T345957 (10bking) [21:26:51] 10Data-Platform-SRE, 10Discovery-Search (Current work): Restore dse-k8s' rdf-streaming-updater from savepoint/improve bootstrapping process - https://phabricator.wikimedia.org/T345957 (10bking) [21:27:55] 10Data-Platform-SRE, 10Discovery-Search (Current work): Restore dse-k8s' rdf-streaming-updater from savepoint/improve bootstrapping process - https://phabricator.wikimedia.org/T345957 (10bking) Thanks to @dcausse , the job has recovered and we've updated the docs. Moving to "Needs Review" so he can confirm the... 
[21:36:55] 10Data-Platform-SRE, 10Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (10bking) The flink-app in dse-k8s is healthy again, but I have no evidence that it's talking to Zookeeper. Will continue troubleshoot... [23:34:06] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks