[07:25:45] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:23:44] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:35:15] o/, noob question: attempting to reimage wikikube-ctrl1001, I get a blank console when it starts PXE boot, should that make me suspicious of the new NIC/fiber or would that look different? Where do I look for logs or something?
[08:39:45] kamila_: did you fix the dns? is it back to its old one?
[08:41:24] volans: yeah, mgmt interface works
[08:41:56] and when I stare at the console, it reboots happily into PXE and then gets stuck/blank
[08:44:29] kamila_: check https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#DHCP_issues and https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Troubleshooting
[08:44:45] oh, thanks a lot volans
[08:49:19] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9862669 (cmooney)
[08:56:05] Hey peeps, what's the ttl for the recursive dns to refresh? dig @ns0.wikimedia.org wikikube-worker1001.eqiad.wmnet resolves correctly, but dig @dns1005.wikimedia.org wikikube-worker1001.eqiad.wmnet (this is the anycast dns host for cumin that dig @10.3.0.1 CHAOS TXT id.server. +short gives)
[08:56:16] s/but/but not/
[09:01:01] Ok it had a negative cached value for this record
[09:01:34] you can use the wipe-cache cookbook
[09:01:41] yep, just did
[09:01:48] took me a little digging :)
[09:02:00] I wonder if we should add that step to the rename cookbook
[09:03:49] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1008818/3..8/cookbooks/sre/hosts/rename.py#b149
[09:03:52] :D
[09:08:01] Ah I guess since it failed once, and I queried the name, it had negative cache?
[09:09:17] yep
[09:09:27] was it in the last 24h?
[09:09:40] I don't recall the negative cache TTL but might be in that range
[09:10:59] It was on monday end of day iirc
[09:11:04] I think gdnsd default is 10800 and I think we don't override it
[09:11:09] so dunno
[09:11:38] I'll revisit if I run into it again, for now it's good enough to know I can wipe that neg cache easily enough
[09:47:41] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9862852 (akosiaris)
[09:50:50] FYI, I'm rebooting netbox instances in a few
[09:51:57] ack
[09:58:29] SRE-tools, Infrastructure-Foundations, SRE: SREBatchBase: Don't require passing an alias if only one alias is possible - https://phabricator.wikimedia.org/T366680 (MoritzMuehlenhoff) NEW
[09:58:36] SRE-tools, Infrastructure-Foundations, SRE: SREBatchBase: Don't require passing an alias if only one alias is possible - https://phabricator.wikimedia.org/T366680#9862914 (MoritzMuehlenhoff) p:Triage→Medium
[10:00:45] FIRING: [2x] SystemdUnitFailed: prometheus-redis-exporter@6380.service on netbox2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:20:45] FIRING: [4x] SystemdUnitFailed: prometheus-redis-exporter@6380.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:23:54] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9863049 (cmooney) @Jclark-ctr @VRiley-WMF unfortunately these switch upgrades require us to shift some cables around before/after the upgrade to avoid disrupting services....
[10:28:35] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9863056 (VRiley-WMF) @cmooney as it turns out, I will be out until June 10th.
[10:29:51] claime, volans: bit late but I think our negative cache TTL should be only 10 mins?
[10:30:06] The last number in the SOA record is 600, which I believe controls this?
[10:30:06] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/templates/wikimedia.org#3
[10:32:51] 10 mins for wikimedia.org, 1 hour for wmnet it seems
[10:40:45] FIRING: [5x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:41:47] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9863128 (cmooney) >>! In T366361#9863056, @VRiley-WMF wrote: > @cmooney as it turns out, I will be out until June 10th. No probs, enjoy the time off. I'll see if maybe J...
[11:11:17] SRE-tools, Infrastructure-Foundations, SRE: SREBatchBase: Don't require passing an alias if only one alias is possible - https://phabricator.wikimedia.org/T366680#9863237 (Volans) I'd like to know if there is a wider agreement on this before implementing it. It seems reasonable to me but it will affe...
[11:12:41] SRE-tools, Infrastructure-Foundations, SRE: SREBatchBase: Don't require passing an alias if only one alias is possible - https://phabricator.wikimedia.org/T366680#9863251 (MoritzMuehlenhoff) Sure thing, but there's also no real impact, anyone who continues to pass the --alias for these kind of cookbo...
[11:16:40] SRE-tools, Infrastructure-Foundations, SRE: SREBatchBase: Don't require passing an alias if only one alias is possible - https://phabricator.wikimedia.org/T366680#9863260 (Volans) To add a dumb change that makes aliases and query optional and then checks for them later is easy. But at this point it w...
[11:36:49] I'm also rebooting the netbox staging host in a few
[11:38:44] FIRING: [5x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:51:02] hello folks!
[12:51:11] I am going to reboot k8s-aux nodes
[12:51:17] cc: cdanis
[12:58:37] moritzm: I think that the network probe for kartotherian is not working: https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fservice&var-module=All&orgId=1
[13:02:45] Get "https://10.2.1.13:443/osm-intl/6/23/24.png": x509: certificate is valid for maps.wikimedia.org, kartotherian.svc.codfw.wmnet, maps2007.codfw.wmnet, not kartotherian.discovery.wmnet
[13:03:50] not sure about the purpose of this, that's just a mis-matched check, the service appears to be working?
[13:04:29] could it be https://gerrit.wikimedia.org/r/c/operations/puppet/+/1039211 ?
[13:04:43] elukey: ack thanks
[13:05:02] I think that a SAN is missing
[13:05:02] moritzm: well it is kind of just luck that nothing but the probe is sending traffic to kartotherian.discovery.wmnet
[13:05:07] that SAN should be on the cert
[13:05:19] elukey: extremely likely, yes. let's doublecheck with PCC and merge
[13:05:39] fixing CI
[13:06:03] I checked via openssl s_client and I don't see the discovery san, so it should be good
[13:06:06] running pcc too
[13:11:59] rolling out the change to maps nodes!
[13:16:08] done! The probes are working now
[13:16:24] and aux-k8s should be all rebooted (etcd + control plane included)
[13:16:28] ack, great
[13:28:44] FIRING: [4x] SystemdUnitFailed: prometheus-redis-exporter@6380.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:30:22] not sure why those failed, I just cleaned up old configs --^
[13:30:26] (see SAL)
[13:32:52] yep alerts cleared
[13:33:27] ok I am going to factory reset sretest1001 to check if my change to the provision cookbook works fine
[13:33:33] anybody working on it?
[13:33:44] RESOLVED: [4x] SystemdUnitFailed: prometheus-redis-exporter@6380.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:04:09] elukey: I'm merging your Hiera change for sretest1001 (was prompted during another cookbook run)
[14:04:18] (netbox-hiera)
[14:07:31] thanks!
[14:10:04] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9863897 (ayounsi) Plan so far is to merge https://gerrit.wikimedia.org/r/1037784 to be able to have a puppetized test server compatible with the new deploy directory scheme (netbox-...
[14:32:34] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9863988 (cmooney)
[15:32:38] rebooting netboxdb hosts
[15:56:06] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9864343 (elukey) First roadblock: https://www.supermicro.com/en/support/BMC_Unique_Password It seems that every s...
[17:12:15] Mail, collaboration-services, Infrastructure-Foundations, SRE, Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9864875 (jhathaway)
[18:57:41] Mail, Infrastructure-Foundations, SRE: Provision mx-out - https://phabricator.wikimedia.org/T325407#9865258 (jhathaway)
[18:58:56] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9865254 (cmooney) a:MatthewVernon→cmooney
[19:01:22] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9865262 (cmooney) p:Triage→Medium a:MatthewVernon→cmooney
[19:01:24] Mail, fundraising-tech-ops, Infrastructure-Foundations: Update fundraising mail settings to use new production mx hosts - https://phabricator.wikimedia.org/T366740 (Dwisehaupt) NEW
[19:02:00] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987#9865284 (cmooney) p:Triage→Medium a:ABran-WMF→cmooney
[19:03:36] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9865298 (cmooney) p:Triage→Medium a:MatthewVernon→cmooney
[19:03:39] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9865304 (cmooney)
[19:04:03] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9865307 (cmooney) >>! In T365988#9837257, @MatthewVernon wrote: > From the swift POV, this is just checking the cl...
[19:06:38] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9865316 (cmooney) p:Triage→Medium a:ABran-WMF→cmooney
[19:06:50] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9865331 (cmooney)
[19:08:59] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9865360 (cmooney)
[19:10:01] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9865354 (cmooney) p:Triage→Medium a:ABran-WMF→cmooney
[19:11:31] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9865362 (cmooney) p:Triage→Medium a:ABran-WMF→cmooney
[19:12:39] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9865379 (cmooney) p:Triage→Medium a:ABran-WMF→cmooney
[19:13:17] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9865417 (cmooney)
[19:13:27] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9865412 (cmooney) p:Triage→Medium a:ABran-WMF→cmooney
[19:16:57] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9865429 (cmooney) p:Triage→Medium a:ABran-WMF→cmooney
[19:17:04] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9865435 (cmooney) I spoke to @Jclark-ctr earlier, we will do this commencing at 12:00 UTC tomorrow Thurs 6th Jun.
[19:21:28] Mail, Infrastructure-Foundations, SRE: Provision mx-in - https://phabricator.wikimedia.org/T325406#9865447 (jhathaway)
[19:44:11] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9865522 (cmooney) >>! In T360789#9855905, @Papaul wrote: > @cmooney all good on lsw1-d4, lsw1-c2 and lsw1-d8 Thanks! Confirmed all looks good. What was...
[20:25:51] Mail, collaboration-services, Infrastructure-Foundations, SRE, Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9865652 (Dwisehaupt) @jhathaway Question about the routing of mail with these hosts. Currently the civicrm host receives mail...
[20:30:04] Mail, fundraising-tech-ops, Infrastructure-Foundations: Update fundraising mail settings to use new production mx hosts - https://phabricator.wikimedia.org/T366740#9865675 (Dwisehaupt) a:Dwisehaupt PFW and iptables changes pushed. Awaiting pfw rollout (T366753) before we can test.
[20:34:57] Mail, collaboration-services, Infrastructure-Foundations, SRE, Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9865703 (jhathaway)
[20:38:25] Mail, collaboration-services, Infrastructure-Foundations, SRE, Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9865710 (jhathaway) >>! In T365395#9865652, @Dwisehaupt wrote: > @jhathaway Question about the routing of mail with these host...