[09:05:04] o/, I still haven't been able to reimage wikikube-ctrl1003, it's not pxe booting and I'm out of ideas, could someone please help? or am I asking in the wrong place?
[09:05:49] I have checked FW versions and settings with the dell documentation and I do not see dhcp packets on the installserver
[09:42:49] kamila_: hey
[09:42:56] let me have a look
[09:43:02] o/, thank you topranks <3
[09:44:44] it's saying the link is not coming up for some reason
[09:44:46] https://usercontent.irccloud-cdn.com/file/I202nWac/image.png
[09:46:21] firmware is 21.60.22.11, slightly older than 21.85.21.92 which we normally go for but normally that should be fine (it's v22+ that gives us issues, and once we get into the debian installer environment)
[09:48:01] potential loose cable?
[09:48:30] yeah maybe, the switch is also showing the port as down
[09:48:37] cmooney@asw2-c-eqiad> show interfaces descriptions | match wikikube
[09:48:37] xe-2/0/1 up down wikikube-ctrl1003 {#2966}
[09:55:40] ok, I will prod dc-ops about it then, thank you
[09:55:57] how did you find that?
[09:56:23] oh, the switch CLI that I don't have and don't want to have access to
[09:56:26] right? :D
[09:56:31] correct
[09:56:44] but also "Media test failure, check cable" in the boot screen
[09:57:39] oh, does the https interface show something other than the ssh one? ô.ó
[09:58:08] the screenshot cathal pasted above, should be the same you get from the ssh console
[10:00:22] often the ssh one doesn't have that detail actually
[10:00:35] it's only on the vga output for some reason
[10:01:11] excellent '^^
[10:01:22] kamila_: I checked through the settings here all seem fine, on all sides it's showing the link as not coming up :(
[10:01:45] The DAC cable being used is one of the rarer kinds, I've not seen before and lacks a Juniper revision coding, might be an older one
[10:02:08] thank you so much topranks <3 between this and all the other things I was getting desperate '^^
[10:02:11] I think the next step is to ask DC-Ops to replace the cable and see if that changes the status
[10:02:18] yep, will do
[10:02:29] no probs - sorry I had no good answer!
[10:02:52] this is the cable type fwiw, "the mate company" not seen that before
[10:02:54] 1 10GBASE CU 1M n/a THE MATE COMPANY C9999-1M-P n/a 0.0 SFF-8472 ver n/a
[10:05:11] topranks: that's an excellent answer, because it's actionable, unlike my "it's not booting and I don't know what to do" :D
[10:05:55] yeah, it's a bit annoying though, and I'm wondering what went wrong with the previous two which didn't work, and then did
[10:06:29] I was hoping this was the same but it seems like a physical issue ¯\_(ツ)_/¯
[10:15:15] topranks: yeah, this one is a different NIC though
[10:15:34] oh is it? what NIC was in the others?
[10:15:38] I will let you know if I encounter the original problem, that one was pretty clearly not a physical thing
[10:15:56] the something something 2 rather than something something 4
[10:16:05] I think this one has a BCM57412 which is the typical one we have
[10:16:24] hmm, maybe the others were BCM57414, which is the 10/25G variant of that card
[10:16:27] good to know, thanks!
[10:16:37] yeah, either that or the other way around, it was something like that
[10:17:41] but the original problem happened after 1 successful pxe boot, so it's still possible I'll catch it here once the cable is swapped :D
[10:23:10] positive thinking... I like it :D
[12:28:13] netops, Infrastructure-Foundations, SRE, SRE-swift-storage, Traffic: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056#9888375 (MatthewVernon) Just to note that per [[ https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&va...
[12:36:34] moritzm: when are you planning to reboot cumin2002?
[12:38:24] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9888452 (elukey) IIUC we are missing DHCP's option 12 from the BMC's client. On DELL's we expect something like:...
[12:38:55] hello folks! I am going to reset to factory defaults sretest1001
[12:39:00] lemme know if anybody is working on it
[12:43:22] elukey: go ahead for me
[12:47:49] volans: in 15 mins from now, should I wait for something?
[12:47:59] no, the opposite, we wait for you :D
[12:48:07] arnaudb, elukey: for you would it work to release and test spicerack on cumin2002 right after the reboot? We'll be sure no cookbook is running and can also ask mor.itz to hold off a bit on telling everyone they can restart using it... ;)
[12:48:13] then let me do it right now
[12:48:27] the main issue was backups completing and I checked with Jaime before that these are done
[12:48:44] no hurry on our side, I'm not even sure the others are ready
[12:49:33] it's unused now, I'll go ahead in a few minutes
[12:50:19] volans: sure, we can use sretest1001 to run the provision cookbook and double check as well
[12:50:22] netops, Infrastructure-Foundations, SRE: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408 (cmooney) NEW p:Triage→Low
[12:50:30] I just need to see if it reports the Hostname option
[12:55:14] volans: I see Hostname (12), length 13: "idrac-XXXXX" for sretest1001.. so it seems that supermicro's BMC doesn't set it by default :(
[12:56:31] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9888510 (elukey) I can confirm that the sretest1001's BMC sends this: ` DHCP-Message (53), length 1: Discover Hos...
[13:00:25] and https://www.supermicro.com/support/faqs/faq.cfm?faq=24257 seems to indicate that the BMC hostname needs to be ste
[13:00:28] *set
[13:01:39] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9888532 (cmooney) Open→Resolved
[13:04:34] volans, elukey: cumin2002 is rebooted, go ahead
[13:04:55] super
[13:05:56] moritzm: thanks a lot
[13:06:06] arnaudb: let us know when you're around too
[13:12:57] SRE-tools, Infrastructure-Foundations, Spicerack: Create the python-deploy repository - https://phabricator.wikimedia.org/T367410#9888607 (elukey)
[13:16:55] SRE-tools, Infrastructure-Foundations, Spicerack: Create the python-deploy repository - https://phabricator.wikimedia.org/T367410#9888639 (elukey) Created https://gitlab.wikimedia.org/repos/sre/python-deploy @Volans we can change the name if you want, otherwise please push the first version of the c...
[13:18:38] * arnaudb backlogs
[13:20:45] I'm around volans elukey :)
[13:20:58] ok, give me 5 and we can start
[13:21:56] ✋
[13:25:30] I have a meeting in 5, will be back in ~30 mins
[13:27:25] elukey: should we wait for you?
[13:27:33] nono please go ahead
[13:27:43] we can test later on the provision cookbook, is it ok?
[13:30:16] ack
[13:42:09] netops, Infrastructure-Foundations, SRE: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9888712 (cmooney) We could use these cables but the host side but we might not have enough slack to connect to servers at dif...
[13:47:45] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9888730 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=94b81d4d-316b-4c68-b4a9-a2d07057d180) se...
[14:04:55] back!
[14:09:46] elukey: cumin2002 has the new spicerack, we're testing the mysql_legacy stuff, if you could go through the redfish part that would be grat
[14:09:49] *great
[14:11:16] will do it in a sec
[14:35:34] SRE-tools, Infrastructure-Foundations, Spicerack: Create the python-release repository - https://phabricator.wikimedia.org/T367410#9888942 (elukey)
[14:48:38] elukey: we're done with the mysql_legacy, how's going with the redfish? as long as there is no regression it's ok
[14:49:06] volans: I tested some functions, and the provision cookbook for sretest1001 is ongoing but it looks working
[14:49:13] so I'd say we are ok
[14:49:16] great
[14:49:21] need a hand?
[14:49:24] I'm available
[14:50:26] nope just finished
[14:50:28] all good
[14:52:33] yay
[14:52:41] so no rollback needed
[15:01:59] arnaudb, elukey: so ok for you to upgrade cumin1001 too?
[15:02:08] *1002
[15:02:17] ok for me!
[15:03:36] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889146 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=891c00a3-b649-4659-b39f-5ad6b01367a9) se...
[15:04:47] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889149 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5a6a58c5-4681-4aea-8e80-e8ba2c613022) se...
[15:05:54] +1
[15:06:02] ack thx
[15:06:07] {done}
[15:10:17] elukey: for the curious did you take a pcap of the SuperMicro DHCP packets?
[15:11:48] I think so
[15:12:08] but AFAICT at least from the text version on console I did there is no hostname :( or anything else distinguishable AFAICT
[15:12:28] boo supermicro !
[15:14:03] what surprises me is that I really thought that this was tested back then when we had the first test host, but I can't find any proof on phab
[15:14:55] topranks: o/ there is sretest2001 that keeps sending dhcp requests to install2004 right now, if you want to take a look
[15:15:03] no Hostname(12) set :(
[15:15:28] volans: yeah I was sure I seen it too
[15:15:31] oh well
[15:15:49] is this DHCP for the mgmt network or from the main NIC?
[15:15:57] the former
[15:16:36] I asked dcops to try to set the hostname for the BMC, IIUC from reading supermicro's doc after that it should start sending the hostname in the dhcp requests
[15:16:46] but it wouldn't be really good for us of course :(
[15:16:52] is that how we do it right now for the Dells?
[15:17:11] sorry I'm forgetting how we do the mgmt dhcp
[15:17:12] yes, use hostname that is set to idrac-$SERIAL
[15:17:22] and we do match on that
[15:17:37] topranks: like this https://phabricator.wikimedia.org/T365372#9888510
[15:17:42] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/spicerack/dhcp.py#164
[15:18:47] volans: thanks yep, I guess we're using the serial mostly then right?
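(editor's note) The matching described above, where the BMC sends DHCP option 12 (Host Name) set to idrac-$SERIAL and the DHCP side matches on it, can be sketched as follows. This is only an illustration and is not the spicerack code linked in the log: the function names `parse_dhcp_options`, `bmc_hostname` and `matches_expected` are hypothetical, and the serial in the usage note is made up. The option layout (one code byte, one length byte, then the value, with 0 as pad and 255 as end) follows RFC 2132.

```python
from typing import Dict, Optional

PAD, END = 0, 255        # RFC 2132 pad and end option codes
OPT_HOSTNAME = 12        # "Host Name" option, the one discussed in the log
OPT_MESSAGE_TYPE = 53    # "DHCP Message Type" option (1 = Discover)


def parse_dhcp_options(raw: bytes) -> Dict[int, bytes]:
    """Parse the code/length/value options that follow the DHCP magic cookie."""
    options: Dict[int, bytes] = {}
    i = 0
    while i < len(raw):
        code = raw[i]
        if code == END:
            break
        if code == PAD:  # pad is a single byte with no length field
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("truncated option header")
        length = raw[i + 1]
        value = raw[i + 2 : i + 2 + length]
        if len(value) != length:
            raise ValueError("truncated option value")
        options[code] = value
        i += 2 + length
    return options


def bmc_hostname(raw: bytes) -> Optional[str]:
    """Return the option 12 hostname, or None when the client did not send it."""
    value = parse_dhcp_options(raw).get(OPT_HOSTNAME)
    if value is None:
        return None
    return value.decode("ascii", errors="replace")


def matches_expected(raw: bytes, serial: str) -> bool:
    """Hypothetical check: does the hostname equal 'idrac-<serial>'?"""
    return bmc_hostname(raw) == f"idrac-{serial}"
```

With a made-up serial, `matches_expected(bytes([53, 1, 1, 12, 13]) + b"idrac-ABC1234" + bytes([255]), "ABC1234")` is true, while a Discover that carries no option 12 (the Supermicro case reported above) makes `bmc_hostname` return `None`, so there is nothing to match on.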
[15:19:40] but we're flexible, whatever the BMC sends us that we can get from Netbox is fine :D
[15:21:49] thanks to t.opranks' help, I have the next problem ready :D https://usercontent.irccloud-cdn.com/file/fvjBwFbW/image.png
[15:22:06] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889279 (cmooney) Switch has reloaded on the new version, all looks good at first glance. ` cmooney@lsw1-f6-eqiad...
[15:22:15] that's wikikube-ctrl1003, I decommed it yesterday and it claimed to have wiped the disk
[15:23:25] kamila_: is that the one we were looking at earlier? it got beyond PXEboot now?
[15:23:41] topranks: yes, turns out the switch port was bad
[15:25:02] kamila_: does wikikube use a partman recipe that re-uses /srv across reimages?
[15:25:40] I don't know, I have to check
[15:26:40] thanks for the hint
[15:28:48] kamila_: who in DC-ops were you dealing with on the port change?
[15:29:14] not sure we've had any of those before, I'll probably need to talk to them work out what's happened and find a way to mark it so it's not re-used
[15:29:14] topranks: VRiley
[15:29:17] ok thanks
[15:29:32] she said the SFP connector was stuck in it :D
[15:32:11] topranks: and here's a fun fact, this box isn't doing the "won't pxe boot for the next 12 hours" thing
[15:32:24] so yeah, smells like a firmware rather than switch thing
[15:32:28] ah ok
[15:32:39] nice, I was just describing that fun problem that sometimes happens : )
[15:33:00] that also means we won't accidentally re-use the port so solves my worry too
[15:33:17] true :D
[15:33:26] yeah the "not pxebooting" issue I really didn't understand, if we hit that again ping me
[15:34:46] will do, thanks
[15:37:23] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889405 (MatthewVernon) Swift looks good, thanks.
[15:38:22] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:43:45] FIRING: SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:53:45] RESOLVED: SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:57:06] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889580 (cmooney) Open→Resolved Thanks for checking things, all stable on our side I will close the ta...
[15:58:45] FIRING: SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:05:34] netops, Infrastructure-Foundations, SRE: No IPv6 ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439 (cmooney) NEW p:Triage→High
[16:08:45] RESOLVED: SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:09:23] general question about debmonitor-client - IIUC we just report if the OS is Debian or not, but have we ever thought about adding the version too?
[16:09:58] basically I am checking https://debmonitor.wikimedia.org/hosts/ and https://debmonitor.wikimedia.org/images/, for the latter it would be really great to know what debian version each image runs
[16:10:19] so that it would be quick and easy to spot old images, that we can clean up or that we need to upgrade
[16:10:41] for the hosts we have puppet reporting facts, but not for the images
[16:15:45] we didn't by design, because in theory you can do any kind of frankenstein mixing sources, backporting kernels, etc...
[16:16:27] for hosts for example you can easily use python3-wmflib that is installed everywhere and has a specific version per-distro
[16:16:35] for images I'm sure we can find an equivalent
[16:17:32] mmm I am not getting the frankenstein - even if you have corner cases, you do have a specific debian version for each host/image
[16:18:02] reporting that it would be really useful, and more understandable and user friendly that looking for a package version :)
[16:18:07] what is a debian version? if I add an ubuntu sources.list can I still say it's a buster?
[16:18:57] come on we don't do that, and we could use /etc/debian_version as canonical source of truth
[16:19:59] I would argue that if you use ubuntu sources.list then calling the os as "Debian" wouldn't be right anyway
[16:20:04] but we do it
[16:20:26] I am just suggesting that we report Debian $version instead of simply reporting Debian :)
[16:21:26] and I have a valid use case - we'd need to be better in poking people when deprecating os support, otherwise we move production from a new stable but in k8s we leave things as they are
[16:21:42] what I'm saying is that lsb_release -a doesn't tell the whole story
[16:21:56] anyway IIRC was something mor.itz had an opinion on
[16:22:01] but my memories are fuzzy
[16:23:22] RESOLVED: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:23:56] oookok
[16:24:07] logging off for today folks, have a nice rest of the day o/
[16:24:39] elukey: T240193
[16:24:40] T240193: debmonitor: show OS release name in the host view - https://phabricator.wikimedia.org/T240193
[16:25:53] thanks!
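(editor's note) The suggestion above, reporting "Debian $version" with /etc/debian_version as the source of truth, might look roughly like the sketch below. It is not debmonitor-client code: `debian_release`, `read_debian_release` and the small `CODENAMES` table are assumptions made for illustration. It relies only on the fact that stable releases write a numeric version such as 12.5 to that file, while testing/unstable write a string such as trixie/sid.

```python
from pathlib import Path
from typing import Optional

# Illustrative subset only; a real mapping would need to be kept up to date.
CODENAMES = {"10": "buster", "11": "bullseye", "12": "bookworm"}


def debian_release(content: str) -> str:
    """Turn the contents of /etc/debian_version into a short label."""
    value = content.strip()
    major = value.split(".", 1)[0]
    if major.isdigit():
        codename = CODENAMES.get(major)
        return f"Debian {major} ({codename})" if codename else f"Debian {major}"
    # Testing/unstable carry a codename string rather than a number.
    return f"Debian {value}"


def read_debian_release(path: str = "/etc/debian_version") -> Optional[str]:
    """Read the file from a host or an unpacked image; None if it is absent."""
    try:
        return debian_release(Path(path).read_text())
    except OSError:
        return None
```

As the objection in the log points out, this reports only what the base system claims to be; it says nothing about mixed sources or backported packages.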
[16:26:06] I guess we can re-evaluate if needed :)
[16:26:16] it was a long time ago in the pre-k8s era, let's re-eval :)
[16:26:19] as for the specific problem you're trying to solve if you give more context maybe I can help
[16:30:13] netops, Infrastructure-Foundations, SRE: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9889810 (cmooney)
[16:32:02] definitely I'll reopen the task and add the use cases :)
[16:33:18] I meant right now with some hack query :D
[20:32:58] netops, Infrastructure-Foundations, SRE: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9890729 (cmooney) It seems this was an inadvertent result of the upgrade to the codfw row A/B switches, and the move there from a purely L2 switching layer to a rout...
[20:35:47] FIRING: SystemdUnitFailed: prometheus-postfix-exporter.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:38:45] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:40:30] SRE-tools, Infrastructure-Foundations, Observability-Alerting, Spicerack: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences - https://phabricator.wikimedia.org/T367466 (CDanis) NEW
[20:43:23] SRE-tools, Infrastructure-Foundations, Observability-Alerting, Spicerack: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences - https://phabricator.wikimedia.org/T367466#9890786 (CDanis)
[20:44:13] SRE-tools, Infrastructure-Foundations, Observability-Alerting, Spicerack: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences - https://phabricator.wikimedia.org/T367466#9890789 (CDanis)
[21:13:45] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:18:03] netops, Infrastructure-Foundations, SRE, Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9890880 (cmooney) I've pushed this change to cr2-eqdfw and it seems to be doing what we need there: Codfw /48 is announced to Facebook: ` cmoo...
[21:25:47] FIRING: [5x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:28:08] netops, Infrastructure-Foundations, SRE, Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9890909 (cmooney) I'm monitoring the change in traffic levels. Right now it seems negligible, however that is not much surprise, prior to the...
[21:44:19] netops, Infrastructure-Foundations, SRE, Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9890956 (cmooney) Just to note that for the same time period (since March 5th) we've not been announcing the codfw aggregates from eqord: ` cmo...
[22:30:47] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:33:45] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:35:47] RESOLVED: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed