[01:04:16] FIRING: NTPNoSynced: NTP not synced on dbproxy2007:9100 - https://wikitech.wikimedia.org/wiki/NTP - TODO - https://alerts.wikimedia.org/?q=alertname%3DNTPNoSynced
[02:54:01] RESOLVED: NTPNoSynced: NTP not synced on dbproxy2007:9100 - https://wikitech.wikimedia.org/wiki/NTP - TODO - https://alerts.wikimedia.org/?q=alertname%3DNTPNoSynced
[03:46:08] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9993353 (Papaul) I do agree with the 2 options however there is a possibility too that Frack will be taking a new rack if we do the codfw...
[07:38:44] netops, Infrastructure-Foundations, SRE: Issue creating GNMI telemetry subscription to certain QFX5120 devices - https://phabricator.wikimedia.org/T370366#9993525 (ayounsi) Open→Resolved a:ayounsi Thanks for the investigation ! Seems like the last step was : ` asw1-b3-magru> restart a...
[07:51:50] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9993574 (ayounsi) Or we could just use a IPv6 /64 and stop worrying about space :) Thinking more globally, if we were to redo the product...
[08:42:47] new spicerack release in progress :)
[08:42:54] https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1055152 for the changelog (if anybody has time)
[08:43:18] no need for review, you're a pro now :D
[08:44:24] * elukey foresees "just 42 small nits and you are good to go"
[08:44:40] :D
[08:46:44] sorry can't do right now, last page followups
[08:51:26] XioNoX: fixed, lemme know if it is ok now
[08:51:43] elukey: lgtm!
[08:51:51] thanksss
[08:51:57] waiting for CI and then I'll proceed
[08:53:02] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9993751 (cmooney) >>! In T370164#9993574, @ayounsi wrote: > Or we could just use a IPv6 /64 and stop worrying about space :) One day :)...
[08:53:37] elukey: I copied your line to https://gerrit.wikimedia.org/r/c/operations/software/homer/+/1054543/2..4 as well
[09:36:35] spicerack 8.8.0 released!
[09:36:55] at this point we can try to deploy to cumin2002 and test
[09:37:16] XioNoX, arnaudb - do you have time to check on cumin2002 if I deploy the new spicerack pkg?
[09:37:55] I can help of course, we could either test a cookbook or use some custom code
[09:40:30] elukey: yep
[09:41:29] let's wait a sec if others can join
[09:43:03] I think that we can safely proceed, I'd need to double check redfish and then XioNoX can check netbox, the mysql stuff seems to be not impacting anything so they can be checked later on
[09:43:48] I'm in a meeting, can be in a bit
[09:44:53] XioNoX: cumin2002 updated, you can go ahead with netbox testing if you want
[09:53:20] elukey: lgtm for the current netbox 3 compatibility, no regression
[09:59:04] all good from Redfish as well
[09:59:20] (tested with repl, didn't see anything weird with the new code attributes)
[09:59:54] we can wait for volan*s or arnaud*b and then deploy to cumin1002
[10:14:43] elukey: I'm here
[10:16:00] volans: so far redfish/netbox seems to work, do you want to test the more DP-related changes?
[10:16:01] testing mediawiki first
[10:16:05] and then mysql
[10:16:25] okok
[10:18:16] siteinfo works fine
[10:39:54] elukey: all existing codepaths tested, I'll now test the new stuff, but that's not a blocker and means no need to rollback
[10:40:11] so feel free to update also cumin1002
[10:59:19] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:00:21] Continuation of my RAID woes, I've managed to switch the RAID controller to HBA using scp_push, now I'm trying to reimage and it's stuck on booting d-i (I think)
[11:00:26] Add puppet_version metadata to Debian installer Running IPMI command: ipmitool -I lanplus -H mw2432.mgmt.codfw.wmnet -U root -E chassis bootparam set bootflag none options=reset
[11:00:28] Running IPMI command: ipmitool -I lanplus -H mw2432.mgmt.codfw.wmnet -U root -E chassis bootparam get 5 Running IPMI command: ipmitool -I lanplus -H mw2432.mgmt.codfw.wmnet -U root -E chassis bootparam get 5
[11:00:30] Checked BIOS boot parameters are back to normal
[11:00:37] then it loops on wait_reboot_since
[11:01:01] what does the console say?
[11:01:27] console shows Loading debian-installer/amd64/linux ok Loading debian-installer/amd64/initrd.gz ok Probing EDD (edd=off to disable) ok then blinking _
[11:02:23] did you just switch the raid or then do something on the disks too? maybe they need some setup to make them available for installation
[11:03:40] arnaudb: FYI I've tested all the RO operations in mysql_legacy and so far looks good. If you can get a test host we could test all the other RW ones too. No hurry, those are new code paths and are not used
[11:03:51] Just switched the RAID, which appears to have switched the disks to Non RAID as well, but maybe they do need further push.
I'll kill the reimage and destroy the raid again :p
[11:04:19] check in the bios raid settings if you see anything that might need changing
[11:04:50] like to expose them, dunno, I don't know all the details of those (dcops might help too)
[11:05:52] volans: you just lightened my afternoon thanks! :D I think db1215 is a good candidate, like last time (maybe lets double check with marostegui), I'll try to get to it this afternoon
[11:06:45] ack ping me when is ready for some destructive operation :D
[11:06:54] ack!
[11:12:44] https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Perc_H750_Raid_Controllers_(Virtual_Drive_ID,_Boot_Order) it would appear I need to set some BIOS stuff yes
[11:52:33] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:52:43] FIRING: [4x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:56:47] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:57:59] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:11:22] FIRING: [3x]
SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:13:55] RESOLVED: [4x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:16:51] FIRING: [3x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:19:19] RESOLVED: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:25:36] XioNoX: spicerack fully deployed
[12:25:47] niiiice!
[12:26:25] topranks: when is your network migration planned?
[12:26:54] the week I'm NOT oncall :D
[12:27:04] 15:00 UTC papaul is expected on site
[12:27:42] and I guess "how long it takes" I've not really tried to estimate that, a few hours
[12:28:11] topranks: it was to know if there was room for a Netbox upgrade, but it's a bit tight :)
[12:28:26] I'll do that on monday
[12:28:53] ok
[12:29:20] tbh I've a load of netbox bits to do, my change would take an extra hour if I had to deal with the new UI :P
[12:29:40] but it's fine if you want to do it still I can work with it !
[12:30:07] god damn I'm kicking myself for messing that up :(
[12:30:32] nah, I don't want to rush it if there are any complications
[12:30:49] don't beat yourself up too much
[12:31:03] it can happen, also you fixed it, and you deserve a t-shirt :D
[12:31:22] topranks: that happens, especially in complex work.. Impact was also quite low overall. Good way of testing redundancy :)
[12:31:52] sure, I know. but still
[12:32:10] it's over-confidence, for the previous one I had everything tested N-ways with vQFX, but I thought this was "more of the same"
[12:32:33] the problem was ssw1-a1-codfw and ssw1-a8-codfw learning the routes for the row c/d vlans from ssw1-d1-codfw
[12:32:45] I guess one question is, with a very low amount of traffic to the site, why were only 2 rows not enough? I can imagine some of the masters being in C/D, but it shouldn't impact read operations?
[12:32:47] in the previous migration there was no "inter-spine" connectivity like that to consider
[12:34:05] well nothing could talk between the rows
[13:58:28] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9994796 (elukey) It seems clear that for the foreseeable future (next 6/8 months) we will not have the DHCP hostna...
[13:59:02] XioNoX, topranks - I tried to collect the discussion with DCops about where/when to add the mac address of the mgmt interface for Supermicro nodes in https://phabricator.wikimedia.org/T365372#9994796. Lemme know what you think about it!
[14:03:17] elukey: replied :)
[14:03:22] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9994832 (ayounsi) I'd suggest to abstract the device creation by a custom script or cookbook. This could run addit...
[14:06:38] topranks: looks like we agree :)
[14:06:55] heh yeah just reading your comment :)
[14:07:11] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9994847 (cmooney) There is possibly a variant of option 1: - Create a new custom script to add devices, which has...
[14:10:38] so basically something like https://netbox.wikimedia.org/extras/scripts/provision_server.ProvisionServerNetwork/ but to be run before it, aimed to add the device
[14:10:54] I am still very ignorant about netbox, I am not seeing why this is better than a custom field
[14:11:45] elukey: yep exactly
[14:12:34] elukey: in terms of user experience it's roughly the same
[14:12:39] elukey: it's better than a custom field as the script will put the MAC address where it is, instead of creating a new additional location to store that address
[14:12:48] where it is/where it should
[14:12:58] the benefit comes from using the built-in model of netbox to store the data, given that the model already has somewhere exactly fitting the requirement
[14:13:09] yep
[14:13:38] and in general a desire to try and not add too many custom fields, as it would be easy to go overboard with them I think
[14:14:07] okok, just to understand, what would change if we add a new textbox to https://netbox.wikimedia.org/extras/scripts/provision_server.ProvisionServerNetwork/ though?
[14:14:22] compared to the "add device" script
[14:14:25] dc-ops said they don't want that
[14:14:35] elukey: order of operations, iirc DCops run..
yeah that ^ :)
[14:15:48] yep yep sure, but Papaul also proposed a compromise for option 2) that is not bad - Dcops would copy/paste the mac address in the racking task for the new server, so that they'd have it handy even when running the network provision script
[14:16:04] I forgot to add it, just came up in my mind
[14:16:42] sure, if that works for them then option 2 in general is slightly easier for us to implement
[14:17:05] but the proposed new custom script wouldn't take too long so as long as it's not a major hassle for them
[14:17:18] okok perfect, I'll try to ask again to them and will report back :)
[14:17:25] thanks for the brainbounce!
[14:19:14] custom scripts can do much more too, for example if the script is to be run for servers only, it doesn't have to ask for device type, status, airflow, etc. so it makes it easier for DCops, and prevents data entry issues
[14:24:08] and IIUC https://netbox.wikimedia.org/extras/scripts/provision_server.ProvisionServerNetwork/ provisions both the mgmt and the primary interface right?
[14:24:34] elukey: yep
[14:27:49] okok makes sense.. so the idea is to have a new field in https://netbox.wikimedia.org/extras/scripts/provision_server.ProvisionServerNetwork/ called "mgmt mac-address" (or similar) that gets the value, validates it and then inserts it in the device's mgmt interface metadata
[14:28:12] so in the provision cookbook we'll have it ready for the dhcp snippet etc..
[14:32:36] elukey: if that works for DCops that's fine for me. That seems like adding a new step though? (copy pasting the MAC on a task, with the risk of typos, or copying it back from the wrong task, etc.)
[14:36:16] XioNoX: yes I know, but you can typo even if you insert it directly in the DC etc..
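For the "gets the value, validates it" step floated at [14:27:49], a minimal sketch of what validating and normalizing a pasted MAC address could look like in Python. The function name normalize_mac, the accepted separators and the lowercase colon-separated output are assumptions for illustration only; this is not code from the Netbox provisioning script.

```python
import re

# Hypothetical helper, not part of any existing Netbox script or cookbook.
# Twelve hex digits once separators have been stripped.
_HEX12 = re.compile(r"[0-9a-f]{12}")


def normalize_mac(raw: str) -> str:
    """Return the MAC as lowercase colon-separated octets, or raise ValueError.

    Accepts the common copy/paste forms: aa:bb:cc:dd:ee:ff,
    AA-BB-CC-DD-EE-FF, aabb.ccdd.eeff and aabbccddeeff.
    """
    # Drop separators and surrounding whitespace, then lowercase.
    digits = re.sub(r"[:\-.\s]", "", raw).lower()
    if not _HEX12.fullmatch(digits):
        raise ValueError(f"not a valid MAC address: {raw!r}")
    # Re-join as six two-character octets.
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))
```

A check like this catches length and character typos at entry time, but it cannot tell whether a well-formed MAC was copied from the wrong task, which is the risk raised at [14:32:36].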
[14:36:32] I think that we'll have to live with it until the dhcp hostname option is available
[14:36:36] yeah, there are tradeoffs for all the options
[14:36:41] or better, I don't have better ideas for supermicros :(
[14:44:14] there are unfortunately not a lot of options. to automate things we need a unique ID, so if we can't have the serial#, it has to be the MAC
[14:45:14] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9995019 (elukey) @Papaul the proposal that would be the best compromise is to add a "mgmt mac-address" field to ht...
[14:49:44] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9995033 (ABran-WMF) data-persistence hosts handled, ready whenever you are @cmooney
[14:50:14] depends how complex you want the solution to be... :)
[14:50:59] volans: let's use option 82 on the mgmt side!
[14:51:01] elukey: actually I might just have got an idea... not simple, not ideal, but might work
[14:52:32] volans: the premise is not great :D
[14:52:42] we know the hostname from CLI input, we go to netbox, we know on which switch/port the host is connected, we go to the switch and we check which MAC addresses are learned on that switch port (can we get only MACs on a specific port? ideally just one?), if more than one we go to the install server and grep DHCP requests coming in in the logs (or tcpdump) with the learned MACs and if we
[14:52:44] what is not good with the current proposal?
have a match we got our MAC
[14:53:10] * elukey cries in a corner
[14:53:30] it's inverted option82 for supermicro :D
[14:53:44] I want a ™ for it
[14:53:51] this to avoid a copy paste
[14:53:54] volans: note that on mgmt the switches are not managed, so that won't work I think
[14:54:07] XioNoX: this is for supermicro reimages
[14:54:31] ah no right is for mgmt
[14:54:33] doh
[14:54:42] that's why I didn't think about it before :D
[14:55:05] hey, thanks to https://phabricator.wikimedia.org/T363576#9994708 we might have a new reason to get rid of option 82 :)
[14:55:44] did anyone check if supermicro uses option 61 or 97 on its prod interface?
[14:55:45] yes we discussed the plan in the office hours
[14:55:52] the one you didn't come to, we missed you
[14:56:10] volans: would it be possible to schedule the office hours during office hours :)
[14:56:34] I'd be happy to anticipate a little the meeting
[14:56:40] check with papaul
[14:57:20] just kidding, I wasn't available at 6pm yesterday. But I usually am
[14:57:51] the question is what to replace option 82 with ?
[14:58:04] option 61, option 97 or MAC address
[14:58:16] Dell doesn't do 61 on prod interfaces
[15:11:47] it works with FQDNs too, not only IPs ;) reply to your task comment
[15:12:09] TIL :)
[15:12:35] it was like that when it was hardcoded
[15:12:42] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995231 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8062b5f0-d6f0-401c-9dfd-590a5facd0ad) set by cmooney@cumin...
[15:45:37] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:07:38] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995538 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fdebcc6c-adaa-42f3-809d-4ec381a4798d) set by cmooney@cumin...
[16:12:36] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995556 (cmooney)
[16:21:15] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995596 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1b177f94-1995-41ab-90b9-673cef9dbf94) set by cmooney@cumin...
[16:34:48] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995659 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f32e4714-9c03-456e-bc05-238c01bacbca) set by cmooney@cumin...
[16:44:19] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:46:40] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995727 (cmooney)
[17:19:19] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:45:37] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:13:11] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9996129 (Papaul) ok +1 for /25 so we all okay thanks
[18:41:12] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9996287 (cmooney) Open→Resolved Work completed, traffic is currently bridged through the two spine switches over the AEs...
[18:44:36] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9996322 (Jhancock.wm) ++ for /25 from me as well
[18:53:28] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9996358 (cmooney)
[18:56:37] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9996362 (cmooney) GNMI stats proved very helpful to keep an eye on the bandwidth shifting around {F56509244 width=600} {F56509...
[19:32:21] netops, Infrastructure-Foundations, SRE: Move IP gateways for codfw row C/D vlans to EVPN Anycast GW - https://phabricator.wikimedia.org/T369274#9996630 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d6a640fd-d19e-4aa8-930d-6c260b51a4c3) set by cmooney@cumin1002 for 3:00:00 on 4 ho...
[20:28:34] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475 (cmooney) NEW p:Triage→Medium