[06:13:37] volans: re https://netbox.wikimedia.org/ipam/ip-addresses/16114/ it's because we assign a /32 per server and a /32 is automatically considered as a VIP in the Puppet import netbox script. Is it causing issues?
[06:31:25] XioNoX: I'm not sure it can be called an "issue", I noticed it in the Netbox capirca script output being different from the other hosts, we have sretest2005.codfw.wmnet instead of sretest2005 as host definition (and only IPv4, no IPv6) and it generates a group called sretest.codfw.wmnet_group and is not part of the sretest_group group
[06:32:22] ah right! please open a task, but the whole capirca stuff needs to be improved
[06:36:32] k will do
[06:41:52] 10netops, 06Infrastructure-Foundations: Capirca setup for routed Ganeti VMs - https://phabricator.wikimedia.org/T367265 (10Volans) 03NEW
[06:41:57] mmmh I don't see sretest2005 as a VM, maybe it's because of that, anyway T367265 ready :D
[06:41:57] T367265: Capirca setup for routed Ganeti VMs - https://phabricator.wikimedia.org/T367265
[07:39:55] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9883256 (10ABran-WMF)
[07:40:38] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9883270 (10ABran-WMF)
[07:40:59] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9883271 (10ABran-WMF)
[07:41:36] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987#9883272 (10ABran-WMF)
[07:41:48] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9883273 (10ABran-WMF)
[08:39:56] hello world, it's been ages since I last complained here!
[08:41:21] I don't have an example at hand right now, but I've been running into the debian installer complaining that "The volume group name used to automatically partition using LVM is already in use" and not continuing when reimaging wikikube-ctrl machines
[08:42:51] The machines are keeping their hostname (which IIUC is used for coming up with the VG name) and I do a decom before the reimage (following the procedure at https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Move_existing_server_between_rows/racks,_changing_IPs)
[08:44:04] was the wipe partition table step completed when doing the decom?
[08:44:19] kamila_: and let me know about yesterday's issue :D
[08:44:52] volans: because I ran into this, I didn't move on to the next server yesterday, I'll have it today afternoon
[08:45:12] ack
[08:46:24] https://phabricator.wikimedia.org/T366204#9880133 seems to think partitions were wiped
[08:47:27] I'll leave this to the d-i expert(s) in this channel ;)
[08:48:10] volans: oh, you bored? I have another one!
[08:48:37] (though that one might actually be for network or dc-ops peeps)
[08:49:29] which is that after the first pxe boot I no longer see any dhcp packets when I retry the reimage cookbook, and the "solution" is "come back tomorrow"
[08:49:52] I don't know if the switch is eating the packets for some "security" reasons or if the NIC isn't sending any
[08:50:38] the usual firmware issue?
[08:51:13] I don't know what the usual firmware issue is, it's 10G NICs and I did explicitly tell the NIC to do PXE boot
[08:51:15] double check the firmware version with dcops (or wikitech)
[08:51:19] I have
[08:51:26] (with wikitech)
[08:51:26] then yes it's most likely that one
[08:51:42] which that one?
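The VG complaint above is consistent with a leftover volume group on disk colliding with the name debian-installer wants to use. Per the conversation, the VG name is derived from the hostname (IIUC), so a reimage that keeps the hostname trips over the old VG unless the wipe step actually removed it. A minimal sketch of that suspected collision, where the exact derivation rule (short hostname + "-vg") and the helper names are illustrative assumptions, not d-i source:

```python
# Hypothetical sketch of the reimage VG collision. The derivation rule
# (short hostname + "-vg") is an assumption for illustration; debian-installer's
# actual default comes from partman-auto-lvm and may differ.

def default_vg_name(hostname: str) -> str:
    """Assumed rule: default VG name is the short hostname plus '-vg'."""
    return hostname.split(".")[0] + "-vg"

def has_vg_conflict(hostname: str, existing_vgs: list[str]) -> bool:
    """True if a leftover VG on disk would block automatic LVM partitioning."""
    return default_vg_name(hostname) in existing_vgs

# A host keeping its name across reimage collides with the old VG:
print(has_vg_conflict("wikikube-ctrl1002.eqiad.wmnet",
                      ["wikikube-ctrl1002-vg"]))  # True
# A properly wiped disk has no VGs left, so no conflict:
print(has_vg_conflict("wikikube-ctrl1002.eqiad.wmnet", []))  # False
```

If this model is right, confirming which VGs survived the decom (e.g. from a rescue shell) would distinguish "wipe didn't run" from "wipe ran but missed the LVM metadata".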
[08:51:53] https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Urgent_Firmware_Revision_Notices
[08:52:39] it happens also with those fw versions, I checked that
[08:56:40] mmmh, weird, then yes more dcops/netops
[08:56:48] ack, thank you volans <3
[08:56:58] I think that's the last one from this series, hopefully XD
[08:57:15] * volans fingers crossed
[08:57:37] should I be pinging anyone specific for the installer VG issue?
[09:29:33] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9883482 (10jcrespo) backup1010 is in intermittent usage to support mediabackups disk space, but mostly idle at the t...
[09:35:09] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9883497 (10jcrespo) backup1009 is the main backup node for bacula on eqiad. Most backups happen during the night- so...
[09:36:12] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9883498 (10jcrespo) backup1011 is a mediabackups storage server. Ideally, mediabackups are paused during the mainten...
[09:40:59] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9883516 (10jcrespo) db1205 is the secondary media backups metadata db server, usually just a standby to db1204. Unles...
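The firmware check discussed above (installed NIC revision vs. the version documented on wikitech) boils down to a dotted-version comparison, which plain string comparison gets wrong (e.g. "21.9" > "21.85" lexically). A small sketch of doing it correctly with numeric tuples; the helper names are illustrative, and the version strings below are the one mentioned in this conversation plus a made-up older one:

```python
# Illustrative dotted-version comparison for NIC firmware revisions.
# fw_is_current() and the "required minimum" framing are assumptions for
# the sketch; only 21.85.21.92 comes from the conversation itself.

def parse_fw(version: str) -> tuple[int, ...]:
    """Turn a dotted firmware string like '21.85.21.92' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def fw_is_current(installed: str, required: str) -> bool:
    """True if the installed firmware is at or above the required revision."""
    return parse_fw(installed) >= parse_fw(required)

print(fw_is_current("21.85.21.92", "21.85.21.92"))  # True
print(fw_is_current("21.80.9.0", "21.85.21.92"))    # False
# String comparison would get this wrong; tuples compare numerically:
print(parse_fw("21.9.0.0") > parse_fw("21.85.0.0"))  # False
```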
[09:47:00] kamila_: the NIC in that host is on firmware 21.85.21.92 which is the right one, no need to change that
[09:47:47] topranks: yes, I know, I checked that
[09:48:18] I don't think what I'm seeing is a FW version issue
[09:48:49] no it won't be
[09:49:01] what is the current status of the host? it failed reimage due to PXE boot not working, is that right?
[09:49:36] well the last update on the task seems to suggest it completed ok? (https://phabricator.wikimedia.org/T366204#9883434)
[09:50:11] I see it failed a bunch of times which is kind of odd
[09:54:04] topranks: it was failing to pxe boot after the first successful pxe boot
[09:54:14] (I had to restart the installer because of an unrelated bug)
[09:54:24] and waiting 12h made it pxe boot again
[09:54:32] this is the 2nd server I'm seeing this with
[09:54:57] I didn't see dhcp packets on the installserver during the pxe boot failures
[09:55:21] ok
[09:55:45] and today after the 12h wait it worked and reimaged successfully?
[09:55:50] correct
[09:55:58] hmm, the VG issue also just disappeared?
[09:56:02] same behaviour last week with wikikube-ctrl1001
[09:56:22] yes, the VG issue is why I interrupted the first install, and then the 2nd install today completed successfully
[09:56:27] again same as 1001 last week
[09:56:54] that's really odd
[09:57:03] also with a bunch of failed pxe boots in between, I was poking at the NIC config and nothing worked, and waiting did
[09:57:23] servers were purchased 2 years apart so probably not some defective batch or something
[09:57:38] kamila_: have you any more reimages to do?
[09:57:48] yes, I have 1 more I'll be doing today
[09:58:04] and then 3 in codfw whenever dc-ops schedules moving the boxes
[09:58:12] ok
[09:58:28] please ping me if you get the same issue on the one today and I'll try to take a closer look to see what is going on
[09:58:42] will do, thanks topranks
[09:59:32] I wonder if it's possible the switch might be eating dhcp packets? I know some switches do that for "security" reasons, and it's plausible we normally wouldn't notice because usually we don't pxe boot twice in a row
[09:59:41] or it's a fishy NIC, one of those :D
[10:00:23] I think it's unlikely given how many PXE boots we do, like every reimage does 2 in quick succession (first the BIOS, then the Debian installer does it)
[10:00:44] oh, okay, I wasn't aware there were two
[10:00:44] and I'd have often done many more in quick succession troubleshooting other issues (like say config on the install server)
[10:00:50] makes sense
[10:01:07] well, I'll let you know next time I see it
[10:01:09] can't rule it out - we've definitely had issues with Juniper's DHCP relay over the years
[10:01:16] mhm
[10:01:48] but our config at this point is pretty stable, we have thousands of servers, so it's unlikely there is a bug we've only suddenly hit with these
[10:08:00] fair point, thanks
[10:10:10] (ftr, this is the behaviour that made me wonder whether the server was haunted last week :D but now that I have more datapoints, it seems like a config or fw problem + a bit of bad luck rather than outright ghosts)
[10:15:52] I think it's a little too soon to rule out the supernatural here :P
[10:17:56] true, it could just be a ghost that likes haunting consistently :-D
[13:16:20] I posted a proposal for a quick improvement in homer's behavior when working on multiple devices: https://phabricator.wikimedia.org/T250415#9880854
[13:16:30] lemme know if you like it, it should be easy-ish to implement
[16:04:00] o/ I am apparently having a different pxe boot problem with wikikube-ctrl1003 than the first two hosts, it's not pxe booting at all. I don't see dhcp packets on the install server. I have checked firmware versions.
[16:04:31] Interestingly, this one had an extra option in the "legacy boot protocol" menu in NIC settings, so it is different from the first 2 hosts at least in that.
[16:04:46] Any hints for how to debug it?
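The debugging above hinges on whether a DHCPDISCOVER from the PXE client ever reaches the install server. On the wire that is a BOOTP request (op=1) on UDP/67 carrying the RFC 2131 magic cookie and DHCP option 53 (message type) set to 1. A stdlib-only sketch of that check, useful for reasoning about what a capture should contain; it is illustrative and not the capture tooling actually used here:

```python
# Minimal sketch of recognising a PXE client's DHCPDISCOVER in a raw UDP
# payload. Field offsets follow RFC 2131 (fixed 236-byte BOOTP header,
# then the magic cookie, then options). Illustrative only.

DHCP_MAGIC = b"\x63\x82\x53\x63"  # RFC 2131 magic cookie at offset 236

def is_dhcp_discover(payload: bytes) -> bool:
    if len(payload) < 241 or payload[0] != 1:   # op=1 => BOOTREQUEST
        return False
    if payload[236:240] != DHCP_MAGIC:
        return False
    i = 240
    while i + 1 < len(payload):                 # walk the DHCP options
        opt = payload[i]
        if opt == 255:                          # end option
            break
        if opt == 0:                            # pad option, no length byte
            i += 1
            continue
        length = payload[i + 1]
        if opt == 53:                           # DHCP message type
            return payload[i + 2] == 1          # 1 => DHCPDISCOVER
        i += 2 + length
    return False

# A fabricated minimal DISCOVER: BOOTP header, cookie, option 53=1, end.
pkt = bytes([1]) + bytes(235) + DHCP_MAGIC + bytes([53, 1, 1, 255])
print(is_dhcp_discover(pkt))  # True
```

In practice the equivalent question is answered with a packet capture on the install server's UDP port 67; seeing nothing there narrows the fault to the client NIC, the switch, or the DHCP relay in between.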
[16:13:26] (ok, what's different is that this host got a BCM57414 NIC while the first two got a BCM57412... but it's on the FW version specified at https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation )
[16:24:34] 10Mail, 06collaboration-services, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9885481 (10jhathaway)
[17:01:49] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9885755 (10elukey) We have sretest2001 racked and connected to mgmt network, and it is a Supermicro node. I tried to...
[20:47:33] hello I/F friends - network question: for cross-DC (codfw <> eqiad) network demand, is there a "floor" below which we don't care about usage from a capacity planning perspective?
[20:47:33] context: we're fixing some cassandra client configuration, which will allow a handful of services to talk to the full cluster (which spans both core DCs) while restricting all "real" traffic to the local DC. the only cross-DC traffic is some health-checking chatter, but in aggregate it will add ~56 kB/s (bidirectional).
[20:52:42] 10SRE-tools, 10Observability-Logging: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9886699 (10colewhite)
[21:01:30] swfrench-wmf: I would assume that would meet the floor
[21:01:41] I'm not sure if we have it documented anywhere
[21:02:29] but the networking folks may be able to point to some docs, I can only find this: https://wikitech.wikimedia.org/wiki/Wikimedia_network_guidelines#Congestion
[21:04:53] jhathaway: thanks for your response - I'd not seen those docs before. just to confirm, by "meet the floor" you mean "below the floor we care about" or "needs consideration" :)
[21:05:20] I assume that is below the floor we care about
[21:06:06] but top.ranks or X.ioNoX would be the authorities
[21:06:40] feel free to pop an email to our team, if you want a definitive answer
[21:07:07] jhathaway: great, thanks for confirming :) I'll hold off on making any further changes that would make this usage a reality until I've had a chance to talk to them
[21:07:41] email also works, is there a specific team list I should use?
[21:08:26] swfrench-wmf: sre-foundations@wikimedia.org
[21:08:37] fantastic, thanks!
[21:08:41] of course
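To put the ~56 kB/s of cross-DC health-check chatter in perspective, a quick back-of-envelope conversion helps; the 100 Gb/s link capacity below is an assumption for illustration only (the chat does not state the actual cross-DC transport capacity):

```python
# Back-of-envelope for the extra cassandra health-check traffic:
# ~56 kB/s bidirectional (from the conversation), compared against an
# ASSUMED 100 Gb/s cross-DC link, purely to show the order of magnitude.

extra_bytes_per_s = 56 * 1000           # ~56 kB/s
extra_bits_per_s = extra_bytes_per_s * 8

link_bits_per_s = 100e9                 # assumed link capacity, not from the chat
fraction = extra_bits_per_s / link_bits_per_s

print(f"{extra_bits_per_s / 1e3:.0f} kbit/s")  # 448 kbit/s
print(f"{fraction:.2e} of the link")           # 4.48e-06 of the link
```

At well under a millionth of a 100 Gb/s link, it's easy to see why the instinct in the channel was that this sits below any capacity-planning floor, pending the networking folks' definitive answer.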