[00:21:13] toprank.s: thanks for the extensive debugging! I will try to follow along
[00:21:51] re: the reboot, when I hit this error, I did a shutdown as I couldn't get past the "Broadcom" screen and papaul recommended a shutdown (and a power cycle) to fix that, which it did but then I hit this error towards the end
[00:21:59] (not saying that the two are related but that I did reboot the box)
[05:53:35] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.5.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[06:27:10] jbond, volans: there's an apt-get upgrade at the end of the Debian installer step before it reboots into the first Puppet run (to ensure that even if you e.g. install with an old Bullseye 11.5 image you get all the security updates that have been released since then)
[06:28:36] as such, if there's a not-properly installed package post reboot, that points to an issue in our puppetisation (maybe some dependent component isn't installed from a component and instead falls back to some older package from the main repo or similar)
[06:28:52] if we have a freshly imaged system on which this problem shows up, happy to have a closer look
[08:06:42] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:58:41] Puppet, Data-Services, Infrastructure-Foundations, SRE, cloud-services-team: clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (jbond) a:jbond
[09:06:42] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:15:14] CAS-SSO, Infrastructure-Foundations, SRE, serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto) a:Jelto Refactoring of omniauth providers looks good on all instances. Changes as expected. Thanks again @jbond for preparin...
[11:21:14] SRE-tools, netops, Infrastructure-Foundations: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans) p:Triage→Medium
[12:20:58] slyngs: do you already have a test VM where we could test the changes to the reimage cookbook?
[12:21:07] sretest1001 can be the physical one to test
[12:21:33] we can just use one of the test VMs on the test clusters
[12:23:18] So testvm2005 would be fine
[12:24:00] great
[12:30:00] So, is the easiest solution to have a local checkout on one of the cumin hosts?
[12:36:33] slyngs: yes with a copy of the config.yaml pointing to your checkout and then using -c to point to the config
[12:37:33] moritzm, jbond: I've a question on tftp... I see that we have /srv/tftpboot on both the install servers and the apt servers, with slightly different content, but much is the same
[12:37:43] what is the logic behind what goes where?
[12:41:16] and how does that relate to apt.wikimedia.org/tftpboot/ that we use in the dhcp setting
[12:41:38] and makes me wonder if we're doing it wrong btw :)
[12:44:13] SRE-tools, netops, Infrastructure-Foundations, SRE: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (cmooney) Thanks for filing this one! I'm happy with the script in the private repo, but I think it would help if @ayounsi also had a quick look...
[12:46:06] SRE-tools, Infrastructure-Foundations, Spicerack: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 (SLyngshede-WMF)
[12:46:18] SRE-tools, Infrastructure-Foundations, Spicerack: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 (SLyngshede-WMF) p:Triage→Low
[12:52:44] SRE-tools, netops, Infrastructure-Foundations, SRE: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans)
[12:53:37] volans: there may be some cruft from the times when there was a single server to serve the repo and the install servers, can you file a task for this? I'll have a closer look at some point, but currently don't have time for it
[12:54:09] moritzm: Are you okay with me borrowing test2005 for reimage testing?
[12:54:35] moritzm: it's mostly to understand where to put a new file
[12:55:01] slyngs: can you use testvm2002? I need 2005 potentially again as a repro case for the d-i issue with low mem
[12:55:23] I'm adding it to the install_server::tftp_server class and would like to know if there is a hostname I can reach it at or if I have to use the IPs
[12:56:05] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans)
[12:56:06] moritzm: Sure, I don't need it to be Bookworm, Bullseye is fine. Thanks
[12:56:55] volans: not sure what you mean? if you add it to install_server::tftp_server, it will be present on all install* hosts?
[12:57:12] I expect that, yes
[12:57:39] but for pxe for example we use http://apt.wikimedia.org/tftpboot/...
[13:01:52] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans)
[13:04:43] and I was wondering if there is an equivalent domain I could use to target tftpboot files on the install servers
[13:06:49] we don't have anything running on the install* servers which serves files over HTTP(S), if you need that we'd need to add it
[13:07:47] ok, I don't think so but was just checking. Thx
[13:10:10] and no I wasn't looking for http but just to have a way to put a fixed address that would be resolved with the local install server
[13:10:16] without having to hardcode the IPs
[13:11:18] and now volans wonders if we're not serving boot images from the local install servers but from the central apt ones for all reimages...
[13:13:30] SRE-tools, Infrastructure-Foundations, Spicerack: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host testvm2002.codfw.wmnet with OS bullseye
[13:14:06] mmmh I could add a new name and CNAME it to webproxy though....
[13:14:59] although that would be a double CNAME :/
[13:15:30] That's just asking for a circular CNAME at some point :-)
[13:17:56] yeah I don't like it
[13:22:24] topranks: jbond: thanks for the detailed debugging and related patches for the lvs2012 issues!
[13:22:42] I guess my only question is that if this was indeed the issue, why didn't we observe it so far for any of the reimages?
[13:23:39] the other question I have is that I had reimaged to lvs::balancer and still hit this error, so going by that, rp_filter should have already been turned off
[13:23:57] and for the earlier LVS we reimaged, lvs2011, there was no such issue
[13:25:09] I guess doing it insetup makes sure that the host is set up with rp filter disabled in the first step itself but I am not sure why it would matter in that case anyway, since the host will be reimaged to lvs::balancer eventually and it failed there
[13:25:31] I guess I should note this on the ticket itself otherwise it will get lost in IRC!
[13:25:48] sukhe: yeah I'm not 100% sure on what happened here
[13:26:04] at what stage of the process do the vlan interfaces / changes to /etc/network/interfaces file get made?
[13:27:46] you should bear in mind there were two problems here
[13:27:52] 1) switch ports were not set up right
[13:28:00] 2) rp_filter blocking traffic
[13:28:29] right, that's a good reminder
[13:28:42] I fixed the first issue manually in netbox, but it would have caused what you observed on its own
[13:28:49] but I think 2 is a more explicit fix on the Puppet side (changing the rp filter in the kernel) and so lvs2011 should have had that same issue then
[13:28:57] the updated puppetdb script I never stop talking about but never gets merged should prevent that issue :P
[13:28:58] or any of the earlier reimages in the last month
[13:29:18] which makes me believe that 1) is more likely to be the issue?
[13:29:28] for 2 I understand what was happening, but I'm not too familiar with the conditions / order of ops that resulted in it being that way
[13:30:24] we probably can't rule out 1 somehow affecting that but I'm unsure how it might
[13:31:44] yeah that's what I am unsure about!
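(Editorial aside, not part of the channel log.) Since the open question here is whether rp_filter was really off on each interface at a given point, a rough way to check is to read the per-interface values from /proc. The sketch below is an assumption-laden illustration, not the tooling used in this channel: the function name is made up, and it relies on the documented kernel behaviour that the value applied to an interface is the maximum of conf/all/rp_filter and conf/<interface>/rp_filter.

```python
from __future__ import annotations

from pathlib import Path


def effective_rp_filter(conf_root: Path) -> dict[str, int]:
    """Return the rp_filter value in effect for each interface under conf_root.

    conf_root is expected to look like /proc/sys/net/ipv4/conf, i.e. one
    directory per interface plus the special "all" and "default" entries.
    The kernel applies max(all, <interface>) when validating sources.
    """

    def read(name: str) -> int:
        return int((conf_root / name / "rp_filter").read_text().strip())

    all_value = read("all")
    result: dict[str, int] = {}
    for entry in sorted(conf_root.iterdir()):
        # "all" and "default" are not real interfaces, skip them.
        if entry.name in ("all", "default"):
            continue
        if not (entry / "rp_filter").is_file():
            continue
        result[entry.name] = max(all_value, read(entry.name))
    return result
```

On a host this would be called as effective_rp_filter(Path("/proc/sys/net/ipv4/conf")); any interface reporting a non-zero value still has reverse-path filtering active, whatever the files in /etc/sysctl.d say.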
[13:32:04] I am going to comment on the ticket for the historic context too and we can take it from there
[13:32:12] thanks for all the debugging folks, you guys are the best <3
[13:33:06] are we talking only about problems after the install or also the reimage itself?
[13:34:22] I think my take is that the switch ports were not set up right
[13:34:35] simply based on how the other hosts went (no issues)
[13:34:59] so I would say problems after the install but caused by issues before the provisioning?
[13:35:05] not sure if that answers your question
[13:36:13] ack, because anything fixed on the puppet side would not help during d-i, that's why I was asking :)
[13:36:33] If I had to summarize I would say the rp_filter issue happened because the puppet changes to /etc/network/interfaces happened before the /etc/sysctl.d/70-lvs changes
[13:37:15] k
[13:37:34] volans: yep, that's the other part of it, so Puppet is not even in the picture during d-i
[13:37:50] yep
[13:38:03] pxe and d-i ofc, I don't recall where the problem was exactly
[13:41:49] SRE-tools, Infrastructure-Foundations, Spicerack: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host testvm2002.codfw.wmnet with OS bullseye completed: - testvm2002 (**PASS**) - Dow...
[13:43:20] ping -6 still fails for lvs2012 for me though
[13:43:27] cumin2002 I meant
[13:43:29] -4 is fine
[13:43:47] but it was working!!
[13:43:50] haha
[13:43:58] sukhe@cumin2002:~$ ping -6 lvs2012.codfw.wmnet
[13:43:58] PING lvs2012.codfw.wmnet(lvs2012.codfw.wmnet (2620:0:860:102:10:192:16:140)) 56 data bytes
[13:44:00] what dark sorcery is this
[13:44:08] let me have a look
[13:44:10] yeah this one is weird
[13:45:32] volans: Neat, sre.hosts.reimage can now handle VMs.
I won't have time to test physical hosts today, but getting closer
[13:45:50] great
[13:49:00] SKIPPED: py39-flake8: InterpreterNotFound: python3.9
[13:49:00] congratulations :)
[13:49:09] Thanks tox....
[13:49:50] I think the congratulation is a little misplaced :-)
[13:50:12] sukhe: the interface numbering has an issue
[13:50:18] cdanis: I've uploaded wmf-laptop 0.5.7 to apt.wikimedia.org
[13:50:49] both enp152s0f0np0 and vlan sub-int of it vlan2019 have an IP in the 2620:0:860:103::/64 range
[13:51:21] hmmm
[13:52:25] although that doesn't seem to be *the* issue
[13:52:42] it was certainly _an_ issue, and is explained by the incorrect tagging settings on the switch before I corrected them
[13:53:09] ok, I was about to bring a netbox question into it but I don't think that makes sense then
[13:53:14] SLAAC working on the physical int, rather than the vlan one, then working on vlan one after fix, but addresses left on both
[13:54:46] actually that does explain the issue
[13:55:01] strangely the route remains even though I cleared the IPs from the physical int
[13:55:26] https://www.irccloud.com/pastebin/fpQ0e4qR/
[13:55:42] topranks: I am not sure how this is related but I will say it anyway :P
[13:55:53] you remember how we were manually adding the IPs for the vlan interfaces?
[13:56:02] somehow, they all seem to have gone away from netbox?
[13:56:11] https://netbox.wikimedia.org/dcim/devices/3654/interfaces/
[13:56:28] we also set the right types (vlan* is virtual interface, not physical) but that also got reverted
[13:56:36] ok
[13:56:43] well there is no mystery there, I deleted them all :P
[13:56:48] oh I see
[13:57:04] the v4s are there but not the v6
[13:57:18] I didn't realise you use SLAAC autoconfigured IPs on the LVS?
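(Editorial aside, not part of the channel log.) Two IPv6 host-part patterns come up in this exchange: the one where the IPv4 octets are written out as the last four groups (10.192.16.140 becoming ...:10:192:16:140), and the SLAAC one derived from the interface MAC via modified EUI-64. The sketch below illustrates both under stated assumptions: the function names are made up for illustration, and the MAC address 00:62:0b:cb:55:d0 used in the usage note does not appear in the log; it is inferred by inverting the EUI-64 transform on the vlan2019 address quoted in the discussion.

```python
import ipaddress


def eui64_interface_id(mac: str) -> int:
    """Build the modified EUI-64 interface identifier from a MAC address.

    The universal/local bit of the first octet is flipped and ff:fe is
    inserted between the third and fourth octets (RFC 4291, Appendix A).
    """
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    eui = bytes([octets[0] ^ 0x02]) + octets[1:3] + b"\xff\xfe" + octets[3:]
    return int.from_bytes(eui, "big")


def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Combine a /64 prefix with the EUI-64 interface ID of the MAC."""
    network = ipaddress.IPv6Network(prefix)
    return network[eui64_interface_id(mac)]


def embedded_v4_address(prefix: str, ipv4: str) -> ipaddress.IPv6Address:
    """Write each decimal IPv4 octet as one group of the IPv6 host part.

    10.192.16.140 becomes ...:10:192:16:140, so the digits are reused
    as-is and read as hexadecimal rather than converted.
    """
    ipaddress.IPv4Address(ipv4)  # validates the input, raises on bad values
    network = ipaddress.IPv6Network(prefix)
    groups = "".join(f"{int(octet, 16):04x}" for octet in ipv4.split("."))
    return network[int(groups, 16)]
```

With the assumed MAC, slaac_address("2620:0:860:103::/64", "00:62:0b:cb:55:d0") gives the SLAAC-style address seen on vlan2019, while embedded_v4_address("2620:0:860:103::/64", "10.192.33.9") gives the IPv4-derived alternative raised later in the conversation.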
[13:57:24] We don't really do that anywhere else
[13:57:57] Say look at the primary IPs on the main interface for lvs2012:
[13:57:58] eno12399np0 UP 10.192.16.140/22 2620:0:860:102:10:192:16:140/64 fe80::1623:f2ff:fe4d:cd60/64
[13:58:16] The v6 address copied the IPv4 one in the host part of the address (10:192:16:140)
[13:58:32] That is our normal pattern for all hosts
[13:59:26] However I see for the other interfaces the LVS is using a SLAAC auto-generated IP:
[13:59:27] vlan2019@enp152s0f0np0 UP 10.192.33.9/22 2620:0:860:103:262:bff:fecb:55d0/64 fe80::262:bff:fecb:55d0/64
[13:59:51] I wasn't aware of this. I did a purge on all SLAAC IPs from Netbox last week as part of a cleanup effort
[13:59:57] oooooo
[14:00:00] But we can change how things work to allow it again if needed
[14:00:00] that explains it then I think
[14:00:40] I am interested to know if there is a reason using, for instance, 2620:0:860:103:10:192:33:9/64 on vlan2019 would be a problem
[14:00:50] topranks: I would have shared but I don't know the reason :)
[14:01:11] I think bblack is the best person for this so probably you should mention this on the ticket
[14:01:36] I'll open a separate ticket I think
[14:02:19] the non-primary IPv6 on LVSes right?
[14:04:49] yeah, keep it separate
[14:08:33] moritzm: great, thank you!
[14:17:42] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans)
[14:24:12] I guess the question now is
[14:24:28] if I should indeed proceed with lvs2012 provisioning and have it handle prod traffic or not
[14:24:57] I guess we can reimage and try, as I am reimaging with a higher bgp med than lvs2010 so it won't serve prod traffic immediately, only when we remove the override
[14:25:14] the question I do have in mind though is if we are worried about something else failing outside of our usual checks
[14:26:43] I tidied everything up I think.
But I'd probably reboot the box just in case to make sure before bringing up BGP.
[14:27:09] ok yeah! otherwise you think we are good to go? as in, is there something else that is bothering you about this? :)
[14:27:59] can't place where my OCD is bothering me about this but I think it's the fact that it's an LVS host and while they have had their own fair share of problems, I guess nothing like this :P
[14:28:33] I will reimage and see how it goes then decide about removing the MED override
[14:28:47] No I think I caught all the little bits left over from when the vlans were wrong
[14:28:54] and the sysctl settings are there now
[14:29:01] ok thanks, yeah I will give it a shot
[14:29:04] but I'd still reboot just in case there is something I forgot
[14:29:12] yeah I will thanks
[15:02:05] netops, DC-Ops, Infrastructure-Foundations: Access port speed <= 100Mbps False posatives - https://phabricator.wikimedia.org/T336511 (jbond) p:Triage→Medium
[15:38:00] XioNoX, topranks: I might convert the bash script to an ERB template anyway to re-use the public homer key from the private repo, to avoid forgetting to update it when rotating it
[15:39:03] +1
[15:40:44] sounds good
[15:41:08] XioNoX: I added those default mgmt interfaces and console ports to one of the device_types in Netbox:
[15:41:08] https://netbox.wikimedia.org/dcim/device-types/201/interfaces/
[15:41:14] let me know if it looks ok
[15:41:46] topranks: em1 is SFP (or SFP+)
[15:42:03] I had thought that
[15:42:28] https://www.networkscreen.com/images/QFX-Series/QFX5120/qfx5120-48y-rear.jpg
[15:42:40] ^^ this could be an incorrect image of course but it seems to indicate not
[15:43:02] Maybe I'll ask pa.paul to confirm
[15:46:44] sounds good! I thought wrong
[15:48:05] topranks: not sure vme is needed
[15:48:11] eg. https://netbox.wikimedia.org/dcim/devices/3570/interfaces/
[15:48:46] other than that lgtm.
You can add the PSU as well
[15:49:35] I was wondering about those, on existing devices they have serial numbers, so was unsure whether we should add a generic one and then have the serials added after?
[15:49:54] so there is "power port" to add
[15:50:07] and inventory items, and for the templates they don't take serial numbers
[15:50:26] On the vme I will double check. I'm pretty sure for the ssw1 switches I hit a problem without it, but I note it's not on some of the other qfx5120s.
[15:50:32] If we don't need it I'll remove
[15:50:54] I have to step away but happy to have another look later on
[15:51:36] I've to do the same soon too. I'm a tiny bit confused about the power ports - we can pick it up tomorrow
[16:32:34] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Papaul) @Volans i have some switches ready for testing. 2 leaves in different rows and the 2 spines lsw1-a8 lsw1-b8 ssw1-...
[17:00:12] thanks juniper: Also, make sure you specify an IP address, not a hostname, because name resolution is not supported.
[18:28:45] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans)
[19:38:25] Mail, Infrastructure-Foundations, MassMessage, WMF-JobQueue, and 2 others: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Elitre) I hadn't noticed but this happened to me as well on April 19.
I'm about to target 100+ wikis again, crossing fingers :(
[19:52:45] netops, Infrastructure-Foundations, SRE, ops-codfw: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (Papaul)
[19:55:29] netops, Infrastructure-Foundations, SRE, ops-codfw: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (Papaul)
[20:50:59] netbox, Infrastructure-Foundations, SRE: Error creating device in netbox - https://phabricator.wikimedia.org/T336547 (Jclark-ctr)
[20:51:15] netbox, Infrastructure-Foundations, SRE: Error creating device in netbox - https://phabricator.wikimedia.org/T336547 (Jclark-ctr)
[20:55:03] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul) @Jclark-ctr i check again those servers from the switch side see below. Those are using NON-JNPR compatible cables. that is m...
[21:30:24] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Jclark-ctr) Replaced both cables. they where newer wave2wave dac cables
[21:46:44] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul) ` Xcvr 31 REV 01 740-030077 H70824500300 SFP+-10G-CU3M Xcvr 5 REV 01 740-030077 G1807123036-1...
[22:25:20] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul)
[23:19:38] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul)
[23:42:09] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul) @Andrew yes we can still do the os install part and resolve this task when we will be ready to do network changes we can...