[09:41:02] topranks: good morning, for when you'll be around.... how did it go with the dhcp issue andrew was having? Is there anything I should look into?
[09:42:18] volans: hey. I didn’t get to the bottom of it unfortunately.
[09:42:52] Symptoms are like the DHCP replies are not getting back to the host, but every check I could do suggests they are.
[09:42:57] np, I'm trying to get backlogs from the various chans, but if you have a gist of it, it might be quicker :D
[09:43:02] But I’ve not been able to prove that.
[09:44:17] I was thinking I might boot the system from an ISO from the iDRAC and observe what happens when a DHCP req is issued.
[09:45:06] No suggestion there is anything wrong with the cookbook or install server setup that I could find.
[09:45:18] So don’t think there is anything that needs your attention right now
[09:46:02] let me check something, to make sure that the host has PXE enabled on the right iface
[09:46:06] TL;DR the host sends multiple DHCP requests / renews, then boots from HDD.
[09:46:36] Yeah if it sent the HTTP req on the wrong interface that might make sense
[09:49:01] topranks: it ended up working, right?
[09:51:50] XioNoX: I don't think so, certainly not before I had to log off Friday
[09:52:00] maybe I missed some updates though
[09:53:10] One thing that occurs to me now is that only the initial packet from the host is a DHCP DISCOVER
[09:53:25] Subsequent ones are DHCP REQUESTs to renew the lease for the IP it was assigned.
[09:53:45] Which clearly suggests the host is getting the responses from the install server
[09:54:38] if it was going out of the wrong interface it would not get to the DHCP server either, and if the option 82 was incorrect, it would not send a reply back
[09:56:09] yep
[09:56:15] has anyone looked at the console while the host is rebooting into pxe?
[09:56:56] Andrew was, but yeah perhaps one of us should do so
[09:57:19] and remind me if the hw firmware was upgraded or not
[09:57:58] yeah Rob says on the ticket he "updated the firmware of the bios, both network cards, raid controller, and backplane."
[09:58:18] I'm not sure if the iDRAC was updated.
[10:06:23] according to netbox the host has 2 10G nics connected to the switch, from the redfish API I can see that PXE is enabled on the iface called NIC.Integrated.1-1-1 that has MAC BC:97:E1:A7:3B:D8, which on the host matches eno1np0.
[10:06:23] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8
[10:06:29] lol stashbot
[10:07:14] it's not possible that the switch is sending back some traffic on the wrong nic, right?
[10:07:25] what do you mean?
[10:08:19] that it has 2 cables to the switch as opposed to most of our servers
[10:08:44] I was wondering if there is any possibility that this could be a problem here
[10:09:10] I doubt it, but just making sure we can exclude it
[10:11:54] Not that I can think of
[10:12:25] if the DHCP server replies it means the request is coming from the interface that is supposed to have the IP as declared in Netbox
[10:12:58] yep but are we sure the reply goes back up to the server?
[10:14:16] I'm fairly certain that, given the server's second and subsequent DHCP REQUEST packets are trying to renew the lease for 10.64.20.52, we can assume the initial response to its DHCP DISCOVER did get to it.
[10:15:30] ok
[10:15:36] The switch only knows that MAC on one port (as it should), and should send the reply only to that port.
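The Redfish check mentioned above can be scripted; below is a minimal sketch in Python that lists the NICs the iDRAC exposes together with their MAC addresses, so they can be cross-checked against Netbox and the switch port. The iDRAC address and credentials are placeholders, and the resource paths follow the generic Redfish schema as exposed by Dell iDRACs (System.Embedded.1, NIC.Integrated.1-1-1); exact attribute names can vary by firmware.

    # Sketch only: enumerate NICs via the Redfish API and print their MACs.
    # IDRAC_HOST and the credentials are placeholders, not real values.
    import requests
    import urllib3

    urllib3.disable_warnings()  # iDRACs commonly use self-signed certificates

    IDRAC_HOST = "cloudvirt1028.mgmt.example"   # placeholder mgmt address
    AUTH = ("root", "changeme")                 # placeholder credentials
    BASE = f"https://{IDRAC_HOST}"
    SYSTEM = f"{BASE}/redfish/v1/Systems/System.Embedded.1"  # Dell-style system id

    s = requests.Session()
    s.auth = AUTH
    s.verify = False

    # List the ethernet interfaces the system exposes and print id, MAC and state,
    # to confirm which NIC (e.g. NIC.Integrated.1-1-1 / eno1np0) matches Netbox.
    members = s.get(f"{SYSTEM}/EthernetInterfaces").json().get("Members", [])
    for m in members:
        nic = s.get(BASE + m["@odata.id"]).json()
        print(nic.get("Id"), nic.get("MACAddress"), nic.get("Status", {}).get("State"))

Whether PXE is actually enabled on a given port usually lives in vendor-specific BIOS/NIC attributes rather than in this resource, so that part still needs the attribute lookup described above.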
[10:16:43] Going back to the original logs you posted volans, why is it trying to renew the DHCP lease 4 seconds after it got it:
[10:16:55] Dec 2 01:55:44 install1003 dhcpd[19332]: DHCPOFFER on 10.64.20.52 to bc:97:e1:a7:3b:d8 via 10.64.20.2
[10:16:55] Dec 2 01:55:48 install1003 dhcpd[19332]: DHCPREQUEST for 10.64.20.52 (208.80.154.32) from bc:97:e1:a7:3b:d8 via 10.64.20.3
[10:20:17] The PXE boot could be failing, and it re-tries DHCP, but that seems a little quick for that process.
[10:20:38] I think it's probably worth checking with cloud services if we can re-try this, and one of us observe the console and see if we notice anything
[10:24:23] 10Puppet, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, 10Patch-For-Review: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (10Kormat) 05Open→03Resolved >>! In T296285#7545417, @jcrespo wrote: > ` > from:...
[10:46:33] 10netops, 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) @JAllemandou This is great, thanks! Note that we can tune sampling to adapt. What would be the next steps?
[12:31:28] topranks: fwiw I don't see any "Serving lpxelinux.0" log in syslog on the install server related to cloudvirt1028's IP, which means it was never requested AFAICT.
[12:34:24] yep I came to the same conclusion.
[12:36:07] But I am satisfied that the DHCP reply has the right URL in it, and the reply is getting to the host. So why it doesn't initiate, or fails to initiate, the PXE boot is the question.
[12:36:22] "fails to properly"
[12:38:22] ack
[12:38:26] getting lunch, bbiab
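The OFFER→REQUEST timing question is easy to check mechanically. The sketch below, assuming standard ISC dhcpd syslog output on the install server (the log path is an assumption), pulls all dhcpd lines for the MAC seen above and prints the delta between consecutive events, which makes the 4-second gap and any retry pattern obvious.

    # Sketch only: summarise dhcpd events for one MAC with time deltas.
    # Assumes ISC dhcpd logging to /var/log/syslog in the usual "Dec  2 01:55:44" format.
    import re
    from datetime import datetime

    MAC = "bc:97:e1:a7:3b:d8"
    LOG = "/var/log/syslog"   # adjust to wherever dhcpd logs on the install server

    events = []
    with open(LOG) as fh:
        for line in fh:
            if "dhcpd" not in line or MAC not in line:
                continue
            # syslog timestamps carry no year; assume the current one for delta purposes
            ts = datetime.strptime(line[:15], "%b %d %H:%M:%S").replace(year=datetime.now().year)
            kind = re.search(r"DHCP[A-Z]+", line)
            events.append((ts, kind.group(0) if kind else "?"))

    prev = None
    for ts, kind in events:
        delta = (ts - prev).total_seconds() if prev else 0
        print(f"{ts}  +{delta:>5.0f}s  {kind}")
        prev = ts

A run of DISCOVER followed only by lease-renewing REQUESTs, with no subsequent "Serving lpxelinux.0" entry, would match the behaviour described above.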
[13:06:21] moritzm, jbond et all: are you looped in on friday's mx2001 issue yet?
[13:07:23] et al*
[13:11:30] diffscan notified us that it's not reachable publicly, is there a task?
[13:11:54] there was an incident
[13:12:29] ah right, in _security's topic
[13:12:29] there is an incident doc, mutante was IC but said he didn't get to capture everything and was going to go back to fill it in
[13:13:15] basically mx2001 was receiving half of the emails as a result of some recent changes (all reverted now I believe, including the LDAP change that was in the end unrelated)
[13:13:19] but was unable to deliver them
[13:13:32] the latest theory we had was that it's a 5.10.70 conntrack bug
[13:13:47] basically the SMTP conversation was getting stuck after DATA/BDAT (resembling an MTU issue)
[13:14:00] but as soon as we added an iptables rule at the beginning to accept, it all started working
[13:14:18] so for some really weird reason the ESTABLISHED,RELATED rule was dropping packets in the middle of that connection
[13:14:35] (my theory is it's getting confused with tcp window scaling or something, but who knows)
[13:15:26] the rebooted box with 5.10.46 seems to work fine, although there is a possibility the reboot fixed it for another reason, rather than the older kernel
[13:18:00] And Murphy's law with "Mx2001 came up with no IP"
[13:18:14] yeah, I read through the doc and backscroll, a kernel bug seems quite plausible, in comparison to 5.10.46 this kernel was also affected by the conntrack VRF bug which affected cloudgw (and maybe those are even linked in some way)
[13:19:04] 5.10.70 was enabled on a number of additional bullseye VMs by means of the Ganeti update which required qemu restarts
[13:19:42] I wanted to discuss in the IF meeting later on, but will likely roll those back to 5.10.46 out of caution as well
[13:21:28] the forthcoming Bullseye 11.2 update on the 18th will ship an update to the current 5.10.x releases and in addition to the revert of the VRF bug there's also a few more conntrack related changes, none with a 100% clear match, but could also be ripple effects
[13:32:05] nod
[13:32:54] moritzm: btw, mutante rebooted mx2001 and the interface was renamed from ens3 to ens13?!
[13:33:06] and /e/n/i wasn't adjusted of course so he had to manually fix it
[13:33:19] mentioning it because it could be an artifact of the ganeti/qemu upgrade
[13:33:23] and could affect reboots of other hosts
[13:33:33] PCI ID changed?
[13:33:51] yeah there is a task somewhere
[13:34:23] yeah, I'll investigate this further, we did see this in the past, but could also be linked to the update
[13:34:27] https://phabricator.wikimedia.org/T272555
[13:34:31] was the older task
[13:35:27] there's a handful of unused-in-codfw VMs which I'll reboot to narrow this down
[13:40:14] if we get hosts with renamed interfaces we should re-run the puppetdb import script in Netbox to match reality. Moritz, if that happens after the reboots let me know and I can automate those runs
[13:48:09] ack, thanks. first need to understand this a little further, but will reach out as needed :-)
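One way to catch the ens3→ens13 kind of rename before it bites is to compare the interfaces named in /etc/network/interfaces with what the kernel actually has after a reboot. A minimal sketch follows, assuming an ifupdown-style config as on these VMs; the stanza parsing is deliberately naive and only meant to flag obvious mismatches, not to replace the Netbox puppetdb import mentioned above.

    # Sketch only: flag interfaces configured in /etc/network/interfaces
    # that the running kernel does not expose (e.g. ens3 after a rename to ens13).
    import os
    import re

    ENI = "/etc/network/interfaces"

    # Interfaces the kernel currently exposes.
    present = set(os.listdir("/sys/class/net"))

    # Very naive parse: only look at "iface <name> ..." stanzas.
    configured = set()
    with open(ENI) as fh:
        for line in fh:
            m = re.match(r"\s*iface\s+(\S+)", line)
            if m:
                configured.add(m.group(1))

    missing = configured - present
    if missing:
        print("configured but not present (possible rename?):", ", ".join(sorted(missing)))
    else:
        print("all configured interfaces exist")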
[14:32:50] 10netops, 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10Ottomata) > @BTullis thanks! Real-time, would be a nice plus, but a hard requirement (unlike netflow). Did you mean _not_ a hard require...
[15:40:10] 10SRE-tools, 10homer, 10netbox, 10netops, and 3 others: Investigate Capirca - https://phabricator.wikimedia.org/T273865 (10ayounsi) 05In progress→03Stalled Waiting for Capirca upstream to merge PRs.
[15:52:12] 10netbox, 10Infrastructure-Foundations: Import row information into Netbox for Ganeti instances - https://phabricator.wikimedia.org/T262446 (10joanna_borun) 05Open→03In progress
[15:52:17] 10Puppet, 10Infrastructure-Foundations, 10SRE, 10observability, and 2 others: Puppet: get data (row, rack, site, and other information) from Netbox - https://phabricator.wikimedia.org/T229397 (10joanna_borun)
[15:55:32] 10netops, 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) > Did you mean _not_ a hard requirement? Yep, my bad :)
[16:59:37] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (10ayounsi) 05Open→03Resolved Alright, closing this for now then :)
[17:33:30] 10Mail, 10Infrastructure-Foundations, 10SRE: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10Dzahn)
[17:36:57] 10SRE-tools, 10netbox, 10Infrastructure-Foundations: Netbox support for svc allocation - https://phabricator.wikimedia.org/T263429 (10Volans) 05In progress→03Open
[17:37:08] 10Mail, 10Infrastructure-Foundations, 10SRE: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10Dzahn) 05In progress→03Resolved a:03Dzahn This became incident T297127 for which we will shortly release a public incident report (as part of the incident ticket, but will...
[17:37:11] 10netbox, 10Infrastructure-Foundations: Add git-local-changes check for netbox-extras - https://phabricator.wikimedia.org/T250288 (10Volans) 05In progress→03Open
[17:37:52] 10netbox, 10Infrastructure-Foundations, 10IPv6, 10User-jbond: Some clusters do not have DNS for IPv6 addresses (TRACKING TASK) - https://phabricator.wikimedia.org/T253173 (10Volans) 05In progress→03Open
[17:38:06] 10netbox, 10Infrastructure-Foundations: Netbox CSV dumps can't be compared - https://phabricator.wikimedia.org/T262671 (10Volans) 05In progress→03Open
[17:38:40] 10netbox, 10Infrastructure-Foundations: Netbox missing hourly dumps - https://phabricator.wikimedia.org/T262674 (10Volans) 05In progress→03Open
[17:41:12] 10SRE-tools, 10Infrastructure-Foundations: Manage DHCP of Ganeti VMs from Netbox - https://phabricator.wikimedia.org/T297133 (10Volans)
[17:41:25] 10SRE-tools, 10Infrastructure-Foundations: Manage DHCP of Ganeti VMs from Netbox - https://phabricator.wikimedia.org/T297133 (10Volans) p:05Triage→03Medium
[17:43:32] 10netbox, 10Infrastructure-Foundations, 10Patch-For-Review: Manage DHCP from Netbox - https://phabricator.wikimedia.org/T269855 (10Volans) 05In progress→03Resolved The automation of the DHCP for physical hosts has been completed and there are no more MAC addresses hardcoded in Puppet for those hosts. See...
[17:46:55] 10Mail, 10Infrastructure-Foundations, 10SRE: Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 (10herron)
[18:21:52] 10Mail, 10Infrastructure-Foundations, 10SRE, 10Sustainability (Incident Followup): Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 (10Legoktm)
[20:30:45] 10Mail, 10Infrastructure-Foundations, 10SRE, 10observability, 10Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (10Dzahn)
[20:31:25] 10Mail, 10Infrastructure-Foundations, 10SRE, 10observability, 10Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (10Dzahn)
[20:35:09] 10Mail, 10Infrastructure-Foundations, 10SRE, 10observability, 10Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (10Dzahn)
[20:36:32] 10Mail, 10Infrastructure-Foundations, 10SRE, 10observability, 10Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (10Dzahn) deep link to existing Icinga check: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=mx2001&service=exi...
[21:10:00] 10Mail, 10Infrastructure-Foundations, 10SRE, 10observability, 10Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (10herron)
[21:19:05] 10Mail, 10Infrastructure-Foundations, 10Observability-Metrics, 10SRE, 10Sustainability (Incident Followup): Add exim queue size to grafana graph - https://phabricator.wikimedia.org/T275867 (10herron) In addition to the overall queue totals `exiqsumm` provides a breakdown by destination domain. It would...
[21:19:40] 10Mail, 10Infrastructure-Foundations, 10SRE, 10observability, 10Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (10herron) T275867 may be of interest here as well
[21:22:39] 10Mail, 10Infrastructure-Foundations, 10SRE, 10observability, 10Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (10herron) We'll also want to think about the failure modes for this alert specifically, e.g. if mail is significantly impacted how w...
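For the "large MX queues should page" idea discussed in T297144, the simplest signal is the overall queue count. A minimal sketch follows, assuming `exim -bpc` (which prints the number of queued messages) is available on the MX host; the threshold and the exit-code convention are placeholders for illustration, not the actual check being proposed on the task.

    # Sketch only: report when the exim queue grows beyond a threshold.
    # `exim -bpc` prints the count of messages currently in the queue;
    # THRESHOLD and the alerting behaviour are placeholders.
    import subprocess
    import sys

    THRESHOLD = 1000  # arbitrary example value, not the real paging threshold

    out = subprocess.run(["exim", "-bpc"], capture_output=True, text=True, check=True)
    queue_size = int(out.stdout.strip())

    if queue_size > THRESHOLD:
        print(f"CRITICAL: exim queue has {queue_size} messages (> {THRESHOLD})")
        sys.exit(2)  # Icinga/Nagios-style critical exit code
    print(f"OK: exim queue has {queue_size} messages")
    sys.exit(0)

A per-destination breakdown, as T275867 suggests via `exiqsumm`, would need extra parsing on top of this.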
[21:24:05] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE Observability (FY2021/2022-Q2), 10Sustainability (Incident Followup): Alert that should have paged did not reach VictorOps because of partial networking outage - https://phabricator.wikimedia.org/T294166 (10herron)
[21:24:11] 10Mail, 10Infrastructure-Foundations, 10SRE, 10observability, 10Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (10herron)
[21:24:17] 10Mail, 10Infrastructure-Foundations, 10Observability-Metrics, 10SRE, 10Sustainability (Incident Followup): Add exim queue size to grafana graph - https://phabricator.wikimedia.org/T275867 (10herron)
[21:27:55] 10Mail, 10Infrastructure-Foundations, 10SRE, 10Znuny, 10serviceops: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (10Dzahn)
[22:13:02] 10Mail, 10Infrastructure-Foundations, 10SRE: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10Dzahn) @bcampbell This actually turned out to be a firewall dropping packets due to a kernel bug. I shared a doc with you if you are curious.
[22:15:11] 10Mail, 10Infrastructure-Foundations, 10SRE, 10Znuny, 10serviceops: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (10Legoktm) My understanding per T225623#5253119 is that `@ticket.wikim...
[22:30:44] 10Mail, 10Infrastructure-Foundations, 10SRE: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10bcampbell) @Dzahn Thanks for sharing the doc, that's helpful. Are there any outstanding emails left in the queue?
[22:41:22] 10Mail, 10Infrastructure-Foundations, 10SRE: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10Dzahn) @bcampbell No more mails in the queue and exim is still disabled on the server that was affected. Mail is currently handled by the other server.
[22:42:42] 10Mail, 10Infrastructure-Foundations, 10SRE: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10bcampbell) @Dzahn Got it, thank you for clarifying.
[23:49:24] 10Mail, 10Infrastructure-Foundations, 10SRE: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10Dzahn) https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-12-03_mx