[02:01:21] (PKICertificateExpiry) firing: (84) A certificate in the trust chain for aux_front_proxy expires in 12d 11h 45m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[03:48:44] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:01:21] (PKICertificateExpiry) firing: (84) A certificate in the trust chain for aux_front_proxy expires in 12d 7h 45m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[07:48:44] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:16] topranks, XioNoX: wrt T355899 do you need me to have a deeper look or you're happy with the tested behaviour on netbox-next for now?
[09:18:16] T355899: Netbox MoveServersUplinks script doesn't handle trunked ports correctly - https://phabricator.wikimedia.org/T355899
[09:24:18] volans: I guess no harm if you were to have a quick look, perhaps you might spot something we missed
[09:24:49] it’s not mega-important, I’ve been fixing manually where needed for the codfw switch moves which isn’t too hard
[09:25:08] ck
[09:25:10] ack
[09:25:13] otherwise our hope was simply that it was some bug that would disappear when we upgrade
[09:25:34] definitely looks to be some bug so probably not worth a massive effort
[09:27:15] +1, depends on your other priorities :)
[09:27:41] the netbox upgrade will shortly be renamed the panacea for all bugs :D
[09:29:40] of course
[10:01:21] (PKICertificateExpiry) firing: (84) A certificate in the trust chain for aux_front_proxy expires in 12d 3h 45m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[11:48:44] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:47:40] netops, Infrastructure-Foundations, SRE: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9576320 (cmooney) p:Triage→Medium
[13:48:16] volans: I hit an issue with reimaging hosts on our public vlans in codfw row a/b under the new setup
[13:48:19] (above task)
[13:48:55] it leads me to a question regarding how our DHCP server operates
[13:49:13] it wasn't immediately clear to me looking at the config - but will the server only accept requests from subnets it is configured for?
[13:49:42] i.e. is there a way a DHCP request with this cct id:
[13:49:48] "lsw1-b8-codfw:ge-0/0/41.0:public1-b-codfw"
[13:50:06] would get a response if the packet came from 10.192.255.15
[13:50:26] (as opposed to arriving from an IP on the public1-b-codfw subnet)
[13:52:27] hey
[13:52:36] it shouldn't matter from where the request comes no?
[13:53:05] I did a quick test and the packet didn't get a response when it came from the loopback IP of the switch
[13:53:11] (which is what the above is)
[13:53:34] iptables has a rule that blocked it, but even with a temp additional rule dhcpd didn't respond
[13:54:10] and it had responded to previous similar requests, the only difference being the source IP
[13:54:39] anything in the logs?
[13:54:41] if you don't know off-hand I can do some research and try to work out what it doesn't like
[13:54:46] good question :)
[13:55:20] but I guess it's possible that if the subnet is different it doesn't match
[13:57:05] yeah could be some automatic filtering
[13:57:35] nothing in logs about it, which might also support that
[14:01:23] (PKICertificateExpiry) firing: (84) A certificate in the trust chain for aux_front_proxy expires in 11d 23h 45m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[14:43:40] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review, cloud-services-team (FY2023/2024-Q3-Q4): spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9576504 (fnegri) a:fnegri I have updated the patch by @dcaro (https...
[14:43:48] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review, cloud-services-team (FY2023/2024-Q3-Q4): spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9576507 (fnegri)
[14:49:00] (SystemdUnitFailed) firing: (2) isc-dhcp-server.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:13:29] (SystemdUnitFailed) firing: (2) isc-dhcp-server.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:30:17] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Spicerack: Add a cookbook to safely deploy puppet changes - https://phabricator.wikimedia.org/T341442#9576681 (joanna_borun) p:Triage→Medium
[15:31:35] Mail, Infrastructure-Foundations, SRE: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9576691 (joanna_borun) p:Triage→Medium
[15:36:21] Puppet, Cloud-VPS, Infrastructure-Foundations, cloud-services-team: wmf_auto_restart_cron.service failing in Cloud VPS bookworm instances - https://phabricator.wikimedia.org/T358343#9576710 (MoritzMuehlenhoff) p:Triage→Medium a:MoritzMuehlenhoff
[15:36:27] Mail, Infrastructure-Foundations, SRE: Integrations tests - https://phabricator.wikimedia.org/T358355#9576711 (joanna_borun) p:Triage→Medium
[15:44:22] netops, Infrastructure-Foundations, Traffic, Sustainability (Incident Followup): Primary outbound port utilisation over 80% alert muted - https://phabricator.wikimedia.org/T358455#9576748 (joanna_borun) Open→Resolved
[15:45:43] netops, Infrastructure-Foundations, Traffic, Sustainability (Incident Followup): Primary outbound port utilisation over 80% alert muted - https://phabricator.wikimedia.org/T358455#9576750 (CDanis) This would best be fixed by extending the haproxy bwlim work done in T317799 -- we've talked about h...
[15:46:24] Mail, Infrastructure-Foundations, SRE, Patch-For-Review, User-MoritzMuehlenhoff: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343#9576754 (jhathaway) a:jhathaway
[15:46:56] SRE-tools, Infrastructure-Foundations, Spicerack: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325#9576756 (ayounsi) a:ayounsi
[15:47:21] Packaging, Wikidata, Wikidata-Query-Service, [DEPRECATED] wdwb-tech: Migrate WDQS to Java 11 - https://phabricator.wikimedia.org/T316103#9576758 (joanna_borun)
[15:49:15] netbox, Infrastructure-Foundations: Netbox MoveServersUplinks script doesn't handle trunked ports correctly - https://phabricator.wikimedia.org/T355899#9576760 (cmooney) p:High→Medium Lowering priority, we're getting by ok doing the trunk ports manually during the switch migrations.
[15:58:26] Packaging, Thumbor, Wikimedia-SVG-rendering: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549#9576795 (joanna_borun)
[16:05:20] Packaging, Infrastructure-Foundations, SRE, serviceops: Package php-ast in {stretch,buster}-wikimedia/component - https://phabricator.wikimedia.org/T280210#9576817 (joanna_borun)
[16:05:28] Packaging, Infrastructure-Foundations, SRE, serviceops: Package php-ast in {stretch,buster}-wikimedia/component - https://phabricator.wikimedia.org/T280210#9576820 (joanna_borun) @Reedy is it still valid?
[16:42:02] netops, Infrastructure-Foundations, SRE: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577083 (cmooney) Digging a little deeper on this the source IP of the packets hitting the install server don't really matter, what is mo...
[18:03:17] (PKICertificateExpiry) firing: (84) A certificate in the trust chain for aux_front_proxy expires in 11d 19h 43m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[18:05:42] netops, Infrastructure-Foundations, SRE: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577561 (cmooney) Juniper seem to document this scenario here, and advise using the "link-selection" keyword: https://www.juniper.net/do...
[18:19:28] netops, Infrastructure-Foundations, SRE: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577613 (cmooney) After issuing a manual release of the IP and trying again things seem to be working as expected: ` cmooney@install2004:...
[18:21:18] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review, cloud-services-team (FY2023/2024-Q3-Q4): spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9577617 (Volans) @fnegri Thanks a lot for resuming this and taking care...
[18:23:11] netops, Infrastructure-Foundations, SRE: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577627 (cmooney) So I think the solution is: # Add the "link-selection" command to the config on EVPN switches to add the IRB interface...
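Note on the "link-selection" idea in T358488: a minimal sketch of how a DHCP server could pick the client's subnet from relayed packets, assuming RFC 3046 (Relay Agent Information, option 82) and RFC 3527 (Link Selection sub-option) semantics. This is not dhcpd's code and not the config from the task. The helper names `parse_option82` and `select_subnet` are hypothetical, 198.51.100.0/24 is a documentation placeholder standing in for the public1-b-codfw subnet, and treating the switch loopback 10.192.255.15 as the relay address (giaddr) is an assumption.

```python
import ipaddress
from typing import Dict, List, Optional

CIRCUIT_ID = 1      # RFC 3046 sub-option: Agent Circuit ID
LINK_SELECTION = 5  # RFC 3527 sub-option: Link Selection


def parse_option82(payload: bytes) -> Dict[int, bytes]:
    """Split an option 82 payload into {sub-option code: value}."""
    subopts: Dict[int, bytes] = {}
    i = 0
    while i + 2 <= len(payload):
        code, length = payload[i], payload[i + 1]
        value = payload[i + 2 : i + 2 + length]
        if len(value) != length:
            raise ValueError("truncated sub-option")
        subopts[code] = value
        i += 2 + length
    if i != len(payload):
        raise ValueError("trailing byte after last sub-option")
    return subopts


def select_subnet(
    giaddr: str,
    subopts: Dict[int, bytes],
    subnets: List[ipaddress.IPv4Network],
) -> Optional[ipaddress.IPv4Network]:
    """Prefer the link-selection address; fall back to giaddr (assumed model)."""
    raw = subopts.get(LINK_SELECTION)
    if raw is not None and len(raw) != 4:
        raise ValueError("link-selection must carry a 4-byte IPv4 address")
    addr = ipaddress.IPv4Address(raw if raw is not None else giaddr)
    for net in subnets:
        if addr in net:
            return net
    return None  # no configured subnet matches, so no offer can be made


# Placeholder subnet; the real public1-b-codfw prefix is not given in the log.
subnets = [ipaddress.ip_network("198.51.100.0/24")]
circuit = b"lsw1-b8-codfw:ge-0/0/41.0:public1-b-codfw"

# Relay info carrying only the circuit ID, as quoted in the channel.
without_ls = bytes([CIRCUIT_ID, len(circuit)]) + circuit
# Same relay info plus a link-selection address on the client's subnet.
with_ls = (
    without_ls
    + bytes([LINK_SELECTION, 4])
    + ipaddress.IPv4Address("198.51.100.1").packed
)

no_match = select_subnet("10.192.255.15", parse_option82(without_ls), subnets)
match = select_subnet("10.192.255.15", parse_option82(with_ls), subnets)
print(no_match, match)
```

Under these assumptions the loopback alone matches no configured subnet, while adding the link-selection address lets the lookup land on the client's subnet even though the relay address is unchanged.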
[18:51:17] netops, Infrastructure-Foundations, SRE: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577701 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2003.codfw.wmnet with...
[19:13:29] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:23:46] netops, Infrastructure-Foundations, SRE: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9577866 (cmooney) p:Low→Medium Actually a different need to upgrade has now become clear, relating to the issue detailed in T358488 The solution to that requ...
[19:27:33] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577879 (cmooney) >>! In T358488#9577627, @cmooney wrote: > # Add the "link-selection" command to the config on EVP...
[19:30:28] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577883 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2003...
[19:45:22] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577919 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest...
[20:19:04] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577971 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2003...
[22:03:17] (PKICertificateExpiry) firing: (84) A certificate in the trust chain for aux_front_proxy expires in 11d 15h 43m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[23:13:29] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed