[09:45:55] * jbond rebooting sretest1002
[10:33:50] Packaging, Infrastructure-Foundations, User-Kormat: generate-debdeploy-spec breaks when trying to use the transition feature - https://phabricator.wikimedia.org/T260680 (jbond)
[11:21:26] Packaging, Infrastructure-Foundations, User-MoritzMuehlenhoff: Sort out which RAID packages are still needed - https://phabricator.wikimedia.org/T216043 (jbond)
[11:44:14] netops, Infrastructure-Foundations, SRE: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (cmooney)
[11:44:22] netops, Infrastructure-Foundations, SRE: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (cmooney) Open→Resolved Merged patch based on option 5, but using hostname rather than any other var to determine device class....
[13:02:40] Packaging, Infrastructure-Foundations, Platform Engineering (Icebox): New upstream jvm-tools - https://phabricator.wikimedia.org/T178839 (jbond)
[13:43:42] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:44:34] Mail, Infrastructure-Foundations: Look into behaviour of /etc/exim4/update-exim4.conf.conf related to updates - https://phabricator.wikimedia.org/T154665 (jbond) > Additionally you might want to set dc_eximconfig_configtype=none We already add this so can we just close this task?
[13:47:16] Puppet, Infrastructure-Foundations, SRE, Proposal: Revisit and update python testing in puppet - https://phabricator.wikimedia.org/T209189 (jbond)
[14:03:42] (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:29:08] jbond: +1 on https://phabricator.wikimedia.org/T183210#8873095
[14:30:51] Just read that and I'm inclined to agree too
[14:32:50] cheers
[14:43:42] (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:15] Packaging, Infrastructure-Foundations: Remove old builds on package builder - https://phabricator.wikimedia.org/T237713 (jbond)
[14:45:33] SRE-tools, Infrastructure-Foundations, SRE: debmonitor-client: Warning printed with su from buster - https://phabricator.wikimedia.org/T216832 (jbond)
[15:48:50] SRE-tools, Infrastructure-Foundations, Datacenter-Switchover: Hide `systemctl is-enabled` output in switchover cookbooks - https://phabricator.wikimedia.org/T285520 (jbond)
[15:49:16] SRE-tools, Infrastructure-Foundations, Datacenter-Switchover: --live-test mode of switchdc cookbook should auto downtime "High average GET latency" alerts - https://phabricator.wikimedia.org/T285521 (jbond)
[16:28:09] Packaging, Infrastructure-Foundations, Platform Engineering (Icebox): New upstream jvm-tools - https://phabricator.wikimedia.org/T178839 (Eevans) >>! In T178839#8873137, @jbond wrote: >>>! In T178839#3835604, @Eevans wrote: >>>>! In T178839#3811294, @Eevans wrote: >>> [ ... ] > looking at the task hi...
[16:35:23] SRE-tools, Infrastructure-Foundations, User-MoritzMuehlenhoff: debmonitor: Traceback in the apt hook when purging a package in rc state - https://phabricator.wikimedia.org/T273269 (jbond)
[16:38:09] Packaging, Infrastructure-Foundations, SRE, User-MoritzMuehlenhoff: Reprepro should bail if it can't read and sign using the root keys - https://phabricator.wikimedia.org/T116951 (jbond)
[16:41:38] netops, DC-Ops, Infrastructure-Foundations: add contract end dates to the ops maint & contract gcal - https://phabricator.wikimedia.org/T84585 (jbond)
[18:43:42] (SystemdUnitFailed) firing: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:09:49] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (cmooney)
[19:10:10] netops, Infrastructure-Foundations, SRE: Expose sub-rated circuit speeds to Homer templates - https://phabricator.wikimedia.org/T328313 (cmooney) Open→Resolved This is now modeled in Netbox in the 'upstream_speed' field of the z-end of a circuit termination. The one service we have where it...
[19:12:51] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (Jclark-ctr) Completed today 1 E1 lvs1018 lsw1-e1-eqiad xe-0/0/47 ssw1-e1-eqiad xe-0/0/33
[19:19:26] netops, Infrastructure-Foundations, SRE: Expose sub-rated circuit speeds to Homer templates - https://phabricator.wikimedia.org/T328313 (cmooney) Resolved→Open
[19:19:45] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (cmooney)
[19:47:18] netops, Infrastructure-Foundations, SRE: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (ayounsi) We had a chat about this. The first iteration will be a manual cookbook that takes a host as parameter. The cookbook will connect to the device and see if there is alre...
[20:03:42] (SystemdUnitFailed) resolved: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:07:42] (SystemdUnitFailed) firing: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:10:23] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (cmooney) Thanks @Jclark-ctr I think we're good to do the other two lvs moves whenever you are ready. Please ping me on irc and we can arran...
[20:32:44] volans: https://blog.pypi.org/posts/2023-05-23-removing-pgp/
[20:36:07] SRE-tools, netops, Infrastructure-Foundations: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (Volans) p:Triage→High
[20:37:24] XioNoX: doh, I knew that day will come, btw any idea for ^^^ ( T337345 )
[20:37:24] T337345: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345
[20:39:05] looking
[20:45:08] could be related to the work from https://phabricator.wikimedia.org/T320508 but no obvious issue so far
[20:45:51] right that could be an option
[20:46:16] do you know if any dhcp-related operation from the mgmt network succeeded after changing that?
[20:46:32] no idea
[20:46:46] I rolled back the only difference and traffic didn't show up
[20:47:05] on install1004?
[20:47:20] change on mr1, traffic on install
[20:47:25] k
[20:47:54] seems unrelated to my changes to the dhcp server I guess
[20:48:08] as there is no inbound
[20:48:18] I'm wondering if there is something funky where the core router try to intercept the dhcp packet from mr1
[20:52:11] I'm tcpdumping on all install
[20:52:19] the only one getting anything is drmrs's one
[20:52:28] different setup
[20:52:43] do we have any unsetup host there?
[20:53:00] where?
[20:53:04] drmrs
[20:53:24] I can see DHCP traffic exiting mr1-eqiad `10.65.0.1.bootps > 208.80.154.74.bootps: BOOTP/DHCP, Request [|bootp]`
[20:54:09] but it's not making it on the install server
[20:54:13] to*
[20:54:14] interesting
[20:54:28] but pings go through: mr1-eqiad.mgmt.eqiad.wmnet > install1004.wikimedia.org
[20:54:41] mr1-eqiad# run ping 208.80.154.74 source 10.65.0.1
[20:55:38] it's unicast traffic from mr1 to install, so it shouldn't get caught in any dhcp helpers or whatnot
[20:57:11] k
[21:01:20] volans: is install1004 missing an iptables rule permitting dhcp traffic from 10.65.0.1?
[21:02:50] I'm getting tired, maybe I'm just not seeing it
[21:03:20] mmmh interesting, that rings also a bell on a thing I looked with cathal last week, I'll double check thanks
[21:03:25] go offline, can wait tomorrow
[21:04:03] but iirc tcpdump catch traffic before iptables
[21:04:48] yeah it should see the incoming
[21:05:36] `show security policies from-zone mgmt to-zone production`
[21:05:38] no dhcp
[21:08:08] interesting
[21:09:29] yeah but no, that's not it, tried a temporary rule but no change
[21:09:53] and the source IP didn't change so it would have been weird if it worked
[21:10:15] I'm looking at rancid emails
[21:10:19] so far no smoking gun
[21:10:35] some rename from group DHCP-RELAY { to group dhcp_relay {
[21:11:55] the one from
[21:12:09] May 18, 2023, 6:10 PM has some related changes, not sure if might be related
[21:12:15] (CEST)
[21:12:23] I just hope it's not an option 82 kind of issue :)
[21:12:35] they're not using option 82
[21:12:42] I know :)
[21:12:49] there is a
[21:12:50] + overrides {
[21:12:51] + trust-option-82;
[21:12:51] + }
[21:13:07] yeah it should be noop here and I removed it to test
[21:13:11] no luck
[21:15:32] ah found it
[21:15:43] ?
[21:16:12] maybe...
[21:16:17] I see one occurrence of dhcp-relay
[21:16:21] all the others are dhcp_relay
[21:16:36] where?
[21:17:03] also grepping in homer/public but no seems ok
[21:17:15] dhcp-relay is the name of the 'forwarding-options'
[21:17:24] and dhcp_relay is the custom name for the group
[21:17:28] yeah
[21:17:46] could it be that a name with the underscore doesn't work?
[21:17:47] dhcp_relay
[21:17:53] before was DHCP-RELAY
[21:17:53] nah
[21:18:57] I'll check with cat.hal tomorrow if might be https://gerrit.wikimedia.org/r/c/operations/homer/public/+/908346
[21:20:32] go to bed
[21:22:14] SRE-tools, netops, Infrastructure-Foundations, SRE: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (Volans) With @ayounsi we've checked a bunch of things and so far we didn't find anything wrong. The traffic seems to exit from `mr1` but doesn't make it to the...
[21:22:17] thanks a lot for all the help and checks at this late hour
[21:22:22] I've updated the task
[21:22:53] thx! I'll have another look in the morning