[03:14:18] (SystemdUnitFailed) firing: httpbb_kubernetes_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:14:18] (SystemdUnitFailed) resolved: httpbb_kubernetes_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:19:52] netops, Infrastructure-Foundations, SRE, Traffic: Adjust routing policy to increase SSH session speed from East Asia to toolforge - https://phabricator.wikimedia.org/T334530 (ayounsi) Thanks for the report. This is because we advertise our "customer" prefixes from all our POPs to improve the use...
[07:25:30] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (ayounsi) Great timing, a new DB server was running at 10M (see T334446). This however wasn't seen in `ethtool` but only in...
[08:22:51] (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:27:51] (SystemdUnitFailed) firing: (6) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:32:51] (SystemdUnitFailed) firing: (6) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:34:18] (SystemdUnitFailed) resolved: (6) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:43:55] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (Volans) >>! In T333007#8774632, @ayounsi wrote: > This however wasn't seen in `ethtool` but only in logs and on the switch...
[10:35:57] netops, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Marostegui) @ayounsi we are placing new DB hosts in production, can you run the same query you ran to gather the affected DBs just in case we have new...
[11:59:49] Puppet, Infrastructure-Foundations: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (jbond) p:Triage→Medium
[12:16:28] Puppet, Infrastructure-Foundations, Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (MoritzMuehlenhoff)
[12:19:26] Puppet, Infrastructure-Foundations, Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (MoritzMuehlenhoff)
[12:23:15] Puppet, Infrastructure-Foundations, Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (MoritzMuehlenhoff)
[12:31:26] netops, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ayounsi)
[12:31:36] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[12:34:51] netops, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ayounsi)
[12:35:44] netops, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ayounsi) >>! In T333377#8775126, @Marostegui wrote: > @ayounsi we are placing new DB hosts in production, can you run the same query you ran to gather...
[12:38:30] netops, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Marostegui) Thank you, nothing changes from our DB side!
[12:39:16] Puppet, Infrastructure-Foundations, Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (MoritzMuehlenhoff)
[12:40:35] Puppet, Infrastructure-Foundations, Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (SLyngshede-WMF)
[12:41:13] Puppet, Infrastructure-Foundations, Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (SLyngshede-WMF) idm servers have the module installed, but not enabled.
[12:50:15] Puppet, Infrastructure-Foundations, Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (MoritzMuehlenhoff)
[12:52:47] Puppet, Infrastructure-Foundations, Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (jbond) >>! In T334577#8775710, @SLyngshede-WMF wrote: > idm servers have the module installed, but not enabled. the apache2 package installs the file so t...
[13:04:26] Puppet, Infrastructure-Foundations, Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (MoritzMuehlenhoff)
[13:08:51] Puppet, Infrastructure-Foundations, Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (MoritzMuehlenhoff) So it turns out none of our Apache installs which had it running actually needs it; these 11 cases must all have been caused by random d...
[13:17:37] netops, Infrastructure-Foundations: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (ayounsi) p:Triage→Low
[13:26:03] netops, Infrastructure-Foundations: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (ayounsi)
[13:31:38] netops, Infrastructure-Foundations, SRE: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (jbond) the logic we use in puppet is mostly the same as [[ https://phabricator.wikimedia.org/P46511 | this script ]] which would be a good template to use for a cookbook
[13:39:40] Puppet, Infrastructure-Foundations, Patch-For-Review: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (jbond) awesome thanks @MoritzMuehlenhoff
[14:30:34] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281 (cmooney) FWIW I've submitted a new patchset with a different format for defining the routes in YAML (at Arzhel's suggestion). ` static...
[14:52:00] SRE-tools, Infrastructure-Foundations, cloud-services-team (FY2022/2023-Q4): WMCS Cookbook Automation FY2022-23 Q2 tracking task - https://phabricator.wikimedia.org/T319401 (fnegri) p:Medium→High
[14:52:17] SRE-tools, Infrastructure-Foundations, cloud-services-team (FY2022/2023-Q4): WMCS Cookbook Automation FY2022-23 Q2 tracking task - https://phabricator.wikimedia.org/T319401 (fnegri)
[15:17:55] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (cmooney) >>! In T333007#8774632, @ayounsi wrote: > However I guess it's also possible that the opposite happens: the issue...
[15:30:45] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (cmooney) @jbond suggested this may be better handled by exposing the data to prometheus, and using alertmanager to check....
[16:18:58] netops, Infrastructure-Foundations, SRE: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (cmooney) I'd consider client auth a "stretch goal" for now, nice to have but not sure we want to have all that extra complexity. In terms of an intermediate CA just for network...
[16:24:20] SRE-tools, Infrastructure-Foundations, cloud-services-team (FY2022/2023-Q4): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (fnegri)
[18:09:54] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (jbond) > suggested this may be better handled by exposing the data to prometheus, and using alertmanager to check. Copied c...
[18:23:27] netops, Infrastructure-Foundations, SRE: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (jbond) >I would worry about how we deal with the security / key management aspects of it. Just to expand on this a bit the reason why there may be a need for an additional inte...
[20:50:40] Mail, Infrastructure-Foundations, fundraising-tech-ops: Investigate in-house DMARC analysis tool options - https://phabricator.wikimedia.org/T317443 (Jgreen) Open→Declined Closing this task because it is no longer needed/relevant, since SRE is looking at a solution in T330944.
[20:50:46] Mail, fundraising-tech-ops: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (Jgreen)