[00:45:29] Mail, Infrastructure-Foundations: postfix: set smtpd_forbid_bare_newline to normalize - https://phabricator.wikimedia.org/T370011 (jhathaway) NEW
[00:45:58] Mail, Infrastructure-Foundations: postfix: set smtpd_forbid_bare_newline to normalize - https://phabricator.wikimedia.org/T370011#9979974 (jhathaway) p:Triage→Low
[00:46:44] Mail, Infrastructure-Foundations: Add additional RSA DKIM keys with 2048 bit sizes - https://phabricator.wikimedia.org/T365389#9979975 (jhathaway) p:Medium→Low
[00:47:13] Mail, Infrastructure-Foundations: Investigate options for outbound email redundancy for mediawiki on kubernetes - https://phabricator.wikimedia.org/T370006#9979976 (jhathaway) p:Triage→High
[00:47:26] Mail, Infrastructure-Foundations: rspamd: use central redis - https://phabricator.wikimedia.org/T370007#9979977 (jhathaway) p:Triage→Low
[00:47:40] Mail, Infrastructure-Foundations: Add postfix grok filters - https://phabricator.wikimedia.org/T370008#9979978 (jhathaway) p:Triage→Low
[00:54:18] Mail, Infrastructure-Foundations: postfix: add multi instance support - https://phabricator.wikimedia.org/T370012 (jhathaway) NEW
[00:54:27] Mail, Infrastructure-Foundations: postfix: add multi instance support - https://phabricator.wikimedia.org/T370012#9979989 (jhathaway) p:Triage→Low
[07:06:32] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9980172 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host idp1004.wikimedia.org with OS bookworm
[07:16:18] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9980175 (SLyngshede-WMF) a:SLyngshede-WMF
[07:36:59] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9980227 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp1004.wikimedia.org with OS bookworm completed: - idp1004 (**PASS*...
[08:00:14] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9980252 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host idp2004.wikimedia.org with OS bookworm
[08:04:56] hello folks!
[08:05:30] I am checking T355750 (since it completely bugs me that it doesn't work)
[08:05:30] T355750: CFSSL gencert "remote error: tls: certificate require" - https://phabricator.wikimedia.org/T355750
[08:06:20] from the cookbook, IIUC, the cert that should be renewed for lsw1-f8-eqiad would be for port 8080, is it right? Because I don't see any current valid cert on that port (via openssl s_client)
[08:09:41] even with Python I get "ssl.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1123)", that looks like the wrong port
[08:09:57] I am trying to follow the code to understand what it does basically
[08:33:30] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9980308 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp2004.wikimedia.org with OS bookworm completed: - idp2004 (**PASS*...
[08:37:49] For the error it would seem more like a wrong "type" of connection, TLS 1.0 vs. 1.2 or something
[08:41:30] elukey: I can probably help :)
[08:41:35] o/
[08:43:11] hello netops, as oncaller I'd like to know what's the status of cr3-ulsfo and how it affects oncallers :)
[08:43:24] volans: what's up with it?
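Editor's aside on the 08:06 to 08:37 exchange: a minimal sketch of how the "is this port speaking TLS at all?" question could be checked from Python. The function name `probe_tls` and its return strings are hypothetical and not part of the cookbook discussed above. Certificate verification is deliberately disabled because the probe only distinguishes "TLS handshake works" from "something else answered". The `WRONG_VERSION_NUMBER` reason is the one quoted at 08:09:41; other OpenSSL builds may report a different reason for the same situation, which falls through to the generic branch.

```python
import socket
import ssl


def probe_tls(host: str, port: int, timeout: float = 5.0) -> str:
    """Return a short description of what answered on host:port."""
    ctx = ssl.create_default_context()
    # The device may present a self-signed certificate, so verification is
    # switched off: this probe only asks whether the peer speaks TLS.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                return f"TLS OK ({tls.version()})"
    except ssl.SSLError as exc:
        # ssl.SSLError is a subclass of OSError, so it must be caught first.
        if exc.reason == "WRONG_VERSION_NUMBER":
            return "not TLS (peer answered in plaintext or another protocol)"
        return f"TLS error: {exc.reason}"
    except OSError as exc:
        return f"connection failed: {exc}"
```

Calling `probe_tls("lsw1-f8-eqiad.example", 8080)` (hostname here is a placeholder) would return one of the strings above; a "not TLS" result supports the "wrong port" reading rather than a protocol-version mismatch.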
[08:44:06] Mail, Infrastructure-Foundations, Epic: Email improvements round two (FY 2024/25) - https://phabricator.wikimedia.org/T370005#9980344 (Aklapper)
[08:45:02] volans: from what I can see it's doing well. It lost one of its power feeds over the weekend but that was brief
[08:45:45] ah just one feed? from the backlog in _security sounded more serious
[08:49:28] volans: ah seems like there was also a brief alert for high CPU usage, but seems like it all went back to normal on its own
[08:50:47] Can someone that knows about verp_config merge or be there to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/1053650 please?
[08:50:53] elukey: so lsw1-f8-eqiad is the device I installed OSS Sonic on, so it has the default self-signed/bogus certificate
[08:51:31] elukey: so the check is failing as expected, and should trigger a certificate replacement
[08:52:43] elukey: but to test the cookbook, you can use lsw1-e8-eqiad, which is Dell Sonic and I wrote the cookbook for that version of Sonic (not 100% sure how it behaves with the OSS version)
[08:53:40] claime: not sure there is any expert on that TZ :(
[08:54:44] XioNoX: not a problem, I'll catch jessie at shift change then
[09:11:44] XioNoX: ack thanks, but I am also interested to know what happens with lsw1-f8, is it ok if I run the cookbook for it as well?
[09:11:59] or even lsw1-d3-codfw
[09:12:55] elukey: yep, but for f8 it might just not do the right thing, but it's a test device so no problem
[09:44:56] ok after some tests it seems to me that httpd on pki1001 is the cause of the remote error, and we don't see it in httpd's logs since mod_ssl is configured with "warn"
[10:00:48] filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054289 to get more info
[10:01:44] elukey: +1
[10:08:24] it might be a little verbose, I'll keep the log size monitored (it should rotate etc.)
[10:08:27] anyway
[10:08:30] [Mon Jul 15 10:07:38.418738 2024] [ssl:info] [pid 633442:tid 633475] SSL Library Error: error:1417C0C7:SSL routines:tls_process_client_certificate:peer did not return a certificate -- No CAs known to server for verification?
[10:08:40] this is cfssl on cumin1002 -> pki1001
[10:09:29] surprising that this error only shows at the "info" level
[10:10:17] yeah I agree, but I think there is a distinction between client errors vs httpd/ssl-module errors
[10:10:25] so a client failing is considered "info"
[10:10:29] I think at least
[10:10:40] ahh ok
[11:02:03] CFSSL-PKI, Infrastructure-Foundations, Patch-For-Review: CFSSL gencert "remote error: tls: certificate require" - https://phabricator.wikimedia.org/T355750#9980838 (elukey) After some digging, it seems to me that the issue is httpd on pki1001: it rejects the client authentication from cumin1002. I ad...
[11:29:18] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:30:36] FIRING: [32x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:31:29] FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:33:49] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on netbox1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[11:44:18] FIRING: [32x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:49:18] FIRING: [32x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:20:36] FIRING: [32x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:24:18] FIRING: [32x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:53:49] FIRING: [3x] PuppetConstantChange: Puppet performing a change on every puppet run on netbox1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[12:54:18] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:55:11] that's just the downtime on the WIP Netbox host that expired
[12:55:49] FIRING: PuppetFailure: Puppet has failed on netboxdb2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:57:52] netbox, Infrastructure-Foundations: Netbox : sync src/ submodule - https://phabricator.wikimedia.org/T369690#9981120 (ayounsi) Thanks, I'd like to avoid having to use an extra repository if possible, to reduce the complexity as much as possible. > make the src/ directory a regular directory and not a gi...
[12:59:18] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:14:18] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:19:18] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:20:36] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:36:04] netops, Infrastructure-Foundations: cr3-ulsfo flapping on July 14 - https://phabricator.wikimedia.org/T370048 (ssingh) NEW
[13:36:05] netops, Infrastructure-Foundations: cr3-ulsfo flapping on July 14 - https://phabricator.wikimedia.org/T370048#9981303 (ssingh) p:Triage→Medium
[13:36:14] netops, Infrastructure-Foundations, SRE: cr3-ulsfo flapping on July 14 - https://phabricator.wikimedia.org/T370048#9981304 (ssingh)
[13:39:27] netops, Infrastructure-Foundations, SRE: cr3-ulsfo flapping on July 14 - https://phabricator.wikimedia.org/T370048#9981315 (ssingh)
[13:49:18] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:50:57] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9981347 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cc358df6-b5c1-490c-aad1-6454f09f0fc8) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their s...
[13:52:08] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9981348 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=60bc5c40-0301-4c29-907d-b4e0eb5e3cb3) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their s...
[13:57:44] netops, Infrastructure-Foundations, SRE: cr3-ulsfo flapping on July 14 - https://phabricator.wikimedia.org/T370048#9981361 (ayounsi) If I was paranoid, I'd say it's possibly a bug being exploited that can cause a DDoS and we should prioritize T364092. We have a couple runbooks that could fit the sit...
[13:58:36] netops, Infrastructure-Foundations, SRE: cr3-ulsfo flapping on July 14 - https://phabricator.wikimedia.org/T370048#9981370 (ssingh) >>! In T370048#9981361, @ayounsi wrote: > If I was paranoid, I'd say it's possibly a bug being exploited that can cause a DDoS and we should prioritize T364092. > > We...
[14:10:54] netbox, Infrastructure-Foundations: Netbox : sync src/ submodule - https://phabricator.wikimedia.org/T369690#9981394 (Volans) >>! In T369690#9981120, @ayounsi wrote: > Thanks, I'd like to avoid having to use an extra repository if possible, to reduce the complexity as much as possible. > >> make the src...
[14:30:18] Mail, Infrastructure-Foundations, Epic: Email improvements round two (FY 2024/25) - https://phabricator.wikimedia.org/T370005#9981470 (jhathaway) p:Triage→Medium
[14:33:38] Packaging, Infrastructure-Foundations: Package ipxe-qemu - https://phabricator.wikimedia.org/T369136#9981488 (ayounsi) p:Triage→Low
[14:34:10] netbox, Infrastructure-Foundations: Netbox : sync src/ submodule - https://phabricator.wikimedia.org/T369690#9981490 (ayounsi) p:Triage→Medium a:ayounsi
[15:31:29] FIRING: SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:32:20] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9981911 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=24d499e4-d334-4d4e-8fcd-fc9f2feed844) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their s...
[15:42:42] netops, Infrastructure-Foundations, SRE, Traffic: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068 (ssingh) NEW
[18:05:55] Packaging, Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2024/2025-Q1): upgrade prometheus-ipmi-exporter to 1.8.0 - https://phabricator.wikimedia.org/T368088#9982908 (herron)
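Editor's aside on the 09:44 to 10:10 exchange about why the client-certificate failure was invisible with mod_ssl at "warn": a small sketch, under stated assumptions, that parses the httpd error-log line quoted at 10:08:30 and checks whether a message of that severity would pass a given log level. The names `parse_error_line` and `is_emitted` are hypothetical, the regular expression only targets the line shape seen in this log, and the trace1 to trace8 levels are left out.

```python
import re

# Apache httpd severity names, most severe first (trace levels omitted).
LEVELS = ["emerg", "alert", "crit", "error", "warn", "notice", "info", "debug"]

LINE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] "
    r"\[(?P<module>[^:\]]*):(?P<level>[^\]]+)\] "
    r"\[pid (?P<pid>\d+)(?::tid (?P<tid>\d+))?\] "
    r"(?P<message>.*)$"
)


def parse_error_line(line):
    """Split one error-log line into its fields, or return None."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def is_emitted(message_level, configured_level):
    """True if message_level is at least as severe as configured_level.

    Simplification: httpd always writes notice-level messages to a regular
    log file regardless of the setting; that special case is ignored here.
    """
    return LEVELS.index(message_level) <= LEVELS.index(configured_level)


# The line pasted in the channel at 10:08:30.
SAMPLE = (
    "[Mon Jul 15 10:07:38.418738 2024] [ssl:info] [pid 633442:tid 633475] "
    "SSL Library Error: error:1417C0C7:SSL routines:"
    "tls_process_client_certificate:peer did not return a certificate "
    "-- No CAs known to server for verification?"
)

entry = parse_error_line(SAMPLE)
print(entry["module"], entry["level"])      # ssl info
print(is_emitted(entry["level"], "warn"))   # False
print(is_emitted(entry["level"], "info"))   # True
```

This matches what was observed in the channel: the message is tagged `ssl:info`, so it stays hidden while the module logs at "warn" and appears once the level is raised.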