[00:08:51] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1181008 [00:08:51] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1181008 (owner: 10TrainBranchBot) [00:09:27] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.782 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:09:27] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 7.977 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:19:29] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:19:29] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:22:21] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 2.827 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:22:21] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 2.986 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:31:59] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1181008 (owner: 10TrainBranchBot) [00:32:29] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:32:29] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:35:25] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.485 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring 
[00:35:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 5.573 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:38:29] !log reprepro -C component/envoy-future include bullseye-wikimedia envoyproxy_1.26.8-1_amd64.changes # T402584 🤦 [00:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:34] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [00:47:38] (03PS1) 10RLazarus: envoy-future: Update to 1.26.8 for real [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1181009 (https://phabricator.wikimedia.org/T402584) [00:48:47] (03CR) 10Scott French: [C:03+1] envoy-future: Update to 1.26.8 for real [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1181009 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [00:49:55] (03CR) 10RLazarus: [V:03+2] "Verified locally:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1181009 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [00:50:26] (03CR) 10RLazarus: [V:03+2 C:03+2] envoy-future: Update to 1.26.8 for real [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1181009 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [00:53:42] 10ops-codfw, 06SRE, 06DC-Ops: Add scs-e3-codfw to monitoring - https://phabricator.wikimedia.org/T401310#11109330 (10Papaul) 05Open→03Resolved I add the rancid user to scs-e3-codfw. 
Thanks [02:19:33] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cirrussearch2089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:37:31] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:37:31] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:42:31] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 9.058 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:42:31] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 9.211 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:48:31] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:49:25] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 3.760 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:52:17] 06SRE, 10envoy, 06serviceops, 06Traffic: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11109389 (10RLazarus) For posterity -- I fatfingered the `reprepro include` the first time and included the _source.changes without the _amd64.changes, so for a couple hours we had... 
[02:57:55] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:00:33] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:00:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:04:27] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:04:27] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 5.168 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:04:34] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:04:34] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:11:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:18:51] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - 
https://phabricator.wikimedia.org/T402581#11109395 (10phaultfinder) [03:21:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:23:51] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11109399 (10phaultfinder) [03:29:34] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [03:44:10] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-test-coord1002.eqiad.wmnet with OS bookworm [04:14:53] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:37:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:37:33] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:41:49] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:47:31] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 7.697 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:47:31] RECOVERY - mailman list info on lists1004 is OK: HTTP 
OK: HTTP/1.1 200 OK - 9235 bytes in 7.796 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:53:33] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:53:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:57:27] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 3.843 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:57:27] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 3.959 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:06:33] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:06:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:08:38] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:13:33] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 8.920 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:13:33] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 9.106 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:16:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:16:33] PROBLEM - mailman archives on 
lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:21:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [05:23:33] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 8.871 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:23:33] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.882 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:35:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:40:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:57:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250822T0600) [06:02:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:04:14] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180856 (https://phabricator.wikimedia.org/T402511) (owner: 10Ayounsi) [06:09:03] (03CR) 10Ayounsi: [V:03+2 C:03+2] "Real password added according to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180856/comments/6d1cac5c_3c56b286" [labs/private] - 10https://gerrit.wikimedia.org/r/1180855 (https://phabricator.wikimedia.org/T402511) (owner: 10Ayounsi) [06:10:03] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180856 (https://phabricator.wikimedia.org/T402511) (owner: 10Ayounsi) [06:15:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:15:33] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:17:29] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.952 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:17:29] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 4.959 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:18:15] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180856 (https://phabricator.wikimedia.org/T402511) (owner: 10Ayounsi) [06:18:38] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:18:52] 
06SRE, 06Traffic: Add pageview information to turnilo's webrequest_sampled_live - https://phabricator.wikimedia.org/T402612 (10Joe) 03NEW [06:19:34] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cirrussearch2089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:23:38] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:42:38] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1180952 (https://phabricator.wikimedia.org/T402022) (owner: 10Dzahn) [06:53:32] (03PS1) 10Muehlenhoff: Add new install servers [puppet] - 10https://gerrit.wikimedia.org/r/1181029 (https://phabricator.wikimedia.org/T396487) [06:55:59] (03PS3) 10Ayounsi: Homer: add password to config file [puppet] - 10https://gerrit.wikimedia.org/r/1180856 (https://phabricator.wikimedia.org/T402511) [06:57:55] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:59:18] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180856 (https://phabricator.wikimedia.org/T402511) (owner: 10Ayounsi) [07:00:03] (03CR) 10Ayounsi: [C:03+1] Add new Nokia switches to ibgp pod e/f in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1180958 (https://phabricator.wikimedia.org/T402590) (owner: 10Cathal Mooney) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250822T0700) [07:02:01] (03CR) 10Ayounsi: [C:03+2] Homer: add password to config file [puppet] - 10https://gerrit.wikimedia.org/r/1180856 (https://phabricator.wikimedia.org/T402511) (owner: 10Ayounsi) [07:02:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T399249)', diff saved to https://phabricator.wikimedia.org/P81685 and previous config saved to /var/cache/conftool/dbconfig/20250822-070257-fceratto.json [07:03:02] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [07:04:34] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:04:34] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:05:33] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:05:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:05:37] (03CR) 10Ayounsi: [C:03+1] Add new install servers [puppet] - 10https://gerrit.wikimedia.org/r/1181029 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [07:06:04] (03CR) 10Muehlenhoff: [C:03+2] Add new install servers [puppet] - 10https://gerrit.wikimedia.org/r/1181029 (https://phabricator.wikimedia.org/T396487) (owner: 
10Muehlenhoff) [07:06:23] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:06:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:10:29] 06SRE, 06Infrastructure-Foundations: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#11109532 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Trixie had the initial stable release on Aug 9 and the installer and base system works fin... [07:13:49] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2151.codfw.wmnet with reason: Maintenance [07:13:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T401906)', diff saved to https://phabricator.wikimedia.org/P81686 and previous config saved to /var/cache/conftool/dbconfig/20250822-071356-fceratto.json [07:14:01] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [07:18:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P81687 and previous config saved to /var/cache/conftool/dbconfig/20250822-071804-fceratto.json [07:23:56] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11109582 (10phaultfinder) [07:24:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install1005.wikimedia.org [07:24:03] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [07:28:52] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11109615 (10phaultfinder) [07:29:34] FIRING: 
NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:29:49] jmm@cumin2002 makevm (PID 1008855) is awaiting input [07:30:43] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis amwikimedia, cnwikimedia, donatewiki, gewikimedia, grwikimedia, hiwikimedia, idwikimedia, maiwikimedia, ngwikimedia, nostalgiawiki, punjabiwikimedia, romdwikimedia, rswikimedia, votewiki, wbwikimedia in section s5 [07:31:48] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install1005.wikimedia.org - jmm@cumin2002" [07:33:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P81688 and previous config saved to /var/cache/conftool/dbconfig/20250822-073312-fceratto.json [07:34:53] jmm@cumin2002 makevm (PID 1008855) is awaiting input [07:35:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install1005.wikimedia.org - jmm@cumin2002" [07:35:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:35:19] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install1005.wikimedia.org on all recursors [07:35:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install1005.wikimedia.org on all recursors [07:35:42] (03CR) 10Ayounsi: [C:03+1] Add new IBGP cluster in eqiad with pod for row C/D Nokia switches [homer/public] - 10https://gerrit.wikimedia.org/r/1180953 (https://phabricator.wikimedia.org/T402588) (owner: 10Cathal Mooney) 
[07:35:56] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install1005.wikimedia.org - jmm@cumin2002" [07:36:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install1005.wikimedia.org - jmm@cumin2002" [07:36:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install1005.wikimedia.org with OS bookworm [07:36:30] 06SRE, 06Infrastructure-Foundations: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11109664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host install1005.wikimedia.org with OS bookworm [07:37:23] (03PS1) 10Vgutierrez: benthos: Use kafka_offset to sample webrequest_live messages [puppet] - 10https://gerrit.wikimedia.org/r/1181033 (https://phabricator.wikimedia.org/T401383) [07:37:49] (03CR) 10CI reject: [V:04-1] benthos: Use kafka_offset to sample webrequest_live messages [puppet] - 10https://gerrit.wikimedia.org/r/1181033 (https://phabricator.wikimedia.org/T401383) (owner: 10Vgutierrez) [07:39:34] (03PS2) 10Vgutierrez: benthos: Use kafka_offset to sample webrequest_live messages [puppet] - 10https://gerrit.wikimedia.org/r/1181033 (https://phabricator.wikimedia.org/T401383) [07:40:33] (03PS2) 10DCausse: cirrus: drop rc0 streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114956 (https://phabricator.wikimedia.org/T375821) [07:40:38] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis amwikimedia, cnwikimedia, donatewiki, gewikimedia, grwikimedia, hiwikimedia, idwikimedia, maiwikimedia, ngwikimedia, nostalgiawiki, punjabiwikimedia, romdwikimedia, rswikimedia, votewiki, wbwikimedia in section s5 [07:40:46] (03CR) 10DCausse: cirrus: drop rc0 streams [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1114956 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [07:40:51] (03CR) 10DCausse: cirrus: drop rc0 streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114956 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [07:48:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T399249)', diff saved to https://phabricator.wikimedia.org/P81689 and previous config saved to /var/cache/conftool/dbconfig/20250822-074819-fceratto.json [07:48:25] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [07:48:35] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1243.eqiad.wmnet with reason: Maintenance [07:48:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T399249)', diff saved to https://phabricator.wikimedia.org/P81690 and previous config saved to /var/cache/conftool/dbconfig/20250822-074842-fceratto.json [07:53:31] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch2113 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f9c0e76f1c0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [07:53:31] dia.org/wiki/Search%23Administration [07:53:41] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch2102 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f132208a1c0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [07:53:41] 
dia.org/wiki/Search%23Administration [07:56:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [07:56:09] (03CR) 10Vgutierrez: [C:03+2] admin: Add dimakoushha to analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/1180851 (https://phabricator.wikimedia.org/T402384) (owner: 10Vgutierrez) [07:57:21] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics for Dima_Koushha_WMDE - https://phabricator.wikimedia.org/T402384#11109760 (10Vgutierrez) 05In progress→03Resolved a:03Vgutierrez change has been merged, please allow 30 minutes to let puppet apply the changes on the r... [07:58:30] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on install1005.wikimedia.org with reason: host reimage [07:58:38] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cirrussearch2089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:59:08] (03CR) 10Vgutierrez: [C:03+2] admin: add user aude to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1180952 (https://phabricator.wikimedia.org/T402022) (owner: 10Dzahn) [07:59:10] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis amwikimedia, cnwikimedia, donatewiki, gewikimedia, grwikimedia, hiwikimedia, idwikimedia, maiwikimedia, ngwikimedia, nostalgiawiki, punjabiwikimedia, romdwikimedia, rswikimedia, votewiki, wbwikimedia in section s5 [07:59:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2229.codfw.wmnet with reason: 
Maintenance [08:00:11] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Superset / LDAP access for aude - https://phabricator.wikimedia.org/T402022#11109770 (10Vgutierrez) 05In progress→03Resolved a:03Dzahn looks good, I took care of merging it, thanks for the patch @Dzahn [08:01:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [08:02:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install1005.wikimedia.org with reason: host reimage [08:03:38] fceratto@cumin1002 sanitize-wiki (PID 2800516) is awaiting input [08:09:15] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2148.codfw.wmnet with reason: Maintenance [08:09:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T401906)', diff saved to https://phabricator.wikimedia.org/P81691 and previous config saved to /var/cache/conftool/dbconfig/20250822-080922-fceratto.json [08:09:27] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [08:09:34] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cirrussearch2089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:13:38] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cirrussearch2089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:13:42] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch2102 is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1499, active_shards: 4315, relocating_shards: 0, initializing_shards: 11, unassigned_shards: 10, delayed_unassigned_shards: 0, number_of_pending_ [08:13:42] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 7136, active_shards_percent_as_number: 99.51568265682657 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:16:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [08:16:30] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch2113 is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1506, active_shards: 4324, relocating_shards: 0, initializing_shards: 10, unassigned_shards: 2, delayed_unassigned_shards: 0, number_of_pendin [08:16:30] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.72324723247232 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:16:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install1005.wikimedia.org with OS bookworm [08:16:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install1005.wikimedia.org [08:16:56] 06SRE, 06Infrastructure-Foundations: Move install servers to 
Bookworm - https://phabricator.wikimedia.org/T396487#11109802 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host install1005.wikimedia.org with OS bookworm completed: - install1005 (**PASS**) - Removed fro... [08:18:38] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cirrussearch2089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:22:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install2005.wikimedia.org [08:22:56] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:26:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [08:28:43] jmm@cumin2002 makevm (PID 1039902) is awaiting input [08:30:04] (03CR) 10Fabfur: [C:03+1] benthos: Use kafka_offset to sample webrequest_live messages [puppet] - 10https://gerrit.wikimedia.org/r/1181033 (https://phabricator.wikimedia.org/T401383) (owner: 10Vgutierrez) [08:35:43] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install2005.wikimedia.org - jmm@cumin2002" [08:36:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install2005.wikimedia.org - jmm@cumin2002" [08:36:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:36:57] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install2005.wikimedia.org on all recursors [08:37:01] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.dns.wipe-cache (exit_code=0) install2005.wikimedia.org on all recursors [08:37:32] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install2005.wikimedia.org - jmm@cumin2002" [08:37:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install2005.wikimedia.org - jmm@cumin2002" [08:38:38] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cirrussearch2089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:38:50] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2079.codfw.wmnet with OS bullseye [08:39:04] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11109836 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2079.codfw.wmnet with OS bullseye [08:39:20] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2079 [08:39:34] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cirrussearch2089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:39:57] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [08:40:38] jmm@cumin2002 makevm (PID 1039902) is awaiting input [08:45:53] mvernon@cumin2002 reimage (PID 1048854) is awaiting input [08:45:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install2005.wikimedia.org with 
OS bookworm [08:46:06] (03PS3) 10Vgutierrez: benthos: Use kafka_offset to sample webrequest_live messages [puppet] - 10https://gerrit.wikimedia.org/r/1181033 (https://phabricator.wikimedia.org/T401383) [08:46:07] 06SRE, 06Infrastructure-Foundations: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11109839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host install2005.wikimedia.org with OS bookworm [08:46:33] (03CR) 10CI reject: [V:04-1] benthos: Use kafka_offset to sample webrequest_live messages [puppet] - 10https://gerrit.wikimedia.org/r/1181033 (https://phabricator.wikimedia.org/T401383) (owner: 10Vgutierrez) [08:47:24] (03PS4) 10Vgutierrez: benthos: Use kafka_offset to sample webrequest_live messages [puppet] - 10https://gerrit.wikimedia.org/r/1181033 (https://phabricator.wikimedia.org/T401383) [08:50:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [08:50:30] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2207.codfw.wmnet with reason: Maintenance [08:51:59] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2079 - mvernon@cumin2002" [08:54:14] (03PS5) 10Vgutierrez: benthos: Use kafka_offset to sample webrequest_live messages [puppet] - 10https://gerrit.wikimedia.org/r/1181033 (https://phabricator.wikimedia.org/T401383) [08:55:04] mvernon@cumin2002 reimage (PID 1048854) is awaiting input [08:56:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [08:56:47] (03CR) 10Fabfur: [C:03+1] benthos: Use kafka_offset to sample webrequest_live messages [puppet] - 10https://gerrit.wikimedia.org/r/1181033 (https://phabricator.wikimedia.org/T401383) (owner: 10Vgutierrez) [08:57:34] PROBLEM - mailman list info on 
lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:57:34] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:58:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2079 - mvernon@cumin2002" [08:58:06] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:58:06] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2079.codfw.wmnet 244.48.192.10.in-addr.arpa 4.4.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:58:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2079.codfw.wmnet 244.48.192.10.in-addr.arpa 4.4.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:58:10] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2079 [08:58:24] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:58:24] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.137 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:00:46] (03CR) 10Ayounsi: [C:03+1] Nokia: Add examples for Nokia password hashes commonly used (031 comment) [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [09:01:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2079 [09:01:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2079 [09:03:00] 
!log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on install2005.wikimedia.org with reason: host reimage [09:03:03] (03PS1) 10Slyngshede: P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) [09:05:08] (03CR) 10CI reject: [V:04-1] P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:07:43] (03PS2) 10Slyngshede: P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) [09:08:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install2005.wikimedia.org with reason: host reimage [09:09:09] (03CR) 10Muehlenhoff: P:puppetserver::volatile generate datacenter database (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:09:48] (03CR) 10CI reject: [V:04-1] P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:10:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:10:33] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:11:51] (03PS3) 10Slyngshede: P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) [09:13:25] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 2.428 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:13:25] RECOVERY - 
mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 2.493 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:13:57] (03CR) 10CI reject: [V:04-1] P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:17:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [09:19:15] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2079.codfw.wmnet with reason: host reimage [09:19:37] (03CR) 10Ayounsi: [C:03+1] Nokia: Add initial Python files for nokia switch system config (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [09:20:31] (03PS4) 10Slyngshede: P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) [09:23:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [09:23:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2079.codfw.wmnet with reason: host reimage [09:24:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install2005.wikimedia.org with OS bookworm [09:24:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install2005.wikimedia.org [09:24:20] 06SRE, 06Infrastructure-Foundations: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11109887 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host install2005.wikimedia.org with OS bookworm completed: - install2005 (**PASS**) - Removed fro... 
[09:24:38] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6699/co" [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:25:40] (03CR) 10Cathal Mooney: "Thanks for the review! Replies in-line." [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [09:27:31] (03CR) 10Brouberol: [C:03+1] "Not strictly necessary as these are test files, but at least the hostname won't show up on codesearch anymore." [alerts] - 10https://gerrit.wikimedia.org/r/1180526 (https://phabricator.wikimedia.org/T401116) (owner: 10Stevemunene) [09:28:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install4003.wikimedia.org [09:28:21] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:28:48] (03CR) 10Brouberol: Remove mention of an-druid100[1-2] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180529 (https://phabricator.wikimedia.org/T401116) (owner: 10Stevemunene) [09:32:30] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install4003.wikimedia.org - jmm@cumin2002" [09:32:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install4003.wikimedia.org - jmm@cumin2002" [09:32:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:32:37] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install4003.wikimedia.org on all recursors [09:32:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install4003.wikimedia.org on all recursors [09:33:03] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"Triggered by cookbooks.sre.ganeti.makevm: created new VM install4003.wikimedia.org - jmm@cumin2002" [09:33:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install4003.wikimedia.org - jmm@cumin2002" [09:33:52] (03PS5) 10Slyngshede: P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) [09:33:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install4003.wikimedia.org with OS bookworm [09:34:11] 06SRE, 06Infrastructure-Foundations: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11109929 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host install4003.wikimedia.org with OS bookworm [09:34:23] (03CR) 10Ayounsi: [C:03+1] Nokia: Add initial Python files for nokia switch system config (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [09:37:15] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6700/co" [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:41:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2079.codfw.wmnet with OS bullseye [09:41:17] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11109964 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2079.codfw.wmnet with OS bullseye compl... 
[09:44:25] (03PS1) 10Muehlenhoff: Apply installserver role on install4003 [puppet] - 10https://gerrit.wikimedia.org/r/1181094 (https://phabricator.wikimedia.org/T396487) [09:44:27] (03PS1) 10Muehlenhoff: installserver: Failover DHCP server in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1181095 (https://phabricator.wikimedia.org/T396487) [09:44:42] (03PS1) 10Muehlenhoff: Failover webproxy in ulsfo to new node [dns] - 10https://gerrit.wikimedia.org/r/1181096 (https://phabricator.wikimedia.org/T396487) [09:51:24] (03CR) 10Ayounsi: [C:03+1] Nokia: module for interface configuration (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [09:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:53:22] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on install4003.wikimedia.org with reason: host reimage [09:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install4003.wikimedia.org with reason: host reimage [09:58:38] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cirrussearch2089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:02:25] (03CR) 10Cathal Mooney: Nokia: module for interface configuration (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [10:03:26] (03PS3) 10Muehlenhoff: tlsproxy::localssl: Remove support for HTTP2 [puppet] - 10https://gerrit.wikimedia.org/r/1037029 [10:07:46] (03PS4) 10Muehlenhoff: tlsproxy::localssl: Remove support for HTTP2 [puppet] - 10https://gerrit.wikimedia.org/r/1037029 [10:08:20] (03Abandoned) 10Muehlenhoff: Move cloudcephosd2001-dev to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1017248 (https://phabricator.wikimedia.org/T361913) (owner: 10Muehlenhoff) [10:09:07] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11110057 (10MatthewVernon) [10:10:12] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11110064 (10MatthewVernon) [10:10:42] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11110070 (10MatthewVernon) @Jhancock.wm ms-be2081 and ms-be2082 are now ready for you to do the controller swap, please :) [10:10:46] (03CR) 10Vgutierrez: [C:03+2] benthos: Use kafka_offset to sample webrequest_live messages [puppet] - 10https://gerrit.wikimedia.org/r/1181033 (https://phabricator.wikimedia.org/T401383) (owner: 10Vgutierrez) [10:13:05] (03CR) 10Ayounsi: "(not a full review yet)" [homer/public] - 10https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [10:13:30] !log switch webrequest_sampled sampling from sequence number to kafka offset - T401383 [10:13:34] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:34] T401383: Reduce noise from duplicate sequence-gap alerts on HaProxy-webrequests - https://phabricator.wikimedia.org/T401383 [10:14:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install4003.wikimedia.org with OS bookworm [10:14:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install4003.wikimedia.org [10:14:18] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11110089 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host install4003.wikimedia.org with OS bookworm completed: - install4003 (**P... [10:21:37] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:21:37] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:22:33] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.299 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:22:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 7.480 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:24:33] (03PS1) 10Máté Szabó: hcaptcha: Fix observed issues in proxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1181099 [10:24:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install5003.wikimedia.org [10:24:39] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:28:08] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install5003.wikimedia.org - jmm@cumin2002" [10:31:14] 
jmm@cumin2002 makevm (PID 1106238) is awaiting input [10:31:37] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:31:37] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:33:08] (03PS1) 10Majavah: hieradata: Add bastion-eqiad1-5/6 [puppet] - 10https://gerrit.wikimedia.org/r/1181100 (https://phabricator.wikimedia.org/T392689) [10:35:10] (03PS1) 10Vgutierrez: benthos: Verify TLS cert of kafka brokers on webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/1181101 (https://phabricator.wikimedia.org/T291905) [10:37:15] (03CR) 10CI reject: [V:04-1] benthos: Verify TLS cert of kafka brokers on webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/1181101 (https://phabricator.wikimedia.org/T291905) (owner: 10Vgutierrez) [10:37:39] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 9.933 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:37:39] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 9.975 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:38:21] (03CR) 10Hnowlan: [C:03+1] "lgtm, one nit" [puppet] - 10https://gerrit.wikimedia.org/r/1181099 (owner: 10Máté Szabó) [10:39:40] (03PS2) 10Vgutierrez: benthos: Verify TLS cert of kafka brokers on webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/1181101 (https://phabricator.wikimedia.org/T291905) [10:45:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install5003.wikimedia.org - jmm@cumin2002" [10:45:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:45:39] !log jmm@cumin2002 START - Cookbook 
sre.dns.wipe-cache install5003.wikimedia.org on all recursors [10:45:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install5003.wikimedia.org on all recursors [10:46:08] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install5003.wikimedia.org - jmm@cumin2002" [10:46:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install5003.wikimedia.org - jmm@cumin2002" [10:48:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install5003.wikimedia.org with OS bookworm [10:49:05] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11110222 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host install5003.wikimedia.org with OS bookworm [10:51:49] (03CR) 10Slyngshede: [V:03+1] P:puppetserver::volatile generate datacenter database (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [10:57:22] (03CR) 10AikoChou: "Done load testing on staging and the results are within the threshold" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180098 (https://phabricator.wikimedia.org/T400352) (owner: 10AikoChou) [10:57:55] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:58:39] (03PS2) 10Kosta Harlan: hcaptcha: Fix observed issues in proxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1181099 (owner: 10Máté Szabó) [10:58:47] (03CR) 10Kosta Harlan: hcaptcha: Fix observed issues in proxy configuration (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/1181099 (owner: 10Máté Szabó) [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250822T0700) [11:00:04] jelto, arnoldokoth, and mutante: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250822T1100). [11:00:30] (03PS1) 10Muehlenhoff: Blacklist orangefs [puppet] - 10https://gerrit.wikimedia.org/r/1181110 [11:03:34] (03PS1) 10Santiago Faci: xLab: Deploy v0.8.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181111 (https://phabricator.wikimedia.org/T380592) [11:04:34] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:04:34] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:08:05] (03CR) 10Hnowlan: [C:03+2] hcaptcha: Fix observed issues in proxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1181099 (owner: 10Máté Szabó) [11:14:48] (03CR) 10Gkyziridis: [C:03+1] "LGTM! 
Thnx for working on this one" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180098 (https://phabricator.wikimedia.org/T400352) (owner: 10AikoChou) [11:18:54] (03Abandoned) 10Ayounsi: WIP: example config for Nokia SR-Linux [homer/public] - 10https://gerrit.wikimedia.org/r/1084107 (https://phabricator.wikimedia.org/T371088) (owner: 10Ayounsi) [11:19:41] (03Abandoned) 10Ayounsi: Initial OpenConfig/SONiC support to wmf-netbox [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/940515 (https://phabricator.wikimedia.org/T320638) (owner: 10Ayounsi) [11:22:52] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2157.codfw.wmnet with reason: Maintenance [11:22:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T401906)', diff saved to https://phabricator.wikimedia.org/P81693 and previous config saved to /var/cache/conftool/dbconfig/20250822-112259-fceratto.json [11:23:04] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [11:25:13] (03CR) 10Majavah: [C:03+2] hieradata: Add bastion-eqiad1-5/6 [puppet] - 10https://gerrit.wikimedia.org/r/1181100 (https://phabricator.wikimedia.org/T392689) (owner: 10Majavah) [11:28:13] fceratto@cumin1002 sanitize-wiki (PID 2800516) is awaiting input [11:28:51] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11110310 (10phaultfinder) [11:29:34] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:29:41] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:29:41] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:31:31] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:31:31] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.232 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:33:54] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11110341 (10phaultfinder) [11:35:12] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on install5003.wikimedia.org with reason: host reimage [11:35:40] 06SRE, 06Infrastructure-Foundations, 10netops: Investigate using BGP addpath for unicast IBGP spine/leaf pods - https://phabricator.wikimedia.org/T402640 (10cmooney) 03NEW p:05Triage→03Medium [11:38:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install5003.wikimedia.org with reason: host reimage [11:39:55] (03PS1) 10Brouberol: Upgrade the airflow image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181116 [11:40:10] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis amwikimedia, cnwikimedia, donatewiki, gewikimedia, grwikimedia, hiwikimedia, idwikimedia, maiwikimedia, ngwikimedia, nostalgiawiki, punjabiwikimedia, romdwikimedia, rswikimedia, votewiki, wbwikimedia in section s5 [11:42:25] 06SRE, 06Infrastructure-Foundations, 10netops: Investigate using BGP addpath for unicast IBGP spine/leaf pods - https://phabricator.wikimedia.org/T402640#11110357 (10cmooney) [11:42:41] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:42:41] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:44:34] (03CR) 10Andrew McAllister (WMDE): [C:03+1] Upgrade the airflow image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181116 (owner: 10Brouberol) [11:44:44] (03CR) 10Brouberol: [C:03+2] Upgrade the airflow image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181116 (owner: 10Brouberol) [11:45:41] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 9.494 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:45:41] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 9.595 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:46:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [11:47:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [11:50:50] (03PS1) 10Ayounsi: Add CI for python [homer/public] - 10https://gerrit.wikimedia.org/r/1181117 [11:56:30] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2157* gradually with 4 steps - Pooling in [11:57:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install5003.wikimedia.org with OS bookworm [11:57:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install5003.wikimedia.org [11:57:44] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11110391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host install5003.wikimedia.org with OS bookworm completed: - install5003 (**P... 
[12:00:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install6003.wikimedia.org [12:00:06] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:00:36] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11110396 (10Jclark-ctr) a:03Jclark-ctr [12:00:41] (03PS6) 10Ayounsi: Nokia: Add initial Python files for nokia switch system config [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [12:01:15] (03PS8) 10Cathal Mooney: Nokia: module for interface configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) [12:01:23] (03PS2) 10Cathal Mooney: Nokia: module for network-instance configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) [12:01:53] (03CR) 10CI reject: [V:04-1] Nokia: Add initial Python files for nokia switch system config [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [12:02:38] (03CR) 10CI reject: [V:04-1] Nokia: module for interface configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [12:02:45] (03CR) 10CI reject: [V:04-1] Nokia: module for network-instance configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [12:03:41] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install6003.wikimedia.org - jmm@cumin2002" [12:04:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037029 (owner: 10Muehlenhoff) [12:04:33] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2148* 
gradually with 4 steps - Pooling in [12:04:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install6003.wikimedia.org - jmm@cumin2002" [12:04:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:04:35] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install6003.wikimedia.org on all recursors [12:04:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install6003.wikimedia.org on all recursors [12:04:56] (03PS3) 10Muehlenhoff: cloudcontrol/codfw1dev:: Enable profile::auto_restarts::service for apache2 [puppet] - 10https://gerrit.wikimedia.org/r/1098962 (https://phabricator.wikimedia.org/T135991) [12:05:01] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install6003.wikimedia.org - jmm@cumin2002" [12:05:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install6003.wikimedia.org - jmm@cumin2002" [12:05:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install6003.wikimedia.org with OS bookworm [12:05:38] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11110403 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host install6003.wikimedia.org with OS bookworm [12:07:56] (03Abandoned) 10Muehlenhoff: phabricator/phab_epipe.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/655743 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [12:15:03] (03PS2) 10Cyndywikime: [Growth] enwiki: Deploy "Add a link" to 100% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 
(https://phabricator.wikimedia.org/T395524) [12:15:23] (03CR) 10Cyndywikime: "This patch is ready for review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: 10Cyndywikime) [12:16:23] (03PS7) 10Ayounsi: Nokia: Add initial Python files for nokia switch system config [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [12:16:23] (03PS9) 10Ayounsi: Nokia: module for interface configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [12:17:38] (03CR) 10CI reject: [V:04-1] Nokia: module for interface configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [12:22:08] (03PS5) 10Muehlenhoff: tlsproxy::localssl: Remove support for HTTP2 [puppet] - 10https://gerrit.wikimedia.org/r/1037029 [12:27:37] (03PS10) 10Ayounsi: Nokia: module for interface configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [12:27:37] (03PS3) 10Ayounsi: Nokia: module for network-instance configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [12:27:49] (03PS2) 10Stevemunene: Remove mention of an-druid100[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/1180529 (https://phabricator.wikimedia.org/T401116) [12:28:46] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on install6003.wikimedia.org with reason: host reimage [12:28:54] (03CR) 10CI reject: [V:04-1] Nokia: module for network-instance configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [12:29:15] (03CR) 10Stevemunene: Remove mention of an-druid100[1-2] (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/1180529 (https://phabricator.wikimedia.org/T401116) (owner: 10Stevemunene) [12:31:31] (03CR) 10Stevemunene: [C:03+2] Replace an-druid100[1-2] [alerts] - 10https://gerrit.wikimedia.org/r/1180526 (https://phabricator.wikimedia.org/T401116) (owner: 10Stevemunene) [12:33:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037029 (owner: 10Muehlenhoff) [12:33:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install6003.wikimedia.org with reason: host reimage [12:33:46] (03Merged) 10jenkins-bot: Replace an-druid100[1-2] [alerts] - 10https://gerrit.wikimedia.org/r/1180526 (https://phabricator.wikimedia.org/T401116) (owner: 10Stevemunene) [12:34:01] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181122 [12:34:26] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181123 [12:38:45] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: hw troubleshooting: disk (sdg) errors on ms-be1071 - https://phabricator.wikimedia.org/T402346#11110447 (10Jclark-ctr) @MatthewVernon I swapped the drive, but the status light was still flashing with the new drive. In iDRAC, I could see events for the driv... 
[12:41:58] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2157* gradually with 4 steps - Pooling in [12:44:00] (03PS1) 10Vgutierrez: varnish: Fix UnicodeDecodeError in varnishlog output parsing [puppet] - 10https://gerrit.wikimedia.org/r/1181125 (https://phabricator.wikimedia.org/T402634) [12:50:01] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2148* gradually with 4 steps - Pooling in [12:50:09] (03PS1) 10Brouberol: mediawiki: change the dump pods fsGroupChangePolicy to speed up volume mount [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181127 (https://phabricator.wikimedia.org/T402644) [12:50:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install6003.wikimedia.org with OS bookworm [12:50:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install6003.wikimedia.org [12:50:41] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11110476 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host install6003.wikimedia.org with OS bookworm completed: - install6003 (**P... 
[12:51:46] (03PS6) 10Slyngshede: P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) [12:54:01] (03CR) 10Fabfur: [C:03+1] "It's a GO for me, but I'd wait for Monday to merge it" [puppet] - 10https://gerrit.wikimedia.org/r/1181101 (https://phabricator.wikimedia.org/T291905) (owner: 10Vgutierrez) [12:54:30] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in [12:54:48] (03PS1) 10Máté Szabó: hcaptcha: Ignore cookie related headers [puppet] - 10https://gerrit.wikimedia.org/r/1181128 [12:56:12] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1045 [12:56:12] !log vriley@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudcephosd1045 [12:56:31] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1045 [12:56:31] !log vriley@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudcephosd1045 [12:59:44] (03CR) 10Slyngshede: P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [13:00:12] (03CR) 10Cathal Mooney: Nokia: module for network-instance configuration (034 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [13:01:03] (03PS2) 10Máté Szabó: hcaptcha: Ignore cookie related headers [puppet] - 10https://gerrit.wikimedia.org/r/1181128 [13:04:32] (03PS1) 10Kosta Harlan: hcaptcha: Delay challenge execution until submit [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181130 (https://phabricator.wikimedia.org/T402641) [13:04:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 25 
UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181130 (https://phabricator.wikimedia.org/T402641) (owner: 10Kosta Harlan) [13:05:50] (03PS1) 10Vgutierrez: haproxy: Provide basic X-Analytics data for blocked requests [puppet] - 10https://gerrit.wikimedia.org/r/1181131 [13:08:40] PROBLEM - MariaDB Replica SQL: s5 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error executing row event: Table azwikimedia.loginnotify_seen_net doesnt exist https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:09:18] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11110511 (10Jclark-ctr) All Devices added to Netbox [13:10:06] (03PS1) 10Cathal Mooney: Nokia: module for OSPF configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1181132 [13:12:06] (03PS1) 10Vgutierrez: haproxy: Stop sending X-Analytics-TLS to varnish [puppet] - 10https://gerrit.wikimedia.org/r/1181133 [13:12:06] (03PS1) 10Vgutierrez: varnish: Remove unused header X-Analytics-TLS [puppet] - 10https://gerrit.wikimedia.org/r/1181134 [13:12:56] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [13:15:44] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:15:52] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1045 [13:16:06] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1045 [13:16:38] PROBLEM - MariaDB Replica Lag: s5 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 635.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:16:40] PROBLEM - MariaDB Replica Lag: s5 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 638.08 
seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:16:40] PROBLEM - MariaDB Replica Lag: s5 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 638.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:17:00] PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 657.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:17:02] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:17:32] (03CR) 10Cathal Mooney: [C:03+2] Add new Nokia switches to ibgp pod e/f in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1180958 (https://phabricator.wikimedia.org/T402590) (owner: 10Cathal Mooney) [13:18:06] (03Merged) 10jenkins-bot: Add new Nokia switches to ibgp pod e/f in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1180958 (https://phabricator.wikimedia.org/T402590) (owner: 10Cathal Mooney) [13:18:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:22] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: hw troubleshooting: disk (sdg) errors on ms-be1071 - https://phabricator.wikimedia.org/T402346#11110574 (10MatthewVernon) 05Open→03Resolved @Jclark-ctr thanks! The replacement disk is now in operation. 
[13:23:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:10] (03CR) 10Cathal Mooney: [C:03+2] Add new IBGP cluster in eqiad with pod for row C/D Nokia switches [homer/public] - 10https://gerrit.wikimedia.org/r/1180953 (https://phabricator.wikimedia.org/T402588) (owner: 10Cathal Mooney) [13:24:44] (03Merged) 10jenkins-bot: Add new IBGP cluster in eqiad with pod for row C/D Nokia switches [homer/public] - 10https://gerrit.wikimedia.org/r/1180953 (https://phabricator.wikimedia.org/T402588) (owner: 10Cathal Mooney) [13:26:10] vriley@cumin1003 provision (PID 3204341) is awaiting input [13:27:00] (03CR) 10Ayounsi: Nokia: module for network-instance configuration (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [13:31:08] vriley@cumin1003 provision (PID 3204341) is awaiting input [13:31:10] (03PS2) 10Cathal Mooney: Nokia: module for OSPF configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1181132 [13:32:20] (03CR) 10CI reject: [V:04-1] Nokia: module for OSPF configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1181132 (owner: 10Cathal Mooney) [13:37:15] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11110626 (10ayounsi) a:03RobH FYI, ps2 has been alerting as well: Assigning to Rob as it's a remote site, and CCing Papaul, feel free to shuffle that. ` alertname = Sensor over limit... 
[13:38:53] (03PS2) 10Vgutierrez: haproxy: Stop sending X-Analytics-TLS to varnish [puppet] - 10https://gerrit.wikimedia.org/r/1181133 [13:38:53] (03PS2) 10Vgutierrez: varnish: Remove unused header X-Analytics-TLS [puppet] - 10https://gerrit.wikimedia.org/r/1181134 [13:39:13] (03PS1) 10Máté Szabó: varnish: don't set GeoIP cookie for hcaptcha proxy [puppet] - 10https://gerrit.wikimedia.org/r/1181138 [13:39:59] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2151* gradually with 4 steps - Pooling in [13:40:18] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:40:35] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:43:00] (03CR) 10Giuseppe Lavagetto: [C:03+1] "Logic is sound and the vogonscript bits look correct; I haven't tried them though." 
[puppet] - 10https://gerrit.wikimedia.org/r/1181131 (owner: 10Vgutierrez) [13:43:10] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.sanitarium_restart [13:44:29] (03PS3) 10Vgutierrez: haproxy: Stop sending X-Analytics-TLS to varnish [puppet] - 10https://gerrit.wikimedia.org/r/1181133 [13:44:29] (03PS3) 10Vgutierrez: varnish: Remove unused header X-Analytics-TLS [puppet] - 10https://gerrit.wikimedia.org/r/1181134 [13:45:40] RECOVERY - MariaDB Replica SQL: s5 on db1154 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:46:38] RECOVERY - MariaDB Replica Lag: s5 on clouddb1020 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:46:40] RECOVERY - MariaDB Replica Lag: s5 on db1154 is OK: OK slave_sql_lag Replication lag: 0.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:46:40] RECOVERY - MariaDB Replica Lag: s5 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:47:00] RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:48:59] (03PS4) 10Vgutierrez: haproxy: Stop sending X-Analytics-TLS to varnish [puppet] - 10https://gerrit.wikimedia.org/r/1181133 [13:48:59] (03PS4) 10Vgutierrez: varnish: Remove unused header X-Analytics-TLS [puppet] - 10https://gerrit.wikimedia.org/r/1181134 [13:51:03] (03CR) 10Ebernhardson: [C:03+2] cirrus: drop rc0 streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114956 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [13:51:08] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1045.mgmt.eqiad.wmnet with 
chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:51:51] (03Merged) 10jenkins-bot: cirrus: drop rc0 streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114956 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [13:52:05] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:54:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.sanitarium_restart (exit_code=0) [13:58:31] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1128 - https://phabricator.wikimedia.org/T401504#11110698 (10Jclark-ctr) @VRiley-WMF @wiki_willy Ben mentioned in previous tickets of both of ours to check that Data-Platform-SRE is added as a subtask. This might help get some t... [14:00:13] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1128 - https://phabricator.wikimedia.org/T401504#11110705 (10VRiley-WMF) Thank you @Jclark-ctr ! 
[14:04:23] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:05:12] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1045.eqiad.wmnet with OS bookworm [14:05:20] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11110717 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host cloudcephosd1045.eqiad.wmnet with OS bookworm [14:17:53] (03CR) 10JHathaway: [C:03+1] Blacklist orangefs [puppet] - 10https://gerrit.wikimedia.org/r/1181110 (owner: 10Muehlenhoff) [14:34:06] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1045.eqiad.wmnet with reason: host reimage [14:37:57] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1045.eqiad.wmnet with reason: host reimage [14:49:20] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:49:28] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:50:17] (03CR) 10Xcollazo: [C:03+1] "LGTM after considering the comment above." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1181127 (https://phabricator.wikimedia.org/T402644) (owner: 10Brouberol) [14:50:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 7.503 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:51:10] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:52:13] (03PS2) 10Brouberol: mediawiki: change the dump pods fsGroupChangePolicy to speed up volume mount [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181127 (https://phabricator.wikimedia.org/T402644) [14:54:12] (03CR) 10Brouberol: mediawiki: change the dump pods fsGroupChangePolicy to speed up volume mount (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181127 (https://phabricator.wikimedia.org/T402644) (owner: 10Brouberol) [14:57:02] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [14:57:55] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:00:07] vriley@cumin1003 reimage (PID 3209740) is awaiting input [15:04:34] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:04:34] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - 
https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:06:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T399249)', diff saved to https://phabricator.wikimedia.org/P81708 and previous config saved to /var/cache/conftool/dbconfig/20250822-150602-fceratto.json [15:06:07] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:08:38] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:16:20] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:16:28] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:17:16] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.209 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:20:18] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.248 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:21:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P81709 and previous config saved to /var/cache/conftool/dbconfig/20250822-152109-fceratto.json [15:24:34] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:28:38] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:29:34] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:33:09] (03CR) 10DCausse: [C:03+1] EventStream: Enable hive ingeestion for wcqs-external.sparql-query (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154300 (owner: 10Ebernhardson) [15:33:52] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11111013 (10phaultfinder) [15:34:17] (03CR) 10Brouberol: [C:03+2] mediawiki: change the dump pods fsGroupChangePolicy to speed up volume mount [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181127 (https://phabricator.wikimedia.org/T402644) (owner: 10Brouberol) [15:36:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P81710 and previous config saved to /var/cache/conftool/dbconfig/20250822-153617-fceratto.json [15:38:54] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11111039 (10phaultfinder) [15:43:20] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:43:28] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:45:18] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.443 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:45:20] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 1.237 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:48:20] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:48:28] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:50:20] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.550 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:50:24] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 5.891 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:51:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T399249)', diff saved to https://phabricator.wikimedia.org/P81711 and previous config saved to /var/cache/conftool/dbconfig/20250822-155124-fceratto.json [15:51:30] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:51:40] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1245.eqiad.wmnet with reason: Maintenance [15:59:04] (03CR) 10BCornwall: [C:03+1] Failover webproxy in ulsfo to new node [dns] - 10https://gerrit.wikimedia.org/r/1181096 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [15:59:28] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:00:01] (03CR) 
10Herron: [C:03+1] logstash: alert on unassigned shards and cluster status [alerts] - 10https://gerrit.wikimedia.org/r/1179226 (owner: 10Cwhite) [16:01:20] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:06:59] (03Abandoned) 10BCornwall: provision: Adjust thermal profile for F4 [cookbooks] - 10https://gerrit.wikimedia.org/r/1121086 (https://phabricator.wikimedia.org/T373993) (owner: 10BCornwall) [16:07:18] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.513 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:07:23] (03Abandoned) 10BCornwall: site.pp: Add new insetup::traffic codfw cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139559 (https://phabricator.wikimedia.org/T392851) (owner: 10BCornwall) [16:08:24] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 6.632 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:10:00] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1181158 [16:12:06] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1181158 (owner: 10Ncmonitor) [16:12:20] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:12:28] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:14:16] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.084 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:14:24] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 6.660 second 
response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:41:10] (03CR) 10Clare Ming: [C:03+2] xLab: Deploy v0.8.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181111 (https://phabricator.wikimedia.org/T380592) (owner: 10Santiago Faci) [16:42:55] (03Merged) 10jenkins-bot: xLab: Deploy v0.8.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181111 (https://phabricator.wikimedia.org/T380592) (owner: 10Santiago Faci) [16:43:33] (03PS1) 10Scott French: mediawiki: clean up php.version overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181149 (https://phabricator.wikimedia.org/T401721) [16:47:30] PROBLEM - BFD status on ssw1-e1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:47:32] PROBLEM - BFD status on ssw1-f1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:48:10] FIRING: [2x] BFDdown: BFD session down between ssw1-f1-codfw and 10.192.255.34 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:53:10] FIRING: [4x] BFDdown: BFD session down between ssw1-e1-codfw and 10.192.255.34 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:55:22] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:55:28] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:56:18] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.177 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:56:24] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 5.670 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:05:52] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181163
[17:07:46] (CR) Dzahn: [C:+2] site: add peopleweb role to people1005 [puppet] - https://gerrit.wikimedia.org/r/1180990 (https://phabricator.wikimedia.org/T402596) (owner: Dzahn)
[17:12:22] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:12:24] (CR) RLazarus: [C:+1] mediawiki: clean up php.version overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1181149 (https://phabricator.wikimedia.org/T401721) (owner: Scott French)
[17:12:28] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:12:39] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[17:13:10] FIRING: [20x] BFDdown: BFD session down between ssw1-e1-codfw and 10.192.255.34 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:13:11] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[17:14:13] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[17:14:20] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.624 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:14:24] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 5.734 second response time
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:14:25] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[17:15:05] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS bookworm
[17:17:02] SRE: provide envoyproxy package on trixie - https://phabricator.wikimedia.org/T402668 (Dzahn) NEW
[17:17:25] we are going to need envoy packaged for trixie
[17:19:05] mutante: already in progress as part of the version upgrade :)
[17:19:27] cool! thanks, rzl
[17:20:06] should be present in component/envoy-future already, at the next version, and it'll be copied into main sometime next week when we roll that version out everywhere
[17:20:26] you can merge that into T402584 if you like, or do something subtasky if you prefer :)
[17:20:27] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584
[17:20:36] (PS1) Dzahn: Revert "site: add peopleweb role to people1005" [puppet] - https://gerrit.wikimedia.org/r/1181168
[17:21:07] FIRING: ProbeDown: Service people1005:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1005:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:21:25] SRE: provide envoyproxy package on trixie - https://phabricator.wikimedia.org/T402668#11111270 (Dzahn)
[17:21:28] SRE, envoy, serviceops, Traffic, Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11111269 (Dzahn)
[17:21:29] making it a subtask, done
[17:21:32] (CR) BCornwall: [C:+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181163 (owner: Ncmonitor)
[17:22:05] 👍
[17:22:12] this kind of
thing was exactly what I wanted to find out by just testing the people role on trixie
[17:22:50] ah cool, I was going to say if you need it sooner we can just copy the existing package into trixie -- the magic of static binaries
[17:22:51] no rush at all, mostly was curious
[17:22:58] but sounds like no need, yeah
[17:23:02] yep:)
[17:24:40] (CR) Dzahn: [C:+2] Revert "site: add peopleweb role to people1005" [puppet] - https://gerrit.wikimedia.org/r/1181168 (owner: Dzahn)
[17:25:06] !log sudo -i docker-registryctl delete-tags docker-registry.discovery.wmnet/envoy-future:1.26.8-1 # T402584
[17:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:01] PROBLEM - people.wikimedia.org requires authentication on people1005 is CRITICAL: connect to address 10.64.32.95 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[17:31:45] meh:) as usual when you add a role and the monitoring gets applied right away.. that will go away soon
[17:32:42] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on people1005.eqiad.wmnet with reason: T402596
[17:32:47] T402596: upgrade people servers to trixie - https://phabricator.wikimedia.org/T402596
[17:35:21] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:35:29] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:35:51] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-coord1002.eqiad.wmnet with reason: host reimage
[17:38:10] FIRING: [20x] BFDdown: BFD session down between ssw1-e1-codfw and 10.192.255.34 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:38:21] RECOVERY - mailman list info on
lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.131 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:38:27] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 7.868 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:38:57] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181170
[17:39:25] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-coord1002.eqiad.wmnet with reason: host reimage
[17:39:39] (CR) BCornwall: [C:+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181170 (owner: Ncmonitor)
[17:53:02] SRE-SLO, SRE Observability, Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work: Create new SLO dashboard via Pyrra for Wikifunctions - https://phabricator.wikimedia.org/T394057#11111369 (RLazarus) @ecarg Just a heads-up, we've broken the config out into per-team files to make it a little ea...
[17:53:51] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-coord1002.eqiad.wmnet with OS bookworm
[17:54:36] (PS2) CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137
[18:00:09] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181172
[18:01:10] (CR) CI reject: [V:-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (owner: CDobbins)
[18:01:29] (CR) BCornwall: [C:+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181172 (owner: Ncmonitor)
[18:03:15] (PS3) CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137
[18:09:39] (CR) CI reject: [V:-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (owner: CDobbins)
[18:13:28] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:13:32] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:15:19] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 2.116 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:15:27] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 3.894 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:22:28] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:22:49] !log vriley@cumin1003 END (PASS) - Cookbook
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[18:22:50] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1045.eqiad.wmnet with OS bookworm
[18:22:57] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11111467 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host cloudcephosd1045.eqiad.wmnet with OS bookworm completed: - cloudcephosd1045 (**PASS**...
[18:23:10] FIRING: [20x] BFDdown: BFD session down between ssw1-e1-codfw and 10.192.255.34 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:23:22] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.993 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:28:17] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1045
[18:28:47] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1045
[18:29:25] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11111491 (VRiley-WMF)
[18:29:38] RECOVERY - BFD status on ssw1-f1-codfw.mgmt is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:29:38] RECOVERY - BFD status on ssw1-e1-codfw.mgmt is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:29:51] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11111495 (VRiley-WMF) Open→Resolved Everything should be completed with this ticket
[18:30:25] (PS4) CDobbins: sre.loadbalancer:
modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137
[18:31:02] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-test-coord1002.eqiad.wmnet with reason: supermicro
[18:32:56] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181175
[18:33:10] RESOLVED: [20x] BFDdown: BFD session down between ssw1-e1-codfw and 10.192.255.34 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:35:58] (CR) BCornwall: [C:+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181175 (owner: Ncmonitor)
[18:36:54] (CR) CI reject: [V:-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (owner: CDobbins)
[18:41:12] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[18:42:48] (PS1) Andrew Bogott: New attempt to put new cloudcephosd hosts online [puppet] - https://gerrit.wikimedia.org/r/1181178 (https://phabricator.wikimedia.org/T401693)
[18:46:33] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1045.eqiad.wmnet with OS bullseye
[18:51:14] !log krinkle@deploy1003 Started deploy [integration/docroot@5918d5e]: Add support for lcov
[18:51:25] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[18:51:27] !log krinkle@deploy1003 Finished deploy [integration/docroot@5918d5e]: Add support for lcov (duration: 00m 21s)
[18:56:42] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181181
[18:56:51] !log krinkle@deploy1003 Started deploy
[integration/docroot@2d9ffad]: (no justification provided)
[18:57:02] !log krinkle@deploy1003 Finished deploy [integration/docroot@2d9ffad]: (no justification provided) (duration: 00m 11s)
[18:57:25] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1045.eqiad.wmnet with OS bullseye
[18:57:55] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:57:56] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1045.eqiad.wmnet with OS bullseye
[18:58:26] (CR) BCornwall: [C:+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181181 (owner: Ncmonitor)
[19:04:34] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[19:04:34] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[19:05:51] (PS4) Krinkle: InitialiseSettings: Factor out ext-MobileFrontend.php to its own file [mediawiki-config] - https://gerrit.wikimedia.org/r/1011189
[19:05:53] (Abandoned) Krinkle: InitialiseSettings: Factor out ext-MobileFrontend.php to its own file [mediawiki-config] - https://gerrit.wikimedia.org/r/1011189 (owner: Krinkle)
[19:07:53] (Abandoned) Krinkle: CommonSettings: Rename unregistered wgStatsHost to local "statsHost" var [mediawiki-config] - https://gerrit.wikimedia.org/r/1063913 (https://phabricator.wikimedia.org/T365265) (owner: Krinkle)
[19:25:05] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1045.eqiad.wmnet with reason: host reimage
[19:27:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:27:55] PROBLEM - HTTPS non-canonical-redirect-33 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to verify wikiwpedia.com against wikipedia.com, *.en-wp.com, *.en-wp.org, *.mediawiki.com, *.voyagewiki.com, *.voyagewiki.org, *.wiikipedia.com, *.wikibook.com, *.wikibooks.com, *.wikiepdia.com, *.wikiepdia.org, *.wikiipedia.org, *.wikijunior.com, *.wikijunior.net, *.wikijunior.org, *.wikipedia.com, en-wp.com, en-wp.org, mediawiki.com, voyagewiki.c
[19:27:55] gewiki.org, wiikipedia.com, wikibook.com, wikibooks.com, wikiepdia.com, wikiepdia.org, wikiipedia.org, wikijunior.com, wikijunior.net, wikijunior.org https://wikitech.wikimedia.org/wiki/Ncredir
[19:29:00] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1045.eqiad.wmnet with reason: host reimage
[19:29:34] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[19:31:39] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:31:57] RECOVERY - HTTPS non-canonical-redirect-33 on ncredir7003 is OK: SSL OK - Certificate wikiwpedia.com valid until 2025-11-20 18:04:21 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Ncredir
[19:32:37]
RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 6.782 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:34:05] (PS1) Andrew Bogott: Add new dummy ssh keys for trove VMs [labs/private] - https://gerrit.wikimedia.org/r/1181184 (https://phabricator.wikimedia.org/T402317)
[19:34:25] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 3.389 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:35:00] (PS2) Andrew Bogott: New attempt to put new cloudcephosd hosts online [puppet] - https://gerrit.wikimedia.org/r/1181178 (https://phabricator.wikimedia.org/T401693)
[19:35:01] (PS1) Andrew Bogott: Trove: install backdoor VM keys on cloudcontrols [puppet] - https://gerrit.wikimedia.org/r/1181185 (https://phabricator.wikimedia.org/T402317)
[19:35:14] (CR) Andrew Bogott: [V:+2 C:+2] Add new dummy ssh keys for trove VMs [labs/private] - https://gerrit.wikimedia.org/r/1181184 (https://phabricator.wikimedia.org/T402317) (owner: Andrew Bogott)
[19:35:28] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[19:35:28] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[19:35:51] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1181185 (https://phabricator.wikimedia.org/T402317) (owner: Andrew Bogott)
[19:36:59] SRE, Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11111793 (Josve05a) I say that this ticket can be closed, as yet another of the VRT tickets has been confirmed to have had their issues fixed.
[19:37:12] (Abandoned) Andrew Bogott: Cloudcephosd1048: Configure ceph with a single nic [puppet] - https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) (owner: Andrew Bogott)
[19:37:15] (CR) CI reject: [V:-1] Trove: install backdoor VM keys on cloudcontrols [puppet] - https://gerrit.wikimedia.org/r/1181185 (https://phabricator.wikimedia.org/T402317) (owner: Andrew Bogott)
[19:37:35] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[19:38:19] (PS2) Andrew Bogott: Trove: install backdoor VM keys on cloudcontrols [puppet] - https://gerrit.wikimedia.org/r/1181185 (https://phabricator.wikimedia.org/T402317)
[19:38:45] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181186
[19:38:55] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11111796 (phaultfinder)
[19:40:28] SRE, Traffic, SecTeam-Processed: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11111803 (sbassett) Open→Resolved a:sbassett
[19:42:20] (PS3) Andrew Bogott: Trove: install backdoor VM keys on cloudcontrols [puppet] - https://gerrit.wikimedia.org/r/1181185 (https://phabricator.wikimedia.org/T402317)
[19:42:26] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1181185 (https://phabricator.wikimedia.org/T402317) (owner: Andrew Bogott)
[19:42:37] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[19:43:51] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11111828 (phaultfinder)
[19:45:15] (CR)
BCornwall: [C:+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181186 (owner: Ncmonitor)
[19:45:43] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1045.eqiad.wmnet with OS bullseye
[19:46:34] (PS4) Andrew Bogott: Trove: install backdoor VM keys on cloudcontrols [puppet] - https://gerrit.wikimedia.org/r/1181185 (https://phabricator.wikimedia.org/T402317)
[19:46:56] (CR) Andrew Bogott: [C:+2] New attempt to put new cloudcephosd hosts online [puppet] - https://gerrit.wikimedia.org/r/1181178 (https://phabricator.wikimedia.org/T401693) (owner: Andrew Bogott)
[19:51:40] (PS4) Krinkle: [WIP] varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - https://gerrit.wikimedia.org/r/1180969
[19:51:45] (PS6) Scott French: P:etcd::tlsproxy: fix notify behavior for PKI certs [puppet] - https://gerrit.wikimedia.org/r/1164264 (https://phabricator.wikimedia.org/T352245)
[19:51:45] (PS4) Scott French: hieradata: pilot cfssl/pki for nginx on conf2006 [puppet] - https://gerrit.wikimedia.org/r/1164298 (https://phabricator.wikimedia.org/T352245)
[19:55:09] (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1164264 (https://phabricator.wikimedia.org/T352245) (owner: Scott French)
[19:55:17] (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1164298 (https://phabricator.wikimedia.org/T352245) (owner: Scott French)
[19:56:32] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[19:56:43] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[19:57:30] ops-eqiad, SRE, cloud-services-team, DC-Ops, Patch-For-Review: Put cloudcephosd10[42-47] in service -
https://phabricator.wikimedia.org/T401693#11111861 (Andrew)
[19:58:01] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11111862 (Andrew) Resolved→Open This is getting very close! I still see ping failures with cloudcephosd1045, probably because the second network connection isn't properly configured...
[19:58:46] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[19:58:58] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[20:05:11] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[20:05:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:05:39] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:06:50] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[20:07:25] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.952 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:07:29] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:08:57] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181191
[20:09:13] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with
chassis set policy GRACEFUL_RESTART
[20:10:52] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[20:12:18] (CR) BCornwall: [C:+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181191 (owner: Ncmonitor)
[20:13:40] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[20:15:15] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[20:17:53] (PS5) Krinkle: [WIP] varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - https://gerrit.wikimedia.org/r/1180969
[20:20:36] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:20:40] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:21:39] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[20:21:46] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[20:23:38] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 8.574 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:24:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.011 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:24:49] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with
chassis set policy GRACEFUL_RESTART
[20:26:28] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[20:27:45] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181193
[20:32:36] (PS6) Krinkle: [WIP] varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - https://gerrit.wikimedia.org/r/1180969
[20:32:45] (PS8) Scott French: hieradata: use cfssl/pki for nginx on all codfw configcluster hosts [puppet] - https://gerrit.wikimedia.org/r/1090585 (https://phabricator.wikimedia.org/T352245)
[20:32:46] (PS9) Scott French: hieradata: use cfssl/pki for nginx on all configcluster hosts [puppet] - https://gerrit.wikimedia.org/r/1090586 (https://phabricator.wikimedia.org/T352245)
[20:33:28] (CR) BCornwall: [C:+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181193 (owner: Ncmonitor)
[20:34:28] (CR) Nanocloud: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1164264 (https://phabricator.wikimedia.org/T352245) (owner: Scott French)
[20:34:31] (CR) Nanocloud: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1164264 (https://phabricator.wikimedia.org/T352245) (owner: Scott French)
[20:35:05] (Abandoned) Máté Szabó: varnish: don't set GeoIP cookie for hcaptcha proxy [puppet] - https://gerrit.wikimedia.org/r/1181138 (owner: Máté Szabó)
[20:36:37] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host people2004.codfw.wmnet
[20:36:38] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox
[20:37:38] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:37:40] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:38:48] (PS1) Dzahn: site: add regex to include people2004 with insetup role [puppet] - https://gerrit.wikimedia.org/r/1181194 (https://phabricator.wikimedia.org/T402596)
[20:39:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.408 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:39:36] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 5.302 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:40:24] (CR) Dzahn: [C:+2] site: add regex to include people2004 with insetup role [puppet] - https://gerrit.wikimedia.org/r/1181194 (https://phabricator.wikimedia.org/T402596) (owner: Dzahn)
[20:41:18] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people2004.codfw.wmnet - dzahn@cumin1002"
[20:41:22] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people2004.codfw.wmnet - dzahn@cumin1002"
[20:41:23] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:41:23] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache people2004.codfw.wmnet on all recursors
[20:41:26] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people2004.codfw.wmnet on all recursors
[20:41:56] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM people2004.codfw.wmnet - dzahn@cumin1002"
[20:42:01] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM people2004.codfw.wmnet - dzahn@cumin1002"
[20:44:07]
!log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host people2004.codfw.wmnet with OS trixie
[20:47:30] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181195
[20:50:20] (CR) BCornwall: [C:+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181195 (owner: Ncmonitor)
[21:03:04] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on people2004.codfw.wmnet with reason: host reimage
[21:08:09] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on people2004.codfw.wmnet with reason: host reimage
[21:14:09] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[21:15:48] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[21:23:04] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host people2004.codfw.wmnet with OS trixie
[21:23:04] !log dzahn@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host people2004.codfw.wmnet
[21:24:47] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:24:47] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:27:15] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181197
[21:27:25] (PS1) Dduvall: profile::buildkitd: Standalone buildkitd profile [puppet] - https://gerrit.wikimedia.org/r/1181198 (https://phabricator.wikimedia.org/T390119)
[21:27:29] (PS1) Dduvall: deployment_server: buildkitd for MediaWiki image builds
[puppet] - https://gerrit.wikimedia.org/r/1181199 (https://phabricator.wikimedia.org/T392526) [21:27:43] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 7.168 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:27:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.205 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:31:55] (CR) BCornwall: [C:+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181197 (owner: Ncmonitor) [21:35:22] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2006.codfw.wmnet with reason: sleep test [21:51:25] (CR) Nanocloud: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1090586 (https://phabricator.wikimedia.org/T352245) (owner: Scott French) [22:09:50] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:09:50] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:12:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 3.526 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:12:44] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 3.616 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:23:52] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:23:52] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:25:19] !log
fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1247.eqiad.wmnet with reason: Maintenance [22:25:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T399249)', diff saved to https://phabricator.wikimedia.org/P81712 and previous config saved to /var/cache/conftool/dbconfig/20250822-222526-fceratto.json [22:25:31] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [22:25:48] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.344 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:25:48] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 4.473 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:35:34] (PS1) Dzahn: gerrit: block another subnet owned by AS136907 [puppet] - https://gerrit.wikimedia.org/r/1181210 [22:35:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:35:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:36:16] (CR) Dzahn: [C:+2] gerrit: block another subnet owned by AS136907 [puppet] - https://gerrit.wikimedia.org/r/1181210 (owner: Dzahn) [22:36:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 1.540 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:36:46] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 1.704 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:57:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:57:54] PROBLEM -
mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:57:55] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:02:52] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.032 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:02:52] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 7.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:04:21] I am going to increase the check interval on that one to reduce noise. [23:04:34] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:04:34] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:10:18] (PS1) Dzahn: lists::monitoring: increase check_interval to 5 minutes [puppet] - https://gerrit.wikimedia.org/r/1181213 [23:14:06] (PS2) Dzahn: lists::monitoring: increase check_interval to 5 minutes [puppet] - https://gerrit.wikimedia.org/r/1181213 [23:20:37] (CR) Dzahn: [C:+2] lists::monitoring: increase check_interval to 5 minutes [puppet] - https://gerrit.wikimedia.org/r/1181213 (owner: Dzahn) [23:29:34] FIRING:
NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:38:41] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1181214 [23:38:41] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1181214 (owner: TrainBranchBot) [23:39:48] (CR) Dzahn: "I noticed in puppet compiler output that it is switching the docker-registry URL from discovery.wmnet to wikimedia.org in the systemd serv" [puppet] - https://gerrit.wikimedia.org/r/1181198 (https://phabricator.wikimedia.org/T390119) (owner: Dduvall) [23:43:57] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11112229 (phaultfinder) [23:48:52] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11112230 (phaultfinder) [23:51:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:51:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:53:08] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1181214 (owner: TrainBranchBot) [23:53:46] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 1.389 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:53:46] RECOVERY - mailman
archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 1.515 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring