[00:01:56] SRE, ops-eqiad, DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (Eevans) The new SSD was picked up as `/dev/sdd` (instead of `/dev/sdc`), so I rebooted the host (and the new device came up as `sdc`). Afterward, I copied the partition t...
[00:03:57] !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Raise LoginNotify minimum log level to info T174200 (duration: 06m 51s)
[00:04:04] T174200: Make logging more sensible - https://phabricator.wikimedia.org/T174200
[00:06:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT replicasets) on k8s@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:09:46] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[00:11:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT replicasets) on k8s@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:13:58] (PS1) Tim Starling: Hooks: Do not attempt user creation when there's no username [extensions/LoginNotify] (wmf/1.41.0-wmf.24) - https://gerrit.wikimedia.org/r/953662 (https://phabricator.wikimedia.org/T345373)
[00:14:42] (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:22:31] (CR) Tim Starling: [C: +2] Hooks: Do not attempt user creation when there's no username [extensions/LoginNotify] (wmf/1.41.0-wmf.24) - https://gerrit.wikimedia.org/r/953662 (https://phabricator.wikimedia.org/T345373) (owner: Tim Starling)
[00:25:01] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:27:46] (Merged) jenkins-bot: Hooks: Do not attempt user creation when there's no username [extensions/LoginNotify] (wmf/1.41.0-wmf.24) - https://gerrit.wikimedia.org/r/953662 (https://phabricator.wikimedia.org/T345373) (owner: Tim Starling)
[00:29:43] SRE-tools, Spicerack: Support cookbooks resume after user interruption - https://phabricator.wikimedia.org/T345402 (Fabfur)
[00:30:57] SRE, Traffic-Icebox, HTTPS, Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (ssingh)
[00:31:05] SRE, Traffic, Epic: Deploy Wikimedia DNS: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (ssingh)
[00:39:06] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/953496
[00:39:12] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/953496 (owner: TrainBranchBot)
[00:43:28] !log tstarling@deploy1002 Synchronized php-1.41.0-wmf.24/extensions/LoginNotify/includes/Hooks.php: fix production error T345373 (duration: 06m 13s)
[00:43:31] T345373: TypeError: Argument 1 passed to MediaWiki\User\UserFactory::newFromName() must be of the type string, null given, called in /srv/mediawiki/php-1.41.0-wmf.24/extensions/LoginNotify/includes/Hooks.php on line 42 - https://phabricator.wikimedia.org/T345373
[00:45:39] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:55:25] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/953496 (owner: TrainBranchBot)
[02:08:57] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:28:57] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:33:57] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:13] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.239 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:39:33] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.144 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:33:13] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:41:45] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2023-08-30 14:26:13 (4751 GiB, +0.6 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:46:01] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:50:29] RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2023-08-30 14:25:04 (4751 GiB, +0.6 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[04:14:43] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:16:57] PROBLEM - Check systemd state on an-worker1145 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:15:32] (CR) Muehlenhoff: [C: +2] package_builder: Clean up lintian setup [puppet] - https://gerrit.wikimedia.org/r/954068 (owner: Muehlenhoff)
[05:17:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet
[05:21:35] RECOVERY - Check systemd state on an-worker1145 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:24:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet
[05:25:57] PROBLEM - Check systemd state on an-worker1145 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:28:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)- https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:30:11] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[05:31:31] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1606 days) https://wikitech.wikimedia.org/wiki/Logs
[05:31:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet
[05:31:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet
[05:33:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)- https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:36:57] (CR) Muehlenhoff: [C: +1] "Nice detective work on https://phabricator.wikimedia.org/T344829!" [puppet] - https://gerrit.wikimedia.org/r/954120 (https://phabricator.wikimedia.org/T344829) (owner: Cathal Mooney)
[05:38:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet
[05:56:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230901T0600)
[06:01:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:09:29] RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2023-08-30 14:26:13 (4813 GiB, +0.6 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:13:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet
[06:15:54] SRE-swift-storage, Commons, MediaWiki-Uploading, MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (GPSLeo) I think this is not linked to the error ra...
[06:20:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet
[06:20:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet
[06:21:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet
[06:25:19] RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2023-08-30 14:25:04 (4813 GiB, +0.6 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:26:55] (PS1) Muehlenhoff: Remove bast4004/bast5003/bast6002 as bastions [puppet] - https://gerrit.wikimedia.org/r/954150
[06:30:10] (PS1) Marostegui: db1119: Testing host [puppet] - https://gerrit.wikimedia.org/r/954151
[06:30:36] (CR) Marostegui: [C: +2] db1119: Testing host [puppet] - https://gerrit.wikimedia.org/r/954151 (owner: Marostegui)
[06:31:07] (CR) Muehlenhoff: [C: +2] Remove bast4004/bast5003/bast6002 as bastions [puppet] - https://gerrit.wikimedia.org/r/954150 (owner: Muehlenhoff)
[06:31:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5005.eqsin.wmnet
[06:32:24] (PS1) Marostegui: wiki-replicas.sql: querysampler is back on the wikireplicas [puppet] - https://gerrit.wikimedia.org/r/954152
[06:33:17] (CR) Marostegui: [C: +2] wiki-replicas.sql: querysampler is back on the wikireplicas [puppet] - https://gerrit.wikimedia.org/r/954152 (owner: Marostegui)
[06:33:57] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:34:03] SRE, Anti-Harassment, Data-Engineering, Traffic, and 2 others: Include User-Agent Client Hints in WebRequest logs - https://phabricator.wikimedia.org/T337947 (kostajh)
[06:38:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5005.eqsin.wmnet
[06:38:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5005.eqsin.wmnet
[06:43:46] SRE, Data-Persistence, Performance-Team, serviceops, Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (Marostegui)
[06:44:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable- https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[06:46:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet
[06:49:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable- https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[06:55:52] (CR) Filippo Giunchedi: [C: +1] librenms: Add PHP version for Debian Bookworm hosts [puppet] - https://gerrit.wikimedia.org/r/954143 (https://phabricator.wikimedia.org/T344136) (owner: Andrea Denisse)
[06:56:43] (CR) Filippo Giunchedi: [C: +1] "No concerns on my end, LGTM" [puppet] - https://gerrit.wikimedia.org/r/953495 (https://phabricator.wikimedia.org/T345358) (owner: Cwhite)
[06:58:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5006.eqsin.wmnet
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230901T0700)
[07:00:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1002.wikimedia.org
[07:04:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1002.wikimedia.org
[07:05:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5006.eqsin.wmnet
[07:05:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet
[07:05:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc2002.wikimedia.org
[07:05:45] (CR) Filippo Giunchedi: "LGTM, I've added some WMCS folks as heads-up" [puppet] - https://gerrit.wikimedia.org/r/954104 (owner: Majavah)
[07:09:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2002.wikimedia.org
[07:16:22] (PS2) Filippo Giunchedi: hieradata: add jaeger collector to service catalog [puppet] - https://gerrit.wikimedia.org/r/952151 (https://phabricator.wikimedia.org/T344253)
[07:16:26] !log failover Ganeti master in eqsin to ganeti5004
[07:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:22] PROBLEM - ganeti-wconfd running on ganeti5007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[07:20:37] (CR) Filippo Giunchedi: hieradata: add jaeger collector to service catalog (2 comments) [puppet] - https://gerrit.wikimedia.org/r/952151 (https://phabricator.wikimedia.org/T344253) (owner: Filippo Giunchedi)
[07:25:22] (PS6) Filippo Giunchedi: mesh: add tracing support [deployment-charts] - https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563)
[07:25:24] (PS1) Filippo Giunchedi: mesh: new networkpolicy version [deployment-charts] - https://gerrit.wikimedia.org/r/954210 (https://phabricator.wikimedia.org/T320563)
[07:25:30] (CR) Filippo Giunchedi: mesh: add tracing support (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi)
[07:29:31] PROBLEM - Ganeti memory on ganeti1019 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (278306) = 12.7% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[07:31:00] (CR) Abijeet Patro: [C: +1] Update MinT to 2023-08-31-061147-production [deployment-charts] - https://gerrit.wikimedia.org/r/954005 (https://phabricator.wikimedia.org/T336683) (owner: KartikMistry)
[07:34:05] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.214 second response time https://wikitech.wikimedia.org/wiki/Swift
[07:34:27] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 173
[07:34:54] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 173
[07:34:59] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.541 second response time https://wikitech.wikimedia.org/wiki/Swift
[07:39:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200- https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:40:09] (PS1) Muehlenhoff: graphite: Remove stretch compat [puppet] - https://gerrit.wikimedia.org/r/954211
[07:43:10] (PS1) Muehlenhoff: Update comment [puppet] - https://gerrit.wikimedia.org/r/954212
[07:44:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200- https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:44:32] (CR) Filippo Giunchedi: [C: +1] graphite: Remove stretch compat [puppet] - https://gerrit.wikimedia.org/r/954211 (owner: Muehlenhoff)
[07:44:44] (CR) Muehlenhoff: [C: +2] graphite: Remove stretch compat [puppet] - https://gerrit.wikimedia.org/r/954211 (owner: Muehlenhoff)
[07:44:55] PROBLEM - Host an-worker1145 is DOWN: PING CRITICAL - Packet loss = 100%
[07:45:20] (CR) David Caro: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/954104 (owner: Majavah)
[07:46:46] (CR) Muehlenhoff: [C: +2] Update comment [puppet] - https://gerrit.wikimedia.org/r/954212 (owner: Muehlenhoff)
[07:47:15] RECOVERY - Host an-worker1145 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[07:47:45] RECOVERY - Check systemd state on an-worker1145 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:51:26] (PS3) Filippo Giunchedi: hieradata: add jaeger collector to service catalog [puppet] - https://gerrit.wikimedia.org/r/952151 (https://phabricator.wikimedia.org/T344253)
[07:52:31] (CR) JMeybohm: [C: +1] hieradata: add jaeger collector to service catalog [puppet] - https://gerrit.wikimedia.org/r/952151 (https://phabricator.wikimedia.org/T344253) (owner: Filippo Giunchedi)
[07:56:51] (PS1) Muehlenhoff: Additional antelope updates for stretch [puppet] - https://gerrit.wikimedia.org/r/954216
[07:57:50] (PS1) Muehlenhoff: profile::simplelamp2: Remove obsolete OS check [puppet] - https://gerrit.wikimedia.org/r/954217
[07:58:28] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a3-codfw - cmooney@cumin1001"
[07:58:28] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[08:01:40] (CR) Muehlenhoff: [C: +2] profile::simplelamp2: Remove obsolete OS check [puppet] - https://gerrit.wikimedia.org/r/954217 (owner: Muehlenhoff)
[08:13:48] (PS2) Cathal Mooney: Correct sysctl value for net.ipv4.tcp_min_snd_mss [puppet] - https://gerrit.wikimedia.org/r/954120 (https://phabricator.wikimedia.org/T344829)
[08:14:43] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:20:08] (PS1) Filippo Giunchedi: wmnet: add jaeger records for ingress [dns] - https://gerrit.wikimedia.org/r/954218 (https://phabricator.wikimedia.org/T344253)
[08:20:41] (PS2) Gehel: datahub: add oidc production settings [deployment-charts] - https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene)
[08:21:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5007.eqsin.wmnet
[08:24:49] (PS1) JMeybohm: services_proxy: Lower wikifunctions timeout to 15s [puppet] - https://gerrit.wikimedia.org/r/954219
[08:26:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet
[08:30:01] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a3-codfw.mgmt.codfw.wmnet
[08:32:15] (PS1) Muehlenhoff: prometheus mysqld_exporters: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/954220
[08:32:38] (CR) CI reject: [V: -1] prometheus mysqld_exporters: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/954220 (owner: Muehlenhoff)
[08:32:42] (CR) Clément Goubert: [C: +2] mw-api-ext, mw-web: Raise total replicas to 13 [deployment-charts] - https://gerrit.wikimedia.org/r/954000 (https://phabricator.wikimedia.org/T341780) (owner: Clément Goubert)
[08:33:23] (Merged) jenkins-bot: mw-api-ext, mw-web: Raise total replicas to 13 [deployment-charts] - https://gerrit.wikimedia.org/r/954000 (https://phabricator.wikimedia.org/T341780) (owner: Clément Goubert)
[08:33:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet
[08:33:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5007.eqsin.wmnet
[08:34:04] (CR) Hashar: "I would link directly to the sign up page instead: https://idm.wikimedia.org/signup/" [puppet] - https://gerrit.wikimedia.org/r/953967 (https://phabricator.wikimedia.org/T345226) (owner: Slyngshede)
[08:34:16] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a4-codfw.mgmt.codfw.wmnet
[08:34:18] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[08:34:51] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[08:35:12] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[08:35:55] (PS2) Muehlenhoff: prometheus mysqld_exporters: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/954220
[08:36:19] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[08:36:43] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a4-codfw - cmooney@cumin1001"
[08:37:02] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[08:37:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a4-codfw - cmooney@cumin1001"
[08:37:31] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:38:38] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[08:38:59] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[08:39:33] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[08:39:47] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[08:40:04] !log Raised mw-web and mw-api-ext capacity by ~30% - T341780
[08:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:07] T341780: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780
[08:40:18] (CR) JMeybohm: [C: +1] wmnet: add jaeger records for ingress [dns] - https://gerrit.wikimedia.org/r/954218 (https://phabricator.wikimedia.org/T344253) (owner: Filippo Giunchedi)
[08:41:45] (CR) Alexandros Kosiaris: [C: -1] services_proxy: Lower wikifunctions timeout to 15s (1 comment) [puppet] - https://gerrit.wikimedia.org/r/954219 (owner: JMeybohm)
[08:42:49] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/954220 (owner: Muehlenhoff)
[08:45:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1003.eqiad.wmnet
[08:46:49] (PS2) JMeybohm: services_proxy: Lower wikifunctions timeout to 15s [puppet] - https://gerrit.wikimedia.org/r/954219
[08:49:06] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on testreduce1002.eqiad.wmnet with reason: WIP
[08:49:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on testreduce1002.eqiad.wmnet with reason: WIP
[08:49:23] (CR) JMeybohm: services_proxy: Lower wikifunctions timeout to 15s (1 comment) [puppet] - https://gerrit.wikimedia.org/r/954219 (owner: JMeybohm)
[08:49:53] (PS1) Muehlenhoff: Add parsoid::testreduce role to testreduce1002 [puppet] - https://gerrit.wikimedia.org/r/954221 (https://phabricator.wikimedia.org/T345220)
[08:50:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1003.eqiad.wmnet
[08:51:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping2003.codfw.wmnet
[08:54:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2003.codfw.wmnet
[08:59:34] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[09:00:28] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[09:02:19] (CR) Clément Goubert: [C: +2] mw-on-k8s: Raise traffic to 4% [puppet] - https://gerrit.wikimedia.org/r/954002 (https://phabricator.wikimedia.org/T341780) (owner: Clément Goubert)
[09:02:24] !log Push 4% of global traffic to mw-on-k8s - T341780
[09:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:27] T341780: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780
[09:02:32] moritzm: head's up ^
[09:04:07] !log Running puppet on 'A:cp-text and P{P:trafficserver::backend}' - T341780
[09:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:29] thx
[09:05:19] (CR) LSobanski: "Adding Eoghan as Jelto is out for a while." [puppet] - https://gerrit.wikimedia.org/r/954072 (https://phabricator.wikimedia.org/T264231) (owner: Hashar)
[09:08:42] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a4-codfw.mgmt.codfw.wmnet
[09:11:14] (CR) Muehlenhoff: [C: +1] "Looks good. I've also checked that net.ipv4.route.min_pmtu is also available in 4.19/Buster." [puppet] - https://gerrit.wikimedia.org/r/954120 (https://phabricator.wikimedia.org/T344829) (owner: Cathal Mooney)
[09:14:05] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2058.codfw.wmnet
[09:14:15] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1059.eqiad.wmnet
[09:14:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping1003.eqiad.wmnet
[09:15:04] (CR) Filippo Giunchedi: [C: +2] wmnet: add jaeger records for ingress [dns] - https://gerrit.wikimedia.org/r/954218 (https://phabricator.wikimedia.org/T344253) (owner: Filippo Giunchedi)
[09:18:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1003.eqiad.wmnet
[09:19:54] (PS1) Elukey: admin_ng: avoid knative resource watch/lookup for system ns on ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/954224
[09:21:13] (CR) Elukey: "More info https://knative.dev/docs/serving/webhook-customizations/" [deployment-charts] - https://gerrit.wikimedia.org/r/954224 (owner: Elukey)
[09:22:02] (CR) FNegri: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/954216 (owner: Muehlenhoff)
[09:22:52] (PS2) Elukey: admin_ng: avoid knative resource watch/lookup for system ns on ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/954224
[09:23:23] (PS3) Elukey: admin_ng: avoid knative resource watch/lookup for system ns on ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/954224
[09:24:06] (CR) Vgutierrez: [C: +1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/954120 (https://phabricator.wikimedia.org/T344829) (owner: Cathal Mooney)
[09:25:54] (CR) Elukey: [C: +2] admin_ng: avoid knative resource watch/lookup for system ns on ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/954224 (owner: Elukey)
[09:26:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2058.codfw.wmnet
[09:29:35] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:29:54] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:31:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet
[09:32:14] (CR) Muehlenhoff: [C: +2] Additional antelope updates for stretch [puppet] - https://gerrit.wikimedia.org/r/954216 (owner: Muehlenhoff)
[09:32:31] (CR) Majavah: [V: +1 C: +2] icinga: Don't tie wikitech-static alerts to cloudweb hosts [puppet] - https://gerrit.wikimedia.org/r/954104 (owner: Majavah)
[09:34:08] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:34:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet
[09:35:47] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:35:49] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a5-codfw.mgmt.codfw.wmnet
[09:35:51] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[09:36:43] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[09:36:49] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1059.eqiad.wmnet
[09:37:02] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[09:37:35] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (Marostegui) @RobH db1131 is no longer a master, it can be moved if we want to. We just need to depool it and stop mariadb beforehand.
[09:37:46] (PS3) Cathal Mooney: Correct sysctl value for net.ipv4.tcp_min_snd_mss [puppet] - https://gerrit.wikimedia.org/r/954120 (https://phabricator.wikimedia.org/T344829)
[09:37:49] (PS3) JMeybohm: services_proxy: Lower wikifunctions timeout to 15s [puppet] - https://gerrit.wikimedia.org/r/954219
[09:37:51] (CR) Marostegui: [C: +1] prometheus mysqld_exporters: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/954220 (owner: Muehlenhoff)
[09:38:01] (CR) Cathal Mooney: Correct sysctl value for net.ipv4.tcp_min_snd_mss (1 comment) [puppet] - https://gerrit.wikimedia.org/r/954120 (https://phabricator.wikimedia.org/T344829) (owner: Cathal Mooney)
[09:38:02] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a5-codfw - cmooney@cumin1001"
[09:38:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2002.codfw.wmnet
[09:38:54] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a5-codfw - cmooney@cumin1001"
[09:38:54] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:38:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:39:13] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[09:39:25] (CR) Muehlenhoff: [C: +2] Add parsoid::testreduce role to testreduce1002 [puppet] - https://gerrit.wikimedia.org/r/954221 (https://phabricator.wikimedia.org/T345220) (owner: Muehlenhoff)
[09:40:17] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[09:41:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:42:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2002.codfw.wmnet
[09:43:04] SRE, Data-Persistence, Performance-Team, serviceops, Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (ayounsi)
[09:43:12] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (ayounsi)
[09:44:17] (PS1) Clément Goubert: P:mediawiki::maintenance: CampaignEvents periodic [puppet] - https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984)
[09:46:26] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43110/console" [puppet] - https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984) (owner: Clément Goubert)
[09:46:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:48:57] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:50:32] (03PS1) 10Elukey: admin_ng: allow webhook to watch istio/knative ns and increase resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/954247 [09:50:42] (03PS1) 10Mvolz: rest-gateway: fix citoid regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/954248 (https://phabricator.wikimedia.org/T329049) [09:53:23] (03PS2) 10Elukey: admin_ng: allow webhook to watch istio/knative ns and increase resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/954247 [09:53:57] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:54:58] (03PS2) 10Clément Goubert: P:mediawiki::maintenance: CampaignEvents periodic [puppet] - 10https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984) [09:55:00] (03PS1) 10Clément Goubert: P:mediawiki::periodic_job: Add splay parameter [puppet] - 10https://gerrit.wikimedia.org/r/954249 (https://phabricator.wikimedia.org/T339984) [09:55:29] (03CR) 10CI reject: [V: 04-1] P:mediawiki::periodic_job: Add splay parameter [puppet] - 10https://gerrit.wikimedia.org/r/954249 (https://phabricator.wikimedia.org/T339984) (owner: 10Clément Goubert) [09:57:29] (03CR) 10Jcrespo: [C: 03+1] prometheus mysqld_exporters: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/954220 (owner: 10Muehlenhoff) [09:58:08] (03PS10) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) [09:58:45] (03PS2) 10Clément Goubert: P:mediawiki::periodic_job: Add splay parameter [puppet] - 10https://gerrit.wikimedia.org/r/954249 
(https://phabricator.wikimedia.org/T339984) [09:58:47] (03PS3) 10Clément Goubert: P:mediawiki::maintenance: CampaignEvents periodic [puppet] - 10https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984) [09:58:58] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10cmooney) >>! In T329272#9129584, @ayounsi wrote: > I'm curious to know what @cmooney thinks about removing parent/child for... [09:59:10] PROBLEM - Ganeti memory on ganeti1019 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (278306) = 12.7% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [09:59:33] (03CR) 10Elukey: [C: 03+2] admin_ng: allow webhook to watch istio/knative ns and increase resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/954247 (owner: 10Elukey) [10:00:33] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43113/console" [puppet] - 10https://gerrit.wikimedia.org/r/954249 (https://phabricator.wikimedia.org/T339984) (owner: 10Clément Goubert) [10:02:20] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43114/console" [puppet] - 10https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984) (owner: 10Clément Goubert) [10:02:43] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:03:02] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:03:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2059.codfw.wmnet [10:03:11] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[10:03:13] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1060.eqiad.wmnet [10:04:50] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:05:02] (03CR) 10Mvolz: "Tested the regex in regexpal only." [deployment-charts] - 10https://gerrit.wikimedia.org/r/954248 (https://phabricator.wikimedia.org/T329049) (owner: 10Mvolz) [10:05:30] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [10:05:49] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [10:06:19] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [10:07:12] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [10:07:43] (03PS1) 10LSobanski: aptrepo: update gitlab-ce & gitlab-runner to 16.1 [puppet] - 10https://gerrit.wikimedia.org/r/954252 (https://phabricator.wikimedia.org/T345395) [10:09:16] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Xover) Yesterday, while uploading [[ https://commo... [10:10:09] (03CR) 10Jbond: confd: -prefix from confd cli to confd::file instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond) [10:10:16] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a5-codfw.mgmt.codfw.wmnet [10:10:18] jouncebot: nowandnext [10:10:18] For the next 20 hour(s) and 49 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230901T0700) [10:10:18] In 20 hour(s) and 49 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230902T0700) [10:11:20] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1060.eqiad.wmnet [10:11:45] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [10:12:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST deployments) on k8s-mlserve@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:12:38] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [10:12:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2059.codfw.wmnet [10:12:55] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1061.eqiad.wmnet [10:13:01] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2060.codfw.wmnet [10:14:16] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwdebug2001.codfw.wmnet [10:14:25] (03CR) 10EoghanGaffney: [C: 03+1] aptrepo: update gitlab-ce & gitlab-runner to 16.1 [puppet] - 10https://gerrit.wikimedia.org/r/954252 (https://phabricator.wikimedia.org/T345395) (owner: 10LSobanski) [10:14:38] (03CR) 10Jbond: confd: -prefix from confd cli to confd::file instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond) [10:16:19] (03CR) 10LSobanski: [C: 03+2] aptrepo: update gitlab-ce & gitlab-runner to 16.1 [puppet] - 10https://gerrit.wikimedia.org/r/954252 (https://phabricator.wikimedia.org/T345395) (owner: 10LSobanski) [10:17:38] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST deployments) on k8s-mlserve@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
[10:18:45] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwdebug2001.codfw.wmnet [10:19:26] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwdebug2002.codfw.wmnet [10:21:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2060.codfw.wmnet [10:21:26] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1061.eqiad.wmnet [10:22:38] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST deployments) on k8s-mlserve@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:24:34] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwdebug2002.codfw.wmnet [10:24:37] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2061.codfw.wmnet [10:24:41] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1062.eqiad.wmnet [10:24:58] (03PS1) 10Elukey: knative-serving: disable HPA by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/954256 [10:25:24] (03CR) 10Alexandros Kosiaris: [C: 03+1] services_proxy: Lower wikifunctions timeout to 15s [puppet] - 10https://gerrit.wikimedia.org/r/954219 (owner: 10JMeybohm) [10:25:30] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a6-codfw.mgmt.codfw.wmnet [10:25:32] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [10:28:24] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [10:29:18] (03CR) 10Elukey: [C: 03+2] knative-serving: disable HPA by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/954256 (owner: 10Elukey) [10:30:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)- https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:30:27] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a6-codfw - cmooney@cumin1001" [10:31:14] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a6-codfw - cmooney@cumin1001" [10:31:14] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:32:23] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:32:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2061.codfw.wmnet [10:33:02] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1062.eqiad.wmnet [10:33:58] (03PS1) 10Muehlenhoff: Simplify IPMI check [puppet] - 10https://gerrit.wikimedia.org/r/954257 [10:34:04] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:34:44] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [10:35:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)- https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:35:25] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[10:35:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2062.codfw.wmnet [10:35:57] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1063.eqiad.wmnet [10:39:00] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43115/console" [puppet] - 10https://gerrit.wikimedia.org/r/954072 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar) [10:40:18] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:40:46] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [10:41:22] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 22.39 ms [10:42:06] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2062.codfw.wmnet [10:42:25] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [10:42:25] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2063.codfw.wmnet [10:43:22] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [10:44:40] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1063.eqiad.wmnet [10:45:18] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:46:02] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet [10:51:40] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[10:51:44] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [10:53:20] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [10:53:41] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [10:54:18] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1064.eqiad.wmnet [10:55:23] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [10:56:51] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:56:59] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:57:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:58:07] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:58:19] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:58:57] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:59:37] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [11:00:32] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10joanna_borun) Approved [11:00:58] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [11:01:01] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [11:01:21] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:01:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2063.codfw.wmnet [11:01:31] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[11:02:22] (03PS4) 10JMeybohm: services_proxy: Lower wikifunctions timeout to 15.5s [puppet] - 10https://gerrit.wikimedia.org/r/954219 [11:02:23] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [11:02:41] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a6-codfw.mgmt.codfw.wmnet [11:02:46] (03PS1) 10Hashar: tox: allow /bin/sh for tslua environment [puppet] - 10https://gerrit.wikimedia.org/r/954261 (https://phabricator.wikimedia.org/T345152) [11:03:17] (03CR) 10Jbond: [C: 03+2] run-puppet-agent: drop deprecated ignorecache switch [puppet] - 10https://gerrit.wikimedia.org/r/953985 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [11:03:51] (03CR) 10JMeybohm: [C: 03+2] services_proxy: Lower wikifunctions timeout to 15.5s [puppet] - 10https://gerrit.wikimedia.org/r/954219 (owner: 10JMeybohm) [11:04:18] (03PS1) 10Alexandros Kosiaris: toolhub: Update to utilize MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/954262 (https://phabricator.wikimedia.org/T340843) [11:04:20] (03PS1) 10Alexandros Kosiaris: toolhub: Remove hardcoded dbproxies, enable MariaDB egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/954263 (https://phabricator.wikimedia.org/T340843) [11:04:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST deployments) on k8s-mlserve@eqiad- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:05:00] (03CR) 10CI reject: [V: 04-1] toolhub: Update to utilize MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/954262 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [11:05:02] (03CR) 10CI reject: [V: 04-1] toolhub: Remove hardcoded dbproxies, enable MariaDB egress 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/954263 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [11:07:17] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10jcrespo) [11:07:49] (03CR) 10Jcrespo: [C: 03+1] "We got approval: https://phabricator.wikimedia.org/T345343#9136409" [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb) [11:07:58] (03PS8) 10Jcrespo: admin: Add arnaudb to root user group [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb) [11:08:07] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [11:09:21] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10cmooney) Just an update. The cookbook is now working to both add the initial configuration and upgrade/downgrade the devi... 
[11:09:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST deployments) on k8s-mlserve@eqiad- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] P:wmcs::kubeadm: remove version defaults [puppet] - 10https://gerrit.wikimedia.org/r/953577 (owner: 10Majavah) [11:14:09] (03PS2) 10Alexandros Kosiaris: toolhub: Update to utilize MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/954262 (https://phabricator.wikimedia.org/T340843) [11:14:11] (03PS2) 10Alexandros Kosiaris: toolhub: Remove hardcoded dbproxies, enable MariaDB egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/954263 (https://phabricator.wikimedia.org/T340843) [11:14:55] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10MatthewVernon) I restarted the swift frontends, an... [11:15:03] (03CR) 10Jcrespo: [C: 03+2] admin: Add arnaudb to root user group [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb) [11:15:20] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+1] gitlab: project_features > default_projects_features [puppet] - 10https://gerrit.wikimedia.org/r/954072 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar) [11:17:12] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Puppet certificate missing subjectAltName - https://phabricator.wikimedia.org/T158757 (10jbond) >>! In T158757#9135499, @nshahquinn-wmf wrote: >> however the correct fix for this is to migrate any services relying on the puppet agent certificates for TLS to... 
[11:23:21] (03PS1) 10Majavah: nagios_common: run tests with Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954264 (https://phabricator.wikimedia.org/T345152) [11:23:23] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10jcrespo) a:05joanna_borun→03ABran-WMF Change has been deployed- @ABran-WMF please follow instructions at https://wikitech.wikimedia.org/wiki/SRE/Production_access#Sett... [11:26:17] (03PS1) 10Alexandros Kosiaris: ipoid: Update to utilize MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/954265 (https://phabricator.wikimedia.org/T340843) [11:26:19] (03PS1) 10Alexandros Kosiaris: ipoid: Remove hardcoded dbproxy stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/954266 (https://phabricator.wikimedia.org/T340843) [11:26:59] (03CR) 10CI reject: [V: 04-1] ipoid: Update to utilize MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/954265 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [11:27:01] (03CR) 10CI reject: [V: 04-1] ipoid: Remove hardcoded dbproxy stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/954266 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [11:29:40] (03PS1) 10Majavah: tox: Assume *.py files without a shebang are Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954267 [11:30:53] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (10MoritzMuehlenhoff) One other option would be to simply start with a fresh, parallel setup and skip Bullseye entirely: - Create new ldap-rw1001/ldap-rw2001 VMs using Bookworm a... 
[11:31:21] (03PS2) 10Majavah: tox: Assume *.py files without a shebang are Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954267 [11:32:39] 10SRE, 10SRE-Access-Requests: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10jcrespo) Feel also free to propose additional patches to tune your dot files- e.g.: {https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/admin/files/home/jynus... [11:35:17] (03PS3) 10Majavah: taskgen: Assume *.py files without a shebang are Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954267 [11:35:19] (03PS1) 10Majavah: taskgen: Run full tox CI when updating taskgen [puppet] - 10https://gerrit.wikimedia.org/r/954268 [11:37:57] (03PS4) 10Majavah: taskgen: Assume *.py files without a shebang are Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954267 [11:41:11] (03CR) 10CI reject: [V: 04-1] taskgen: Assume *.py files without a shebang are Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954267 (owner: 10Majavah) [11:42:28] (03PS5) 10Majavah: taskgen: Assume *.py files without a shebang are Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954267 [11:42:30] (03PS1) 10Majavah: taskgen: Improve shebang detection logic [puppet] - 10https://gerrit.wikimedia.org/r/954269 [11:44:46] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a7-codfw.mgmt.codfw.wmnet [11:44:48] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [11:45:53] (03CR) 10CI reject: [V: 04-1] taskgen: Assume *.py files without a shebang are Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954267 (owner: 10Majavah) [11:47:04] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2064.codfw.wmnet [11:47:13] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1065.eqiad.wmnet [11:49:35] (03PS1) 10Majavah: graphite: Drop jessie/stretch Python 2 files [puppet] - 10https://gerrit.wikimedia.org/r/954271 [11:51:02] 
(03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43117/console" [puppet] - 10https://gerrit.wikimedia.org/r/954271 (owner: 10Majavah) [11:51:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:53:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2064.codfw.wmnet [11:53:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2065.codfw.wmnet [11:55:11] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1065.eqiad.wmnet [11:55:30] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1066.eqiad.wmnet [11:56:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:57:17] (03PS1) 10David Caro: wmcs-replica-cnf: redirects mock services stderr to stdout [puppet] - 10https://gerrit.wikimedia.org/r/954272 [11:59:00] PROBLEM - Ganeti memory on ganeti1019 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (278306) = 12.7% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [11:59:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)- https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:00:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2065.codfw.wmnet [12:00:12] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet [12:01:18] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:02:46] (03PS1) 10Muehlenhoff: lxc: Remove obsolete files [puppet] - 10https://gerrit.wikimedia.org/r/954274 [12:03:08] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1066.eqiad.wmnet [12:03:16] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1067.eqiad.wmnet [12:04:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:07:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet [12:07:31] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet [12:09:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:10:05] (03PS2) 10Alexandros Kosiaris: ipoid: Update to utilize MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/954265 (https://phabricator.wikimedia.org/T340843) [12:10:07] (03PS2) 10Alexandros Kosiaris: ipoid: Remove hardcoded dbproxy stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/954266 (https://phabricator.wikimedia.org/T340843) [12:12:38] (03CR) 10Alexandros Kosiaris: [C: 03+2] toolhub: Update to utilize MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/954262 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:12:42] (03CR) 10Alexandros Kosiaris: [C: 03+2] toolhub: Remove hardcoded dbproxies, enable MariaDB egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/954263 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:12:44] (03CR) 10Alexandros Kosiaris: [C: 03+2] ipoid: Update to utilize MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/954265 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:12:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] ipoid: Remove hardcoded dbproxy stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/954266 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:13:24] (03Merged) 10jenkins-bot: toolhub: Update to utilize MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/954262 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:13:28] (03Merged) 10jenkins-bot: toolhub: Remove hardcoded dbproxies, enable MariaDB egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/954263 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:13:36] (03Merged) 
10jenkins-bot: ipoid: Update to utilize MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/954265 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:13:40] (03Merged) 10jenkins-bot: ipoid: Remove hardcoded dbproxy stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/954266 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:14:57] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:15:41] (03PS1) 10Muehlenhoff: etcd: Remove obsolete file [puppet] - 10https://gerrit.wikimedia.org/r/954276 [12:15:43] (03PS1) 10Muehlenhoff: autoinstall: Remove obsolete files [puppet] - 10https://gerrit.wikimedia.org/r/954277 [12:16:15] (03PS1) 10Ayounsi: infra_devices: remove parents for multihomed devices [puppet] - 10https://gerrit.wikimedia.org/r/954278 (https://phabricator.wikimedia.org/T329272) [12:16:37] (03PS2) 10Ayounsi: infra_devices: remove parents for multihomed devices [puppet] - 10https://gerrit.wikimedia.org/r/954278 (https://phabricator.wikimedia.org/T329272) [12:16:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/954271 (owner: 10Majavah) [12:17:01] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1067.eqiad.wmnet [12:17:02] (03CR) 10Jbond: [C: 03+1] Simplify IPMI check [puppet] - 10https://gerrit.wikimedia.org/r/954257 (owner: 10Muehlenhoff) [12:17:13] (03PS2) 10Hashar: wmcs-replica-cnf: redirects mock services stderr to stdout [puppet] - 10https://gerrit.wikimedia.org/r/954272 (https://phabricator.wikimedia.org/T345152) (owner: 10David Caro) [12:18:03] (03CR) 10Ayounsi: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/954278 (https://phabricator.wikimedia.org/T329272) (owner: 10Ayounsi) [12:18:15] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1068.eqiad.wmnet [12:18:22] (03CR) 10Hashar: [C: 03+1] "Tested locally with tox 4.8.0 and the tox red coloring for stderr is gone :]" [puppet] - 10https://gerrit.wikimedia.org/r/954272 (https://phabricator.wikimedia.org/T345152) (owner: 10David Caro) [12:18:57] (03CR) 10Majavah: [V: 03+1 C: 03+2] graphite: Drop jessie/stretch Python 2 files [puppet] - 10https://gerrit.wikimedia.org/r/954271 (owner: 10Majavah) [12:20:04] (03CR) 10David Caro: [C: 03+2] wmcs-replica-cnf: redirects mock services stderr to stdout [puppet] - 10https://gerrit.wikimedia.org/r/954272 (https://phabricator.wikimedia.org/T345152) (owner: 10David Caro) [12:20:12] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (10akosiaris) >>! In T331699#9136475, @MoritzMuehlenhoff wrote: > One other option would be to simply start with a fresh, parallel setup and skip Bullseye entirely: > > - Create... 
[12:21:07] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/954268 (owner: 10Majavah) [12:21:55] (03PS1) 10Muehlenhoff: Remove obsolete OS conditionals [puppet] - 10https://gerrit.wikimedia.org/r/954279 [12:22:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:23:34] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954279 (owner: 10Muehlenhoff) [12:23:35] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [12:23:46] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [12:23:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2067.codfw.wmnet [12:24:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2068.codfw.wmnet [12:24:15] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [12:24:58] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [12:25:31] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [12:25:51] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [12:25:53] (03CR) 10Jbond: "lgtm but see inline for improvement" [puppet] - 10https://gerrit.wikimedia.org/r/954269 (owner: 10Majavah) [12:26:18] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:28:55] (03CR) 10Majavah: [C: 03+2] taskgen: Run full tox CI when updating taskgen [puppet] - 10https://gerrit.wikimedia.org/r/954268 (owner: 10Majavah) 
[12:29:23] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudservices1004: eno1 has interface errors - https://phabricator.wikimedia.org/T345430 (10aborrero) [12:30:35] (03CR) 10Jbond: taskgen: Improve shebang detection logic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954269 (owner: 10Majavah) [12:31:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2068.codfw.wmnet [12:31:11] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1068.eqiad.wmnet [12:31:47] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [12:31:52] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [12:31:58] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [12:32:22] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [12:32:24] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1069.eqiad.wmnet [12:32:29] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2069.codfw.wmnet [12:32:43] (03CR) 10Vgutierrez: [C: 03+1] Correct sysctl value for net.ipv4.tcp_min_snd_mss [puppet] - 10https://gerrit.wikimedia.org/r/954120 (https://phabricator.wikimedia.org/T344829) (owner: 10Cathal Mooney) [12:32:57] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudservices1004: eno1 has interface errors - https://phabricator.wikimedia.org/T345430 (10aborrero) hey @Jclark-ctr could you please double check the cable of this server? 
[12:36:03] (03PS1) 10Alexandros Kosiaris: toolhub: Pass extraFQDNs to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/954281 [12:36:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:37:18] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:38:13] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/954261 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar) [12:38:22] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1069.eqiad.wmnet [12:38:25] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a7-codfw - cmooney@cumin1001" [12:38:40] (03CR) 10David Caro: [C: 03+2] tox: allow /bin/sh for tslua environment [puppet] - 10https://gerrit.wikimedia.org/r/954261 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar) [12:39:05] (03CR) 10Jbond: "TBH I think we should just drop this default handler altogether. 
from what i can see the only files that would hit this are the followin" [puppet] - 10https://gerrit.wikimedia.org/r/954267 (owner: 10Majavah) [12:39:14] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a7-codfw - cmooney@cumin1001" [12:39:14] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:39:38] (03CR) 10Cathal Mooney: [C: 03+2] Correct sysctl value for net.ipv4.tcp_min_snd_mss [puppet] - 10https://gerrit.wikimedia.org/r/954120 (https://phabricator.wikimedia.org/T344829) (owner: 10Cathal Mooney) [12:39:47] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/954269 (owner: 10Majavah) [12:40:22] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1070.eqiad.wmnet [12:40:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2069.codfw.wmnet [12:40:52] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] gitlab: project_features > default_projects_features [puppet] - 10https://gerrit.wikimedia.org/r/954072 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar) [12:41:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:41:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] toolhub: Pass extraFQDNs to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/954281 (owner: 10Alexandros Kosiaris) [12:42:17] (03Merged) 10jenkins-bot: toolhub: Pass extraFQDNs to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/954281 (owner: 10Alexandros Kosiaris) [12:44:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:44:29] !log Updated CI Job operations-puppet-tests-buster-docker to use tox 4.8.0 # T345152 [12:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:32] T345152: [ci,operations-puppet] tox does not detect changes inside requirement files - https://phabricator.wikimedia.org/T345152 [12:45:14] (03CR) 10Majavah: [C: 03+2] taskgen: Improve shebang detection logic [puppet] - 10https://gerrit.wikimedia.org/r/954269 (owner: 10Majavah) [12:45:50] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2070.codfw.wmnet [12:48:03] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf,ops for arnaudb - https://phabricator.wikimedia.org/T345241 (10ABran-WMF) ` ~ $ ssh -v bast1003.eqiad.wmnet OpenSSH_9.0p1, OpenSSL 3.0.9 30 May 2023 debug1: Reading configuration data /home/doo/.ssh/config debug1: /home/doo/.ssh/config line 28: Applying option... 
[12:48:10] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:48:41] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1070.eqiad.wmnet [12:49:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:49:22] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:50:06] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf,ops for arnaudb - https://phabricator.wikimedia.org/T345241 (10ABran-WMF) Ultra verbose: ` ~ $ ssh -o identitiesonly=yes -o forwardagent=no -o kbdinteractiveauthentication=no -opasswordauthentication=no -l arnaudb -i ~/.ssh/id_ed25519 -vvvvvvvvv bast4004.wiki... [12:50:13] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [12:51:16] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1071.eqiad.wmnet [12:51:43] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:52:43] (03PS6) 10Majavah: taskgen: Assume *.py files without a shebang are Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954267 [12:52:45] (03PS1) 10Majavah: taskgen: log number of unknown python files [puppet] - 10https://gerrit.wikimedia.org/r/954285 [12:52:48] (03PS1) 10Majavah: taskgen: also match pytest files [puppet] - 10https://gerrit.wikimedia.org/r/954286 [12:53:38] (03CR) 10David Caro: [C: 03+1] "yay! 
🎉" [puppet] - 10https://gerrit.wikimedia.org/r/954102 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [12:54:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:54:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2070.codfw.wmnet [12:54:34] !log Build /releng/operations-puppet:0.9.0 image and now updated the CI Job operations-puppet-tests-buster-docker to use tox 4.8.0 # T345152 [12:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:39] T345152: [ci,operations-puppet] tox does not detect changes inside requirement files - https://phabricator.wikimedia.org/T345152 [12:54:41] dcaro: puppet now has tox 4.8 :] [12:54:46] (03CR) 10Majavah: taskgen: Assume *.py files without a shebang are Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954267 (owner: 10Majavah) [12:54:48] (03PS1) 10Muehlenhoff: pmacct: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/954287 [12:55:15] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2071.codfw.wmnet [12:55:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954287 (owner: 10Muehlenhoff) [12:55:57] (03CR) 10Majavah: [C: 03+1] "looks fine, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/954279 (owner: 10Muehlenhoff) [12:56:01] (03PS1) 10Elukey: admin_ng: raise knative-serving's webhook pods to 4 in prod clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/954288 [12:56:17] (03CR) 10CI reject: [V: 04-1] taskgen: Assume *.py files without a shebang are Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954267 (owner: 10Majavah) [12:56:29] (03PS2) 10Elukey: admin_ng: raise knative-serving's webhook pods to 4 in prod clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/954288 (https://phabricator.wikimedia.org/T344058) [12:58:31] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [12:58:40] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [12:59:40] PROBLEM - Ganeti memory on ganeti1019 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (278306) = 12.7% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [13:00:02] (03CR) 10Jbond: [C: 03+1] "lgtm optional nit" [puppet] - 10https://gerrit.wikimedia.org/r/954286 (owner: 10Majavah) [13:00:29] !log lsobanski@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security release [13:00:47] (03PS2) 10Majavah: taskgen: also match pytest files [puppet] - 10https://gerrit.wikimedia.org/r/954286 [13:00:49] (03PS7) 10Majavah: taskgen: Assume *.py files without a shebang are Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954267 [13:01:00] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:05] (03CR) 10Majavah: taskgen: also match pytest files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954286 (owner: 10Majavah) [13:02:03] (03CR) 10Herron: [C: 03+1] librenms: Add PHP version for Debian Bookworm hosts [puppet] - 10https://gerrit.wikimedia.org/r/954143 
(https://phabricator.wikimedia.org/T344136) (owner: 10Andrea Denisse) [13:02:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2071.codfw.wmnet [13:03:23] (03CR) 10CI reject: [V: 04-1] taskgen: Assume *.py files without a shebang are Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954267 (owner: 10Majavah) [13:04:18] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:04:28] (03PS1) 10Majavah: admin: drop GenSysadminTable.py [puppet] - 10https://gerrit.wikimedia.org/r/954289 [13:06:33] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:07:38] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1071.eqiad.wmnet [13:07:46] hashar: thanks! \o/ [13:08:22] (03CR) 10Ladsgroup: [C: 03+1] prometheus mysqld_exporters: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/954220 (owner: 10Muehlenhoff) [13:10:34] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a7-codfw.mgmt.codfw.wmnet [13:10:42] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove obsolete OS conditionals [puppet] - 10https://gerrit.wikimedia.org/r/954279 (owner: 10Muehlenhoff) [13:11:08] (03CR) 10Filippo Giunchedi: "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/954264 (https://phabricator.wikimedia.org/T345152) (owner: 10Majavah) [13:11:15] (03CR) 10Filippo Giunchedi: [C: 03+1] nagios_common: run tests with Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954264 (https://phabricator.wikimedia.org/T345152) (owner: 10Majavah) [13:11:36] (03PS2) 10Majavah: nagios_common: run tests with Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954264 (https://phabricator.wikimedia.org/T345152) [13:11:49] (03CR) 10Elukey: [C: 03+2] admin_ng: raise knative-serving's webhook pods to 4 in prod clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/954288 (https://phabricator.wikimedia.org/T344058) (owner: 10Elukey) [13:11:51] (03CR) 10Filippo Giunchedi: [C: 03+1] infra_devices: remove parents for multihomed devices [puppet] - 10https://gerrit.wikimedia.org/r/954278 (https://phabricator.wikimedia.org/T329272) (owner: 10Ayounsi) [13:11:57] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/954278 (https://phabricator.wikimedia.org/T329272) (owner: 10Ayounsi) [13:13:53] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [13:14:05] (03CR) 10Majavah: [C: 03+2] nagios_common: run tests with Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954264 (https://phabricator.wikimedia.org/T345152) (owner: 10Majavah) [13:14:11] (03PS1) 10Alexandros Kosiaris: toolhub: make extraFQDNs specific to codfw, eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/954290 [13:15:06] (03PS6) 10Herron: profile::mediawiki::common: include prometheus statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/952894 (https://phabricator.wikimedia.org/T345377) [13:15:09] (03CR) 10Alexandros Kosiaris: "This avoids a replacement of staging.svc.eqiad.wmnet with toolhub.wikimedia.org when deploying the staging release." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/954290 (owner: 10Alexandros Kosiaris) [13:16:02] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:16:36] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:18:21] (03CR) 10Herron: profile::mediawiki::common: include prometheus statsd_exporter (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/952894 (https://phabricator.wikimedia.org/T345377) (owner: 10Herron) [13:18:25] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1072.eqiad.wmnet [13:18:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2072.codfw.wmnet [13:19:32] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:22:09] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:22:24] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10fgiunchedi) >>! In T329272#9129584, @ayounsi wrote: > I'm also curious to know @fgiunchedi if/how alertmanager handles it.... [13:23:14] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:24:04] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:24:26] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[13:24:58] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1072.eqiad.wmnet [13:25:14] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2072.codfw.wmnet [13:25:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2073.codfw.wmnet [13:25:58] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1073.eqiad.wmnet [13:26:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:26:06] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply security updates - bking@cumin1001 - T344587 [13:28:14] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655 (10fnegri) We have quite a few cookbooks in WMCS that are read-only and used to show a cluster status or similar things. Logging to SAL every time someone runs o... 
[13:28:52] (03CR) 10Jbond: "So we have a problem, I checked all the files that currently have no shebang and they all compiled fine with `python3 -m py_compile` howeve" [puppet] - 10https://gerrit.wikimedia.org/r/954267 (owner: 10Majavah) [13:29:54] (03CR) 10Alexandros Kosiaris: Update modules/README.md (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953553 (owner: 10Alexandros Kosiaris) [13:30:00] RECOVERY - Ganeti memory on ganeti1019 is OK: OK Memory 82% used https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [13:30:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)- https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:30:40] 10ops-codfw: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T345356 (10Papaul) 05Open→03Resolved a:03Papaul @cmooney is doing some work on the new switches [13:31:03] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad- https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:31:24] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:31:49] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[13:31:59] (03PS1) 10Jbond: admin: Remove GenSysadminTable.py [puppet] - 10https://gerrit.wikimedia.org/r/954292 [13:32:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2073.codfw.wmnet [13:33:00] PROBLEM - Check systemd state on cloudelastic1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:08] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1073.eqiad.wmnet [13:33:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1001.eqiad.wmnet [13:33:21] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1074.eqiad.wmnet [13:33:46] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[13:34:26] RECOVERY - Check systemd state on cloudelastic1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:43] (SystemdUnitFailed) firing: (4) prometheus-wmf-elasticsearch-exporter-9200.service Failed on cloudelastic1006:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:35:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)- https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:36:20] (03PS8) 10Jbond: taskgen: Assume *.py files without a shebang are Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954267 (owner: 10Majavah) [13:36:22] (03PS1) 10Jbond: udp2log: add correct shebang to help CI [puppet] - 10https://gerrit.wikimedia.org/r/954294 [13:36:48] (03CR) 10CI reject: [V: 04-1] udp2log: add correct shebang to help CI [puppet] - 10https://gerrit.wikimedia.org/r/954294 (owner: 10Jbond) [13:38:47] (03CR) 10CI reject: [V: 04-1] taskgen: Assume *.py files without a shebang are Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/954267 (owner: 10Majavah) [13:39:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)- https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:43] (SystemdUnitFailed) firing: (4) prometheus-wmf-elasticsearch-exporter-9200.service Failed on cloudelastic1006:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:39:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1001.eqiad.wmnet [13:40:03] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1074.eqiad.wmnet [13:40:08] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1002.eqiad.wmnet [13:40:12] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1075.eqiad.wmnet [13:43:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200- https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:43:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/954286 (owner: 10Majavah) [13:44:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)- https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:44:27] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/954294 (owner: 10Jbond) [13:44:43] (SystemdUnitFailed) firing: (7) prometheus-wmf-elasticsearch-exporter-9200.service Failed on cloudelastic1005:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:46:12] (03CR) 10Majavah: [C: 03+1] "looks like we had the same idea :) 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/954289/" [puppet] - 10https://gerrit.wikimedia.org/r/954292 (owner: 10Jbond) [13:46:30] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1075.eqiad.wmnet [13:46:34] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1002.eqiad.wmnet [13:46:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1003.eqiad.wmnet [13:47:21] (03CR) 10Jbond: [C: 03+1] "ahh i missed you already did this lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/954289 (owner: 10Majavah) [13:47:41] (03CR) 10Majavah: [C: 03+2] admin: drop GenSysadminTable.py [puppet] - 10https://gerrit.wikimedia.org/r/954289 (owner: 10Majavah) [13:47:53] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:48:08] (03CR) 10Jbond: admin: Remove GenSysadminTable.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954292 (owner: 10Jbond) [13:48:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200- https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:48:19] (03Abandoned) 10Jbond: admin: Remove GenSysadminTable.py [puppet] - 10https://gerrit.wikimedia.org/r/954292 (owner: 10Jbond) [13:49:19] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[13:49:46] (03PS2) 10Majavah: udp2log: add correct shebang to help CI [puppet] - 10https://gerrit.wikimedia.org/r/954294 (owner: 10Jbond) [13:49:48] (03PS1) 10Majavah: taskgen: update for tox 4 syntax [puppet] - 10https://gerrit.wikimedia.org/r/954297 (https://phabricator.wikimedia.org/T345152) [13:49:56] (03CR) 10Jbond: "not sure what is wrong with ci at the moment" [puppet] - 10https://gerrit.wikimedia.org/r/954294 (owner: 10Jbond) [13:50:24] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:50:33] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:50:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in codfw (k8s) is unstable- https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:51:18] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:51:43] (03PS1) 10Andrew Bogott: cloudservices1006: specify recursor name [puppet] - 10https://gerrit.wikimedia.org/r/954298 [13:53:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1003.eqiad.wmnet [13:53:46] (03CR) 10Jbond: [C: 03+1] infra_devices: remove parents for multihomed devices [puppet] - 10https://gerrit.wikimedia.org/r/954278 (https://phabricator.wikimedia.org/T329272) (owner: 10Ayounsi) [13:53:50] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for 
host thanos-be1004.eqiad.wmnet [13:53:57] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:54:05] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T345380 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm no external sign to which let power. logged in and found PSU2 was having the issues. unplugged and unseated PSU2 for 20 seconds. reinserted. Alert cleared. [13:54:08] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:54:19] (03Abandoned) 10Andrew Bogott: cloudservices1006: specify recursor name [puppet] - 10https://gerrit.wikimedia.org/r/954298 (owner: 10Andrew Bogott) [13:55:33] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:55:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in codfw (k8s) is unstable- https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:55:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes- https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:59:08] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:59:53] (03CR) 10Jbond: [C: 03+1] "thanks was just trying to dig into this 😊, we the hell tox got upgraded to a major version on a friday im not sure, suspect this could bre" [puppet] - 10https://gerrit.wikimedia.org/r/954297 (https://phabricator.wikimedia.org/T345152) (owner: 10Majavah) [14:00:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes- https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [14:01:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1004.eqiad.wmnet [14:02:19] (RdfStreamingUpdaterFlinkJobUnstable) firing: (3) WCQS_Streaming_Updater in codfw (k8s) is unstable- https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:02:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2001.codfw.wmnet [14:03:20] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (3) Processing latency of WCQS_Streaming_Updater in 
eqiad (k8s) is above 10 minutes- https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [14:06:31] (03CR) 10Filippo Giunchedi: [C: 03+1] udp2log: add correct shebang to help CI [puppet] - 10https://gerrit.wikimedia.org/r/954294 (owner: 10Jbond) [14:07:19] (RdfStreamingUpdaterFlinkJobUnstable) firing: (4) WCQS_Streaming_Updater in codfw (k8s) is unstable- https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:07:34] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (4) WCQS_Streaming_Updater in codfw (k8s) is unstable- https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:08:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:10:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (3) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes- https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [14:12:02] 10SRE, 10SRE-Access-Requests: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10ABran-WMF) 05Open→03Resolved [x] Bastion access validated [x] Host access 
validated Thanks, I'll try with the default config to see what needs to be tweaked for me [14:12:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2036.codfw.wmnet with OS bullseye [14:12:31] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2036.codfw.wmnet with OS bullseye [14:15:12] /6 [14:16:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)- https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:16:08] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2001.codfw.wmnet [14:16:50] (03PS1) 10JMeybohm: jeager: Fix GRPC traffic to the collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/954301 (https://phabricator.wikimedia.org/T344253) [14:17:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2002.codfw.wmnet [14:17:38] (03PS2) 10JMeybohm: jeager: Fix GRPC traffic to the collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/954301 (https://phabricator.wikimedia.org/T344253) [14:17:41] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:17:46] (03CR) 10Filippo Giunchedi: [C: 03+1] jeager: Fix GRPC traffic to the collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/954301 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [14:18:32] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[14:18:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:38] (03CR) 10JMeybohm: [C: 03+2] jeager: Fix GRPC traffic to the collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/954301 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [14:20:05] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:20:34] (03Merged) 10jenkins-bot: jeager: Fix GRPC traffic to the collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/954301 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [14:21:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)- https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:21:13] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [14:21:17] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: sync [14:21:21] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: sync [14:21:40] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[14:23:07] !log lsobanski@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security release [14:23:07] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: sync [14:23:11] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: sync [14:23:12] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:24:06] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:24:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:25:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2002.codfw.wmnet [14:28:23] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk2001.codfw.wmnet [14:28:24] !log bking@cumin1001 START - Cookbook sre.dns.netbox [14:29:05] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: sync [14:29:09] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: sync [14:29:38] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:30:00] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[14:30:32] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2001.codfw.wmnet - bking@cumin1001" [14:31:58] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [14:32:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2036.codfw.wmnet with reason: host reimage [14:32:44] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2036.codfw.wmnet with reason: host reimage [14:33:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200- https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:33:22] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [14:34:40] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[14:36:08] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:37:27] (03CR) 10Herron: "something to get the ball rolling, please see related task" [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: 10Herron) [14:38:03] (03PS9) 10Herron: profile::mediawiki::common: set default histogram buckets [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) [14:38:56] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2001.codfw.wmnet - bking@cumin1001" [14:38:56] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:38:56] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk2001.codfw.wmnet on all recursors [14:39:00] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk2001.codfw.wmnet on all recursors [14:39:07] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2036.codfw.wmnet with OS bullseye [14:39:14] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2036.codfw.wmnet with OS bullseye executed with errors: - kubernetes... 
[14:39:24] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2001.codfw.wmnet - bking@cumin1001" [14:39:43] (SystemdUnitFailed) firing: (4) prometheus-wmf-elasticsearch-exporter-9200.service Failed on cloudelastic1003:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:40:16] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2001.codfw.wmnet - bking@cumin1001" [14:41:08] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw- https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:41:34] (03PS3) 10Jbond: udp2log: add correct shebang to help CI [puppet] - 10https://gerrit.wikimedia.org/r/954294 [14:41:56] (03CR) 10CI reject: [V: 04-1] udp2log: add correct shebang to help CI [puppet] - 10https://gerrit.wikimedia.org/r/954294 (owner: 10Jbond) [14:42:32] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk2001.codfw.wmnet with OS bookworm [14:43:52] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/954294 (owner: 10Jbond) [14:44:43] (SystemdUnitFailed) firing: (7) prometheus-wmf-elasticsearch-exporter-9200.service Failed on cloudelastic1003:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:49:43] (SystemdUnitFailed) firing: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on cloudelastic1002:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:49:48] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a8-codfw.mgmt.codfw.wmnet [14:49:50] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [14:52:38] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a8-codfw - cmooney@cumin1001" [14:54:43] (SystemdUnitFailed) firing: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on cloudelastic1002:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:57:30] PROBLEM - Check systemd state on cloudelastic1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:52] ^^ cloudelastic alerts are expected, should clear shortly [14:59:24] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply security updates - bking@cumin1001 - T344587 [14:59:30] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/954294 (owner: 10Jbond) [14:59:43] (SystemdUnitFailed) firing: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on cloudelastic1001:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:00:12] RECOVERY - Check systemd state on cloudelastic1001 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:21] (03PS1) 10Andrew Bogott: cloudservices1006: use the new, internal address for db access [puppet] - 10https://gerrit.wikimedia.org/r/954304 [15:03:12] (03CR) 10Jbond: [C: 03+2] udp2log: add correct shebang to help CI [puppet] - 10https://gerrit.wikimedia.org/r/954294 (owner: 10Jbond) [15:03:59] (03PS2) 10Andrew Bogott: cloudservices1006: use the new, internal address for db access [puppet] - 10https://gerrit.wikimedia.org/r/954304 [15:04:15] (03PS10) 10Elukey: LiftWing: add latency/availability SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [15:04:43] (SystemdUnitFailed) firing: (7) prometheus-wmf-elasticsearch-exporter-9200.service Failed on cloudelastic1001:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:05:35] !log aokoth@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Release [15:05:38] (03CR) 10Andrew Bogott: [C: 03+2] cloudservices1006: use the new, internal address for db access [puppet] - 10https://gerrit.wikimedia.org/r/954304 (owner: 10Andrew Bogott) [15:08:39] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2003.codfw.wmnet [15:11:50] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a8-codfw - cmooney@cumin1001" [15:11:50] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:15:28] (03CR) 10Ahmon Dancy: "I'm okay with this change as long as we still get the email for success or failure." 
[puppet] - 10https://gerrit.wikimedia.org/r/953200 (owner: 10Clément Goubert) [15:15:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2003.codfw.wmnet [15:15:58] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2004.codfw.wmnet [15:20:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:23:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2004.codfw.wmnet [15:25:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:25:15] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10ssingh) Hi @Eevans, > I have also tried upgrading the NIC firmware (from 21.40.21 to 22.31.6) As per https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-speci... [15:31:45] (03CR) 10Cwhite: [C: 03+1] "LGTM! Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/952894 (https://phabricator.wikimedia.org/T345377) (owner: 10Herron) [15:33:26] (03CR) 10Jbond: [C: 03+1] taskgen: update for tox 4 syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954297 (https://phabricator.wikimedia.org/T345152) (owner: 10Majavah) [15:37:25] (03PS1) 10Andrew Bogott: openstack networktests: update test host name to tools-codfw1dev-k8s-worker-2 [puppet] - 10https://gerrit.wikimedia.org/r/954308 [15:38:10] (03CR) 10FNegri: [C: 03+1] openstack networktests: update test host name to tools-codfw1dev-k8s-worker-2 [puppet] - 10https://gerrit.wikimedia.org/r/954308 (owner: 10Andrew Bogott) [15:39:46] (03CR) 10Andrew Bogott: [C: 03+2] openstack networktests: update test host name to tools-codfw1dev-k8s-worker-2 [puppet] - 10https://gerrit.wikimedia.org/r/954308 (owner: 10Andrew Bogott) [15:43:01] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a8-codfw.mgmt.codfw.wmnet [15:51:49] (03PS1) 10Andrew Bogott: codfw1dev network tests: update auth dns IP [puppet] - 10https://gerrit.wikimedia.org/r/954310 [15:52:50] (03CR) 10FNegri: [C: 03+1] codfw1dev network tests: update auth dns IP [puppet] - 10https://gerrit.wikimedia.org/r/954310 (owner: 10Andrew Bogott) [15:53:34] (03PS2) 10Andrew Bogott: codfw1dev network tests: update dns server IPs [puppet] - 10https://gerrit.wikimedia.org/r/954310 [15:53:47] db1201 page [15:54:01] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev network tests: update dns server IPs [puppet] - 10https://gerrit.wikimedia.org/r/954310 (owner: 10Andrew Bogott) [15:54:06] probably the expired one [15:54:20] hmm, last update in SAL is that it's pooled? [15:54:20] db1201 looks up to me [15:54:28] I'd say it's resolved? 
[15:54:49] yeah [15:54:57] but the intermittent flap is probably not nice [15:55:06] Amir1: ^ we just got paged for db1201 as an FYI [15:55:28] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host doh5001.wikimedia.org with OS bookworm [15:55:38] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host doh5001.wikimedia.org with OS bookworm [15:56:07] sukhe: pfw3-codfw will re-page in about a half-hour too. Is it still being worked on? [15:56:27] cwhite: that was definitely a flap, we can resolve that [15:57:09] cool, done :) [15:57:12] thank you! [15:57:26] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk2001.codfw.wmnet with OS bookworm [15:57:26] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk2001.codfw.wmnet [16:00:11] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:02:06] ^ expected, brett is reimaging [16:03:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:04:31] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Eevans) >>! In T344259#9136979, @ssingh wrote: > Hi @Eevans, > >> I have also tried upgrading the NIC firmware (from 21.40.21 to 22.31.6) > > As per https://wikitech.wik... 
[16:10:51] (03PS1) 10Andrew Bogott: codfw1dev network tests: use a new 'bastion' host [puppet] - 10https://gerrit.wikimedia.org/r/954314 [16:11:21] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev network tests: use a new 'bastion' host [puppet] - 10https://gerrit.wikimedia.org/r/954314 (owner: 10Andrew Bogott) [16:11:25] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:14:43] (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:18:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2037.codfw.wmnet with OS bullseye [16:18:57] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2037.codfw.wmnet with OS bullseye [16:19:01] !log T343983 Ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=amwiki --logwiki=metawiki Jean-Mahmood User92259453 [16:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:04] T343983: Error: Call to a member function getTimestamp() on null - https://phabricator.wikimedia.org/T343983 [16:19:39] sorry I was afk [16:19:44] let me take a look sukhe [16:19:57] thanks! 
seemed like it was intermittent and resolved in a few seconds [16:20:00] but worth checking IMHO [16:20:35] T345271 [16:20:35] T345271: db1201 network down - https://phabricator.wikimedia.org/T345271 [16:20:42] this happened yesterday as well [16:21:22] yeah, I remember this from yesterday [16:21:26] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2037.codfw.wmnet with OS bullseye executed with errors: - kubernetes... [16:21:46] !log aokoth@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Release [16:21:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2036.codfw.wmnet with OS bullseye [16:21:55] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2036.codfw.wmnet with OS bullseye [16:22:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2036.codfw.wmnet with reason: host reimage [16:22:57] RECOVERY - Check systemd state on wdqs1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:58] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2036.codfw.wmnet with reason: host reimage [16:23:30] there is nothing in kern.log and I haven't got an email for it. Maybe we never resolved it and the ack got expired after 24 hours? 
It is basically exactly 24 hours after yesterday's page, sukhe [16:23:35] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2036.codfw.wmnet with OS bullseye [16:23:41] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2036.codfw.wmnet with OS bullseye executed with errors: - kubernetes... [16:23:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:24:29] Amir1: ok, possible. sorry about that, will resolve [16:24:58] just that I saw some work on sal so figured it might be something new but you are right, timing matches [16:25:07] there is nothing in IRC either [16:25:32] I have been doing a lot of automated maint in the past couple of weeks, that muddies the waters. Sorry about that [16:26:32] will check and resolve shortly so that it doesn't page over the weekend. thanks! [16:27:10] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudservices1004: eno1 has interface errors - https://phabricator.wikimedia.org/T345430 (10Jclark-ctr) @aborrero replaced cable. [16:28:14] thanks!
[16:28:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200- https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:29:28] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Papaul) a:05Jhancock.wm→03Papaul [16:30:13] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:19] PROBLEM - config-master.wikimedia.org requires authentication on config-master2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:33:15] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:33:33] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:36:01] (03PS1) 10Majavah: O:config_master: enable mod_ssl [puppet] - 10https://gerrit.wikimedia.org/r/954316 (https://phabricator.wikimedia.org/T345452) [16:37:07] (03PS2) 10Majavah: O:config_master: enable mod_ssl [puppet] - 10https://gerrit.wikimedia.org/r/954316 (https://phabricator.wikimedia.org/T345452) [16:37:21] (03CR) 10Jbond: [C: 03+2] "thanks not sure how this got missed :/" [puppet] - 10https://gerrit.wikimedia.org/r/954316 (https://phabricator.wikimedia.org/T345452) (owner: 10Majavah) [16:40:13] (03PS1) 10Arturo 
Borrero Gonzalez: cloudlb: haproxy: mysql: expose tcp port to all internal networks [puppet] - 10https://gerrit.wikimedia.org/r/954317 (https://phabricator.wikimedia.org/T345240) [16:40:23] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:40:27] RECOVERY - config-master.wikimedia.org requires authentication on config-master2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:44:08] (03PS2) 10Arturo Borrero Gonzalez: cloudlb: haproxy: mysql: expose tcp port to all internal networks [puppet] - 10https://gerrit.wikimedia.org/r/954317 (https://phabricator.wikimedia.org/T345240) [16:44:12] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954317 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [16:49:54] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1030.eqiad.wmnet'] [16:50:26] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh5001.wikimedia.org with reason: host reimage [16:50:28] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on doh5001.wikimedia.org with reason: host reimage [16:50:29] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase1030.eqiad.wmnet'] [16:53:57] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh5001.wikimedia.org with reason: host reimage [16:53:58] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on doh5001.wikimedia.org with reason: host reimage [16:54:07] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for 
ElasticSearch cluster search_codfw: apply security updates - bking@cumin1001 - T344587 [16:55:49] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b2-codfw.mgmt.codfw.wmnet [16:55:51] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:58:02] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b2-codfw - cmooney@cumin1001" [16:58:53] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b2-codfw - cmooney@cumin1001" [16:58:53] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:59:43] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh5001.wikimedia.org with reason: host reimage [16:59:58] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh5001.wikimedia.org with reason: host reimage [17:01:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for cjming - https://phabricator.wikimedia.org/T345455 (10cjming) [17:02:32] (03PS3) 10Arturo Borrero Gonzalez: cloudlb: haproxy: mysql: expose tcp port to cloud-private networks only [puppet] - 10https://gerrit.wikimedia.org/r/954317 (https://phabricator.wikimedia.org/T345240) [17:03:07] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954317 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [17:03:26] (03PS4) 10Arturo Borrero Gonzalez: cloudlb: haproxy: mysql: expose tcp port to cloud-private networks only [puppet] - 10https://gerrit.wikimedia.org/r/954317 (https://phabricator.wikimedia.org/T345240) [17:03:58] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm but I didn't audit what the cloud_private list actually is" [puppet] - 10https://gerrit.wikimedia.org/r/954317 
(https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [17:04:31] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954317 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [17:04:39] PROBLEM - Check systemd state on elastic2045 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:42] (SystemdUnitFailed) firing: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2045:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:04:57] PROBLEM - Check systemd state on elastic2065 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:41] RECOVERY - Check systemd state on elastic2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:57] RECOVERY - Check systemd state on elastic2065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:21] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host doh5001.wikimedia.org with OS bookworm [17:06:33] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host doh5001.wikimedia.org with OS bookworm executed with errors: - doh5001 (**FAIL**) - Downtimed o... 
[17:06:39] (03PS5) 10Arturo Borrero Gonzalez: cloudlb: haproxy: mysql: expose tcp port to cloud-private networks only [puppet] - 10https://gerrit.wikimedia.org/r/954317 (https://phabricator.wikimedia.org/T345240) [17:07:10] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [17:07:56] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954317 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [17:09:42] (SystemdUnitFailed) resolved: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2045:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:11:37] !log aokoth@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Security Release [17:11:46] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add entries for new spine links. - cmooney@cumin1001" [17:12:51] PROBLEM - Check systemd state on elastic2046 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:03] PROBLEM - Check systemd state on elastic2059 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:25] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add entries for new spine links. 
- cmooney@cumin1001" [17:13:25] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:13:59] RECOVERY - Check systemd state on elastic2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:11] RECOVERY - Check systemd state on elastic2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:12] (SystemdUnitFailed) firing: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2045:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:15:27] (SystemdUnitFailed) resolved: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2045:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:16:02] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for cjming - https://phabricator.wikimedia.org/T345455 (10odimitrijevic) Approved [17:17:19] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:18:00] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2036'] [17:18:10] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2036'] [17:18:14] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2036'] [17:18:23] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:18:42] !log jhancock@cumin2002 END (FAIL) - 
Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2036'] [17:19:36] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for doh5001.wikimedia.org [17:19:37] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for doh5001.wikimedia.org [17:20:13] (SystemdUnitFailed) firing: (15) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2045:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:21:04] !incidents [17:21:05] No incidents occurred in the past 24 hours for team SRE [17:22:22] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [17:23:03] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Jhancock.wm) [17:25:04] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Jhancock.wm) got 25-39 provisioned without errors. 25-39 had bios and idrac firmware upgraded without errors. did not downgrade NIC. got about mismatched component. But versi... 
[17:25:13] (SystemdUnitFailed) resolved: (15) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2045:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:29:41] PROBLEM - Check systemd state on elastic2067 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:29:41] PROBLEM - Check systemd state on elastic2060 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:30:13] (SystemdUnitFailed) firing: (15) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2046:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:30:13] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-b2-codfw.mgmt.codfw.wmnet [17:31:07] RECOVERY - Check systemd state on elastic2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:07] RECOVERY - Check systemd state on elastic2060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:10] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host doh3004.wikimedia.org with OS bookworm [17:31:20] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was 
started by brett@cumin2002 for host doh3004.wikimedia.org with OS bookworm [17:35:01] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:35:13] (SystemdUnitFailed) resolved: (9) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2051:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:36:03] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:37:55] PROBLEM - Check systemd state on elastic2085 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:31] ^me [17:38:33] well, not the elastic [17:38:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:39:11] brett: you can be what you want to be! [17:39:23] RECOVERY - Check systemd state on elastic2085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:40:02] Younger? [17:40:27] (SystemdUnitFailed) firing: (15) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2051:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:40:30] ha! 
[17:41:12] (SystemdUnitFailed) resolved: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2051:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:45:23] PROBLEM - Check systemd state on elastic2068 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:32] (SystemdUnitFailed) firing: (16) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2050:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:46:40] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b3-codfw.mgmt.codfw.wmnet [17:46:41] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [17:46:49] RECOVERY - Check systemd state on elastic2068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:52] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b3-codfw - cmooney@cumin1001" [17:49:42] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b3-codfw - cmooney@cumin1001" [17:49:42] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:50:27] (SystemdUnitFailed) resolved: (16) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2050:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:53:01] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh3004.wikimedia.org with reason: host reimage [17:53:03] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on doh3004.wikimedia.org with reason: host reimage [17:53:14] brett: downtiming is failing as well? [17:53:39] yeah [17:53:52] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh3004.wikimedia.org with reason: host reimage [17:53:56] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh3004.wikimedia.org with reason: host reimage [17:54:03] oh ok, works now [17:54:35] PROBLEM - Check systemd state on elastic2038 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:54:35] PROBLEM - Check systemd state on elastic2069 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:55:01] PROBLEM - Check systemd state on elastic2056 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:55:27] (SystemdUnitFailed) firing: (16) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2038:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:55:59] RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:25] RECOVERY - Check systemd state on elastic2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:58:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2025.codfw.wmnet with OS bullseye [17:58:58] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye [18:00:15] RECOVERY - Check systemd state on elastic2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:27] (SystemdUnitFailed) resolved: (10) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2038:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:02:13] PROBLEM - Check systemd state on elastic2062 is CRITICAL: CRITICAL - 
starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:35] PROBLEM - Check systemd state on elastic2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:03:19] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: partial-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:03:59] RECOVERY - Check systemd state on elastic2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:08] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host doh3004.wikimedia.org with OS bookworm [18:04:20] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host doh3004.wikimedia.org with OS bookworm executed with errors: - doh3004 (**FAIL**) - Downtimed o... 
[18:05:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2026.codfw.wmnet with OS bullseye [18:06:07] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2026.codfw.wmnet with OS bullseye [18:08:58] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:09:43] PROBLEM - Check systemd state on elastic2040 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:10:43] RECOVERY - Check systemd state on elastic2062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:11:09] RECOVERY - Check systemd state on elastic2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:11:12] (SystemdUnitFailed) firing: (16) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2037:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:11:49] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:12:23] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:14:58] 
10SRE-swift-storage, 10Commons: Some or all of the undeletion failed: The file "mwstore://local-multiwrite/local-public/d/d7/Elizabeth_Sombart,_February,_2023.jpg" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T331800 (10Bugreporter) [18:16:12] (SystemdUnitFailed) resolved: (16) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2037:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:16:58] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for doh3004.wikimedia.org [18:16:59] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for doh3004.wikimedia.org [18:17:55] PROBLEM - Check systemd state on elastic2077 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:21] PROBLEM - Check systemd state on elastic2079 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2027.codfw.wmnet with OS bullseye [18:19:19] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2027.codfw.wmnet with OS bullseye [18:21:06] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-b3-codfw.mgmt.codfw.wmnet [18:22:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices1006.eqiad.wmnet with OS bullseye [18:22:47] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudservices1006.eqiad.wmnet with OS bullseye [18:22:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudservices1006.eqiad.wmnet with OS bullseye [18:22:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudservices1006.eqiad.wmnet with OS bullseye exe... 
[18:23:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2028.codfw.wmnet with OS bullseye [18:23:42] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2028.codfw.wmnet with OS bullseye [18:25:07] PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:19] PROBLEM - Check systemd state on elastic2041 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:25] RECOVERY - Check systemd state on elastic2079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:12] (SystemdUnitFailed) firing: (16) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2040:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:26:23] RECOVERY - Check systemd state on elastic2077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:33] RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:45] RECOVERY - Check systemd state on elastic2041 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:50] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.reboot [18:31:12] (SystemdUnitFailed) resolved: (16) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2040:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:31:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2029.codfw.wmnet with OS bullseye [18:31:36] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2029.codfw.wmnet with OS bullseye [18:32:43] PROBLEM - Check systemd state on elastic2078 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:33:03] PROBLEM - Check systemd state on elastic2063 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:34:11] RECOVERY - Check systemd state on elastic2078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:34:24] (ProbeDown) firing: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip6)- https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:34:29] RECOVERY - Check systemd state on elastic2063 is OK: 
OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:34:37] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 3326 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [18:35:16] !log aokoth@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Security Release [18:36:05] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 120596 bytes in 1.368 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [18:39:00] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b4-codfw.mgmt.codfw.wmnet [18:39:02] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [18:39:24] (ProbeDown) resolved: (2) Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4)- https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:42:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2027.codfw.wmnet with reason: host reimage [18:42:05] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2027.codfw.wmnet with reason: host reimage [18:42:27] (SystemdUnitFailed) firing: (16) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2041:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:43:12] (SystemdUnitFailed) resolved: (16) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2041:9100- 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:46:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2028.codfw.wmnet with reason: host reimage [18:46:59] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2028.codfw.wmnet with reason: host reimage [18:47:41] PROBLEM - Check systemd state on elastic2042 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:12] (SystemdUnitFailed) firing: (14) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2042:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:48:42] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host doh3003.wikimedia.org with OS bookworm [18:48:45] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2027.codfw.wmnet with OS bullseye [18:48:50] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [18:48:52] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host doh3003.wikimedia.org with OS bookworm [18:48:55] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host 
kubernetes2027.codfw.wmnet with OS bullseye executed with errors: - kubernetes20... [18:49:09] RECOVERY - Check systemd state on elastic2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2029.codfw.wmnet with reason: host reimage [18:51:05] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2029.codfw.wmnet with reason: host reimage [18:51:37] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors- https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [18:52:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner- https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:52:27] (SystemdUnitFailed) resolved: (14) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2042:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:52:49] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:52:55] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:53:46] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage 
(exit_code=99) for host kubernetes2028.codfw.wmnet with OS bullseye [18:53:53] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2028.codfw.wmnet with OS bullseye executed with errors: - kubernetes20... [18:55:19] PROBLEM - Check systemd state on elastic2083 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:55:35] PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:56:37] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors- https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [18:56:45] RECOVERY - Check systemd state on elastic2083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:03] RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:16] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - jobrunner- https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:57:49] !log pt1979@cumin2002 END 
(FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2029.codfw.wmnet with OS bullseye [18:57:56] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2029.codfw.wmnet with OS bullseye executed with errors: - kubernetes20... [18:58:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:03:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:04:12] (SystemdUnitFailed) firing: (15) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2042:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:04:57] PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:57] PROBLEM - Check systemd state on elastic2084 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:05:59] RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:06:01] RECOVERY - Check systemd state on elastic2084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:27] (SystemdUnitFailed) resolved: (15) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2042:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:09:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2027.codfw.wmnet with OS bullseye [19:09:31] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2027.codfw.wmnet with OS bullseye [19:10:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2027.codfw.wmnet with reason: host reimage [19:10:45] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2027.codfw.wmnet with reason: host reimage [19:11:24] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2027.codfw.wmnet with OS bullseye [19:11:32] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2027.codfw.wmnet with OS bullseye executed with errors: - kubernetes20... 
[19:11:46] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Eevans) From IRC: ` 1:29 PM urandom: looking at the server in netbox is looks like it racked in a 10G rack or it connected using 1g so in the pass when we had th... [19:12:51] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh3003.wikimedia.org with reason: host reimage [19:12:53] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on doh3003.wikimedia.org with reason: host reimage [19:14:27] (SystemdUnitFailed) firing: (13) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2047:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:19:12] (SystemdUnitFailed) firing: (15) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2052:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:19:41] PROBLEM - Check systemd state on elastic2076 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:21] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2025.codfw.wmnet with OS bullseye [19:20:27] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye executed with errors: - kubernetes20... 
[19:20:30] (CR) Andrea Denisse: [C: +2] librenms: Add PHP version for Debian Bookworm hosts [puppet] - https://gerrit.wikimedia.org/r/954143 (https://phabricator.wikimedia.org/T344136) (owner: Andrea Denisse) [19:21:01] RECOVERY - Check systemd state on elastic2076 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:23:16] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2026.codfw.wmnet with OS bullseye [19:23:24] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2026.codfw.wmnet with OS bullseye executed with errors: - kubernetes20... [19:23:46] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host doh3003.wikimedia.org with OS bookworm [19:23:55] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host doh3003.wikimedia.org with OS bookworm executed with errors: - doh3003 (**FAIL**) - Downtimed o...
[19:24:12] (SystemdUnitFailed) resolved: (15) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2052:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:26:49] PROBLEM - Check systemd state on elastic2073 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:55] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: apply security updates - bking@cumin1001 - T344587 [19:28:15] RECOVERY - Check systemd state on elastic2073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:28:41] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b4-codfw - cmooney@cumin1001" [19:30:00] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on doh3003.wikimedia.org with reason: host reimage [19:30:13] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh3003.wikimedia.org with reason: host reimage [19:32:02] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b4-codfw - cmooney@cumin1001" [19:32:02] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:37:27] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:37:53] 
RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:47:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2025.codfw.wmnet with OS bullseye [19:47:23] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye [19:48:48] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:51:22] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add entries for new switch links codfw - cmooney@cumin1001" [19:52:46] (SystemdUnitFailed) firing: nginx.service Failed on wdqs1006:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:56:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add entries for new switch links codfw - cmooney@cumin1001" [19:56:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:56:33] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:57:42] (SystemdUnitFailed) resolved: nginx.service Failed on wdqs1006:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:59:09] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add entries for new switch links codfw - cmooney@cumin1001" [20:00:14] !log cmooney@cumin1001 END (PASS) - 
Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add entries for new switch links codfw - cmooney@cumin1001" [20:00:14] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:03:20] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-b4-codfw.mgmt.codfw.wmnet [20:04:03] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b5-codfw.mgmt.codfw.wmnet [20:04:04] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [20:11:22] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for doh3003.wikimedia.org [20:11:23] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for doh3003.wikimedia.org [20:15:05] 10SRE, 10Infrastructure-Foundations: Cookbook sre.puppet.sync-netbox-hiera sets 'public' var for all IPv6 GUA to true - https://phabricator.wikimedia.org/T345473 (10cmooney) p:05Triage→03Low [20:25:42] !log robh@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 173 [20:26:10] !log robh@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 173 [20:28:42] (SystemdUnitFailed) firing: nginx.service Failed on wdqs1007:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:33:42] (SystemdUnitFailed) resolved: nginx.service Failed on wdqs1007:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:34:12] (SystemdUnitFailed) firing: (2) nginx.service Failed on wdqs1007:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:39:12] (SystemdUnitFailed) resolved: (2) nginx.service Failed on wdqs1007:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:46:18] SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney) @papaul thanks for the work documenting the cable IDs. I've put the ones from above in Netbox now. There is one discrepancy, the same label is listed for two... [20:49:13] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2025.codfw.wmnet with OS bullseye [20:49:19] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2025.codfw.wmnet with OS bullseye executed with errors: - kubernetes20...
[20:53:12] (SystemdUnitFailed) firing: (2) nginx.service Failed on wdqs1012:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:56:53] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b5-codfw - cmooney@cumin1001" [20:57:42] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b5-codfw - cmooney@cumin1001" [20:57:42] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:58:12] (SystemdUnitFailed) resolved: (2) nginx.service Failed on wdqs1012:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:58:27] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b6-codfw.mgmt.codfw.wmnet [20:58:28] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:00:39] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b6-codfw - cmooney@cumin1001" [21:01:27] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b6-codfw - cmooney@cumin1001" [21:01:28] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:01:42] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b7-codfw.mgmt.codfw.wmnet [21:01:44] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:04:01] !log cmooney@cumin1001 START - 
Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b7-codfw - cmooney@cumin1001" [21:04:52] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b7-codfw - cmooney@cumin1001" [21:04:52] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:05:02] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b8-codfw.mgmt.codfw.wmnet [21:05:03] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:08:02] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b8-codfw - cmooney@cumin1001" [21:08:54] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-b8-codfw - cmooney@cumin1001" [21:08:54] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:10:10] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [21:11:22] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.reboot [21:16:42] (SystemdUnitFailed) firing: nginx.service Failed on wdqs1009:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:18:13] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [21:21:42] (SystemdUnitFailed) resolved: nginx.service Failed on wdqs1009:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:21:49] !log 
ryankemper@cumin1001 START - Cookbook sre.wdqs.reboot [21:29:01] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-b5-codfw.mgmt.codfw.wmnet [21:29:07] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (backup2002), Fresh: 129 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [21:29:12] (SystemdUnitFailed) firing: (2) nginx.service Failed on wdqs1009:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:29:27] (SystemdUnitFailed) resolved: nginx.service Failed on wdqs1009:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:29:44] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [21:32:06] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host doh4002.wikimedia.org with OS bookworm [21:32:16] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host doh4002.wikimedia.org with OS bookworm [21:32:45] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-b6-codfw.mgmt.codfw.wmnet [21:33:41] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:33:57] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:36:11] !log cmooney@cumin1001 END 
(PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-b7-codfw.mgmt.codfw.wmnet [21:36:37] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:37:07] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:40:01] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:40:02] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-b8-codfw.mgmt.codfw.wmnet [21:52:41] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh4002.wikimedia.org with reason: host reimage [21:52:43] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on doh4002.wikimedia.org with reason: host reimage [21:54:24] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-a1-codfw [21:54:33] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a1-codfw [21:54:36] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-a2-codfw [21:54:45] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a2-codfw [21:54:49] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-a3-codfw [21:54:57] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Upgrade new codfw switches to Juniper recommended - https://phabricator.wikimedia.org/T341670 (10cmooney) 05Open→03Resolved a:03cmooney All are now upgraded to JUNOS 22.2R3.15. I used the opportunity to test the ZTP cookbook which is workin... 
[21:54:58] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a3-codfw [21:55:00] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-a4-codfw [21:55:03] 10SRE, 10Infrastructure-Foundations, 10netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (10cmooney) [21:55:09] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a4-codfw [21:55:11] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [21:55:11] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-a5-codfw [21:55:20] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a5-codfw [21:55:23] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-a6-codfw [21:55:32] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a6-codfw [21:55:34] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-a7-codfw [21:55:43] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a7-codfw [21:55:45] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-a8-codfw [21:55:55] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a8-codfw [21:55:57] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-b2-codfw [21:56:06] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b2-codfw [21:56:08] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-b3-codfw [21:56:17] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b3-codfw [21:56:19] !log cmooney@cumin1001 
START - Cookbook sre.network.tls for network device lsw1-b4-codfw [21:56:29] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b4-codfw [21:56:31] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-b5-codfw [21:56:40] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b5-codfw [21:56:42] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-b6-codfw [21:56:51] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b6-codfw [21:56:54] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-b7-codfw [21:57:03] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b7-codfw [21:57:05] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-b8-codfw [21:57:14] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b8-codfw [22:02:19] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host doh4002.wikimedia.org with OS bookworm [22:02:29] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host doh4002.wikimedia.org with OS bookworm executed with errors: - doh4002 (**FAIL**) - Downtimed o... 
[22:03:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:12:15] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T345480 (10phaultfinder) [22:13:05] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:14:01] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:22:18] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for doh4002.wikimedia.org [22:22:19] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for doh4002.wikimedia.org [22:22:32] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [22:44:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2027.codfw.wmnet with OS bullseye [22:44:11] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2027.codfw.wmnet with OS bullseye [22:44:56] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [22:44:59] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=93) for device lsw1-b5-codfw.mgmt.codfw.wmnet [22:45:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2027.codfw.wmnet with reason: host reimage [22:45:05] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2027.codfw.wmnet with reason: host reimage [22:45:09] !log 
cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [22:45:11] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=93) for device lsw1-b4-codfw.mgmt.codfw.wmnet [22:45:31] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [22:45:33] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=93) for device lsw1-a8-codfw.mgmt.codfw.wmnet [22:45:44] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2027.codfw.wmnet with OS bullseye [22:45:51] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=93) for device lsw1-a5-codfw.mgmt.codfw.wmnet [22:45:52] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2027.codfw.wmnet with OS bullseye executed with errors: - kubernetes20... [22:46:07] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=93) for device lsw1-a4-codfw.mgmt.codfw.wmnet [23:00:08] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Papaul) @akosiaris kubernetes2025 to 2029 was failing durin os install because of puppet, i login to 2025 console and manually did the puppet run i got the error ` Error: Co... [23:01:29] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [23:52:41] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [23:54:57] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for codfw spine switch overlay loopbacks.
- cmooney@cumin1001" [23:55:48] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for codfw spine switch overlay loopbacks. - cmooney@cumin1001" [23:55:48] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)