[00:03:17] PROBLEM - Check systemd state on an-airflow1004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-prometheus-check-certificate-expiry.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:03:29] PROBLEM - Check systemd state on an-airflow1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-prometheus-check-certificate-expiry.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:03:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-prometheus-check-certificate-expiry.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:03] PROBLEM - Check systemd state on an-airflow1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-prometheus-check-certificate-expiry.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:09] PROBLEM - Check systemd state on an-airflow1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-prometheus-check-certificate-expiry.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:15] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [00:07:35] (03PS1) 10Bartosz Dziewoński: Remove 'currentProto'/'finalProto'/'proto' business [extensions/CentralAuth] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967195 (https://phabricator.wikimedia.org/T348852) [00:10:07] (03CR) 10Krinkle: [BETA HACK] Attempt to secure Puppet DB better (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941476 (owner: 10Krinkle) [00:13:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for 
queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [00:18:11] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [00:39:01] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/966834 [00:39:03] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/966834 (owner: 10TrainBranchBot) [00:52:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [00:54:37] 10SRE, 10Maps: Allow Wikimedia Maps usage on Wikimedia Commons Android app - https://phabricator.wikimedia.org/T349280 (10Nicolas_Raoul) Thanks Platonides for the prompt reply! Here is our User-Agent: "Commons/ (https://mediawiki.org/wiki/Apps/Commons) Android/" I d... 
[00:55:18] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:01:05] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/966834 (owner: 10TrainBranchBot) [01:02:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [01:34:45] (03PS2) 10Krinkle: logging: Remove redundant setTimezone() call for UTC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963662 (https://phabricator.wikimedia.org/T99581) [01:34:56] (03PS8) 10Krinkle: logging: Remove useMicrosecondTimestamps(false) calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580096 (https://phabricator.wikimedia.org/T116550) [01:35:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [01:35:41] (03CR) 10Krinkle: logging: Remove useMicrosecondTimestamps(false) calls (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580096 (https://phabricator.wikimedia.org/T116550) (owner: 10Krinkle) [01:45:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [02:08:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [02:13:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [02:38:38] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 3.405242064199047s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:03:02] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bullseye [03:03:10] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye [03:03:38] 
(JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:05:57] (03CR) 10Tim Starling: [C: 03+2] Enable source maps everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966945 (https://phabricator.wikimedia.org/T47514) (owner: 10Tim Starling) [03:06:38] (03Merged) 10jenkins-bot: Enable source maps everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966945 (https://phabricator.wikimedia.org/T47514) (owner: 10Tim Starling) [03:15:46] !log tstarling@deploy2002 Synchronized wmf-config/InitialiseSettings.php: Enable source maps everywhere T47514 (duration: 06m 26s) [03:15:50] T47514: ResourceLoader: Implement support for Source Maps - https://phabricator.wikimedia.org/T47514 [03:20:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.1308272847399663s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:57:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [04:07:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [04:07:15] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:26:05] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10Papaul) @cmooney @Jhancock.wm checked the server, no IP address set on it and she did reset it but it didn't resolve the issue. I asked... [04:28:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [04:53:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [04:55:18] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:20:11] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [05:35:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [05:39:19] (PS1) Marostegui: control-mariadb-10.6-bookworm: MariaDB 10.6.15 on bookworm [software] - https://gerrit.wikimedia.org/r/967331 (https://phabricator.wikimedia.org/T349165) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231020T0600) [06:00:36] (PS1) Marostegui: instances.yaml: Remove db1119 from dbctl [puppet] - https://gerrit.wikimedia.org/r/967332 (https://phabricator.wikimedia.org/T349272) [06:13:06] (CR) Arnaudb: [C: +1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/967332 (https://phabricator.wikimedia.org/T349272) (owner: 10Marostegui) [06:13:29] RECOVERY - Check systemd state on an-airflow1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:17:27] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1119 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/967332 (https://phabricator.wikimedia.org/T349272) (owner: 10Marostegui) [06:18:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1119 from dbctl T349272', diff saved to https://phabricator.wikimedia.org/P53021 and previous config saved to /var/cache/conftool/dbconfig/20231020-061822-marostegui.json [06:18:24] (03PS1) 10Brouberol: Fix: the backend argument is required in our cryptography version [puppet] - 10https://gerrit.wikimedia.org/r/967333 [06:18:27] T349272: Move db1119 to m1 - https://phabricator.wikimedia.org/T349272 [06:20:37] (03PS1) 10Marostegui: mariadb: Move db1119 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/967334 (https://phabricator.wikimedia.org/T349272) [06:25:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [06:26:02] (03CR) 10Arnaudb: [C: 03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/967334 (https://phabricator.wikimedia.org/T349272) (owner: 10Marostegui) [06:26:13] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1119 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/967334 (https://phabricator.wikimedia.org/T349272) (owner: 10Marostegui) [06:27:35] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.6-bookworm: MariaDB 10.6.15 on bookworm [software] - 10https://gerrit.wikimedia.org/r/967331 (https://phabricator.wikimedia.org/T349165) (owner: 10Marostegui) [06:28:22] (03CR) 10Elukey: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967228 (owner: 10Klausman) [06:29:49] (03CR) 10Elukey: [C: 03+1] ml-services: add autoscaling for langid [deployment-charts] - 10https://gerrit.wikimedia.org/r/967230 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [06:35:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [06:50:48] (03CR) 10David Caro: Remove gerrit git from quarry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965514 (https://phabricator.wikimedia.org/T348748) (owner: 10Vivian Rook) [06:51:47] (03CR) 10Majavah: [C: 03+2] striker: Bump container version to 2023-10-19-160227-production [puppet] - 10https://gerrit.wikimedia.org/r/967247 (https://phabricator.wikimedia.org/T348131) (owner: 10BryanDavis) [06:55:53] (03PS2) 10David Caro: quarry: use github remote [puppet] - 10https://gerrit.wikimedia.org/r/965514 (https://phabricator.wikimedia.org/T348748) (owner: 10Vivian Rook) [06:57:24] (03CR) 10David Caro: quarry: use github remote (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965514 (https://phabricator.wikimedia.org/T348748) (owner: 10Vivian Rook) [07:00:05] Deploy window No deploys 
all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231020T0700) [07:03:38] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:07:18] (03CR) 10David Caro: quarry: use github remote (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965514 (https://phabricator.wikimedia.org/T348748) (owner: 10Vivian Rook) [07:16:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [07:17:03] (03CR) 10Gergő Tisza: [C: 03+1] Remove unused $wgIncludeLegacyJavaScript [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966915 (owner: 10Bartosz Dziewoński) [07:21:25] !log increase etherpad1003 CPU and memory (1CPU,1GB -> 2CPU,2GB) - T348386 [07:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:30] T348386: ProbeDown - Etherpad - https://phabricator.wikimedia.org/T348386 [07:21:31] (03CR) 10Gergő Tisza: [C: 03+1] Stop writing to $wgCentralAuthCookieDomain in 'EnterMobileMode' hook (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967302 (owner: 10Bartosz Dziewoński) [07:22:54] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [07:23:36] (03CR) 10Gergő Tisza: [C: 03+1] Replace 'EnterMobileMode' hook with usingMobileDomain() check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967303 (owner: 10Bartosz Dziewoński) [07:24:49] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on etherpad1003.eqiad.wmnet with reason: Reboot to use new CPU and memory config 
[07:25:04] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on etherpad1003.eqiad.wmnet with reason: Reboot to use new CPU and memory config [07:26:40] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add A/PTR for lsw1-f8/ssw links - ayounsi@cumin1001" [07:27:06] PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:27:12] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:27:57] 10SRE: plwiki has no favicon in Google - https://phabricator.wikimedia.org/T349361 (10Msz2001) [07:29:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add A/PTR for lsw1-f8/ssw links - ayounsi@cumin1001" [07:29:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:31:30] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:31:50] dbproxy alerts are to be expected [07:32:21] (03PS2) 10Gergő Tisza: Remove unused $wgIncludeLegacyJavaScript [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966915 (owner: 10Bartosz Dziewoński) [07:32:23] (03PS2) 10Gergő Tisza: Stop writing to $wgCentralAuthCookieDomain in 'EnterMobileMode' hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967302 (owner: 10Bartosz Dziewoński) [07:32:25] (03PS3) 10Gergő Tisza: Replace 'EnterMobileMode' hook with usingMobileDomain() check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967303 (owner: 10Bartosz Dziewoński) [07:32:27] (03PS1) 10Gergő Tisza: CentralAuth: Use second-level domain for cookies for www.* wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) [07:32:30] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.735 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:36:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [07:36:12] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: add autoscaling for langid [deployment-charts] - 10https://gerrit.wikimedia.org/r/967230 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [07:36:59] (03Merged) 10jenkins-bot: ml-services: add autoscaling for langid [deployment-charts] - 10https://gerrit.wikimedia.org/r/967230 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [07:38:42] PROBLEM - Check systemd state on puppetserver1002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:42:50] (03CR) 10Majavah: CentralAuth: Use second-level domain for cookies for www.* wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [07:43:56] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [07:44:16] RECOVERY - Check systemd state on puppetserver1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:09] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (10ayounsi) Haha yeah indeed! 
In theory we should only keep 90 days of logs : https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modu... [08:03:17] (03CR) 10Btullis: [C: 03+1] "Great. The bullseye upgrades aren't far off, but thanks for sorting this in the meantime." [puppet] - 10https://gerrit.wikimedia.org/r/967333 (owner: 10Brouberol) [08:03:53] (03CR) 10Brouberol: [C: 03+2] Fix: the backend argument is required in our cryptography version [puppet] - 10https://gerrit.wikimedia.org/r/967333 (owner: 10Brouberol) [08:06:47] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (10Marostegui) This is the oldest row: ` root@db1164.eqiad.wmnet[librenms]> select timestamp from syslog order by timestamp asc limit 1; +---------------------+ | timestamp... [08:07:15] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:12:43] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, and 2 others: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10dcaro) 05Open→03In progress [08:12:59] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "Long gone: https://codesearch.wmcloud.org/search/?q=IncludeLegacyJavaScript 👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966915 (owner: 10Bartosz Dziewoński) [08:13:25] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10Unstewarded-production-error, 10Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007 (10Beao) It seems like that's been repaired somehow, both thumbnail and file of... 
[08:17:46] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (10jcrespo) Would it be possible to have it on filesystem/kibana only? I don't mind backing it up for persistence, but on db there is extra cost that wouldn't be on filesys... [08:28:32] (03CR) 10JMeybohm: [C: 04-1] images: Add Go 1.21 image, based on bookworm (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967228 (owner: 10Klausman) [08:29:08] (03CR) 10DCausse: "we should bring in WCQS I think to clarify how the two apps will be configured" [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [08:33:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [08:34:16] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:35:51] (03PS3) 10Klausman: images: Add Go 1.21 images, based on bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967228 [08:36:29] (03CR) 10Klausman: images: Add Go 1.21 images, based on bookworm (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967228 (owner: 10Klausman) [08:37:05] (03CR) 10JMeybohm: [C: 03+1] images: Add Go 1.21 images, based on bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967228 (owner: 10Klausman) [08:38:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [08:41:52] RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:42:06] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:43:48] !log brouberol@cumin1001 START - Cookbook sre.hosts.decommission for hosts kafka-jumbo1006.eqiad.wmnet [08:45:49] !log brouberol@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts kafka-jumbo1006.eqiad.wmnet [08:46:10] there's still a reference to these hosts in helmfile.d/services/_mediawiki-common_/.fixtures.yaml [08:48:59] (PuppetZeroResources) firing: Puppet has failed generate resources on puppetdb1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:50:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [08:55:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [08:55:18] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:57:58] (03PS1) 10Brouberol: Replace the IPs of kafka-jumbo100[1-6] with kafka-jumbo101[0-5] [deployment-charts] - 10https://gerrit.wikimedia.org/r/967400 (https://phabricator.wikimedia.org/T336044) [09:00:54] (03CR) 10Btullis: [C: 03+1] Replace the IPs of kafka-jumbo100[1-6] with kafka-jumbo101[0-5] [deployment-charts] - 10https://gerrit.wikimedia.org/r/967400 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [09:01:41] (03CR) 10Brouberol: [C: 03+2] Replace the IPs of kafka-jumbo100[1-6] with kafka-jumbo101[0-5] [deployment-charts] - 10https://gerrit.wikimedia.org/r/967400 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [09:01:44] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/966871 (https://phabricator.wikimedia.org/T349195) (owner: 10Majavah) [09:02:24] (03Merged) 10jenkins-bot: Replace the IPs of kafka-jumbo100[1-6] with kafka-jumbo101[0-5] [deployment-charts] - 10https://gerrit.wikimedia.org/r/967400 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [09:07:55] (03CR) 10Majavah: [C: 03+2] openstack: encapi: don't try to close the connection [puppet] - 10https://gerrit.wikimedia.org/r/966871 (https://phabricator.wikimedia.org/T349195) (owner: 10Majavah) [09:10:06] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active - NTT, AS2914/IPv6: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:11:24] RECOVERY - SSH on wdqs1024 is OK: 
SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:11:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:52] RECOVERY - Check systemd state on an-airflow1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:12:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [09:12:14] RECOVERY - Check systemd state on an-airflow1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:12:18] RECOVERY - Check systemd state on an-airflow1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [09:19:50] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10MatthewVernon) You can see metrics about swift's memcached usage i... 
[09:21:52] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:23:16] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:28:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on puppetdb1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:35:52] (03PS1) 10JMeybohm: Update eventstreams to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967402 (https://phabricator.wikimedia.org/T300033) [09:36:32] PROBLEM - thanos.wikimedia.org tls expiry on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [09:36:34] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:37:26] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:37:26] PROBLEM - thanos.wikimedia.org requires authentication on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [09:37:36] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:38:47] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10Vgutierrez) >>! 
In T348837#9253425, @cmooney wrote: > Regarding the UDP encapsulation it's an interesting idea, and is a reminder that currently our switches distribute flows based... [09:39:03] (03PS1) 10JMeybohm: Update calculator-service to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967403 (https://phabricator.wikimedia.org/T346638) [09:39:26] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:40:31] (03PS1) 10Brouberol: Export the certificate path as a label of the expiration date metric [puppet] - 10https://gerrit.wikimedia.org/r/967404 (https://phabricator.wikimedia.org/T329398) [09:41:54] (03PS1) 10JMeybohm: Update mobileapps to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967405 (https://phabricator.wikimedia.org/T300033) [09:47:44] (03CR) 10Btullis: [C: 03+1] Export the certificate path as a label of the expiration date metric [puppet] - 10https://gerrit.wikimedia.org/r/967404 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [09:47:58] (03PS1) 10Kevin Bazira: ml-services: update the recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/966836 (https://phabricator.wikimedia.org/T348607) [09:48:11] (03CR) 10Brouberol: [C: 03+2] Export the certificate path as a label of the expiration date metric [puppet] - 10https://gerrit.wikimedia.org/r/967404 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [09:48:38] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:48:39] (03PS1) 10JMeybohm: 
Update recommendation-api to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967406 (https://phabricator.wikimedia.org/T300033) [09:49:26] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:49:41] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/105/co" [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) (owner: 10Fabfur) [09:49:43] (03PS1) 10Hashar: Add a json representation of the build [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 [09:50:01] (03Abandoned) 10Hashar: Add a json representation of the build [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/967214 (owner: 10Hashar) [09:50:38] (03CR) 10Hashar: [C: 04-1] "Same as https://gerrit.wikimedia.org/r/c/operations/software/puppet-compiler/+/967407 but on 2.x branch." [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 (owner: 10Hashar) [09:52:50] (03PS1) 10Elukey: profile::thanos: improve Istio SLO recording rules [puppet] - 10https://gerrit.wikimedia.org/r/967408 (https://phabricator.wikimedia.org/T349072) [09:53:12] (03CR) 10Elukey: [C: 03+1] ml-services: update the recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/966836 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [09:54:20] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (10ayounsi) We already have it in Kibana, but the LibreNMS UI is quite convenient and we send more verbose logs for alerting there. 
The solution is probably to reduce the r... [09:54:38] (03PS2) 10Elukey: profile::thanos: improve Istio SLO recording rules [puppet] - 10https://gerrit.wikimedia.org/r/967408 (https://phabricator.wikimedia.org/T349072) [09:55:22] (03CR) 10Kevin Bazira: [C: 03+2] ml-services: update the recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/966836 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [09:56:11] (03Merged) 10jenkins-bot: ml-services: update the recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/966836 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [09:56:31] (03PS3) 10Elukey: profile::thanos: improve Istio SLO recording rules [puppet] - 10https://gerrit.wikimedia.org/r/967408 (https://phabricator.wikimedia.org/T349072) [09:57:10] (03PS1) 10JMeybohm: Update shellbox to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967410 (https://phabricator.wikimedia.org/T300033) [09:58:43] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[09:58:48] (03PS2) 10Slavina Stefanova: harbor: upgrade from 2.5 to 2.9 [puppet] - 10https://gerrit.wikimedia.org/r/966874 (https://phabricator.wikimedia.org/T346241) [09:59:33] (03CR) 10Jbond: [C: 04-1] [BETA HACK] Attempt to secure Puppet DB better (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941476 (owner: 10Krinkle) [10:00:57] (03PS1) 10JMeybohm: Update termbox to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967412 (https://phabricator.wikimedia.org/T300033) [10:01:14] (03CR) 10Klausman: [C: 03+1] profile::thanos: improve Istio SLO recording rules [puppet] - 10https://gerrit.wikimedia.org/r/967408 (https://phabricator.wikimedia.org/T349072) (owner: 10Elukey) [10:03:14] (03PS1) 10JMeybohm: Update wikifeeds to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967414 (https://phabricator.wikimedia.org/T300033) [10:04:44] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [10:05:07] (03PS1) 10JMeybohm: Update zotero to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967415 (https://phabricator.wikimedia.org/T300033) [10:05:52] (03PS3) 10Slavina Stefanova: harbor: upgrade from 2.5 to 2.9 [puppet] - 10https://gerrit.wikimedia.org/r/966874 (https://phabricator.wikimedia.org/T346241) [10:07:47] (NELNotReported) firing: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [10:07:48] RECOVERY - thanos.wikimedia.org requires authentication on titan1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 545 bytes in 1.776 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:08:16] (SLOMetricAbsent) firing: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [10:08:17] (ThanosRuleIsDown) firing: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleIsDown [10:08:21] (ThanosStoreIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosStoreIsDown [10:08:25] (ThanosSidecarIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarIsDown [10:08:30] (ThanosQueryIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryIsDown [10:08:38] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:10:13] (03PS1) 10Ayounsi: Add PTR include for lsw1-f8 v6 uplinks IPs [dns] - 10https://gerrit.wikimedia.org/r/967416 [10:10:51] (03PS2) 10Ayounsi: Add PTR include for lsw1-f8 v6 uplinks IPs [dns] - 10https://gerrit.wikimedia.org/r/967416 [10:10:53] (03CR) 10Filippo Giunchedi: [C: 03+1] "Very nice! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/967408 (https://phabricator.wikimedia.org/T349072) (owner: 10Elukey) [10:11:48] (03CR) 10Filippo Giunchedi: [C: 03+1] "Of course! 
Thank you, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/967291 (https://phabricator.wikimedia.org/T349102) (owner: 10Herron) [10:12:04] PROBLEM - thanos.wikimedia.org requires authentication on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:12:06] (03CR) 10Ayounsi: [C: 03+2] Add PTR include for lsw1-f8 v6 uplinks IPs [dns] - 10https://gerrit.wikimedia.org/r/967416 (owner: 10Ayounsi) [10:12:29] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add A/PTR for lsw1-f8/ssw links - ayounsi@cumin1001" [10:12:47] (NELNotReported) resolved: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [10:13:17] (ThanosStoreIsDown) resolved: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosStoreIsDown [10:13:17] (ThanosRuleIsDown) resolved: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleIsDown [10:13:17] (ThanosQueryIsDown) resolved: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryIsDown [10:13:21] (ThanosSidecarIsDown) resolved: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarIsDown [10:13:33] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add A/PTR for lsw1-f8/ssw links - ayounsi@cumin1001" [10:13:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:13:38] (JobUnavailable) firing: (7) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:14:26] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:14:32] (03CR) 10Elukey: [C: 03+2] profile::thanos: improve Istio SLO recording rules [puppet] - 10https://gerrit.wikimedia.org/r/967408 (https://phabricator.wikimedia.org/T349072) (owner: 10Elukey) [10:15:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [10:18:27] sigh titan1001 isn't happy, I'll powercycle [10:19:21] !log powercycle titan1001 [10:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [10:21:10] PROBLEM - Host titan1001 is DOWN: PING CRITICAL - Packet loss = 100% [10:22:56] RECOVERY - Host titan1001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [10:23:00] RECOVERY - thanos.wikimedia.org requires authentication on titan1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:23:21] (03CR) 10Elukey: images: Add Go 1.21 images, based on bookworm (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967228 (owner: 10Klausman) [10:23:24] RECOVERY - thanos.wikimedia.org tls expiry on titan1001 is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Fri 03 Nov 2023 08:51:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:23:24] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:23:38] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:24:12] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:24:26] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:25:44] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools 
are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:27:51] (03CR) 10Hashar: Add a json representation of the build (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/967214 (owner: 10Hashar) [10:28:16] (SLOMetricAbsent) resolved: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [10:32:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [10:33:32] (03PS2) 10Kevin Bazira: ml-services: update recommendation-api-ng max_candidates [deployment-charts] - 10https://gerrit.wikimedia.org/r/966830 (https://phabricator.wikimedia.org/T347475) [10:34:10] (03CR) 10Elukey: [C: 03+1] ml-services: update recommendation-api-ng max_candidates [deployment-charts] - 10https://gerrit.wikimedia.org/r/966830 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [10:34:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:35:41] (03CR) 10Kevin Bazira: [C: 03+2] "Thanks for the reviews :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966830 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [10:36:33] (03Merged) 10jenkins-bot: ml-services: update recommendation-api-ng max_candidates [deployment-charts] - 10https://gerrit.wikimedia.org/r/966830 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira) [10:37:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [10:39:09] (03PS1) 10Jbond: wmflib: update secret to only use binary on puppet6+ [puppet] - 10https://gerrit.wikimedia.org/r/967421 (https://phabricator.wikimedia.org/T349291) [10:39:11] (03PS1) 10Jbond: cassandra::instance: update to use wmflib::secret [puppet] - 10https://gerrit.wikimedia.org/r/967422 (https://phabricator.wikimedia.org/T349291) [10:39:36] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [10:39:52] (03CR) 10CI reject: [V: 04-1] wmflib: update secret to only use binary on puppet6+ [puppet] - 10https://gerrit.wikimedia.org/r/967421 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [10:44:37] (03CR) 10Vgutierrez: [C: 03+1] "change looks good but it doesn't meet the requirements of the task, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) (owner: 10Fabfur) [10:45:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:15] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [10:46:54] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[10:48:15] (03PS1) 10Elukey: slo_definitions: update all dashboards with the new Istio SLI metrics [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/967423 (https://phabricator.wikimedia.org/T349072) [10:49:56] (03CR) 10Vgutierrez: "looking good, see inline comments" [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [10:51:46] (03PS16) 10Jbond: compile_redirects: port compile_redirects to new API [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [10:51:48] (03PS1) 10Jbond: compile_redirects: move function to wmflib and namespace [puppet] - 10https://gerrit.wikimedia.org/r/967424 [10:51:50] (03PS1) 10Jbond: wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) [10:52:32] (03PS2) 10Jbond: compile_redirects: move function to wmflib and namespace [puppet] - 10https://gerrit.wikimedia.org/r/967424 (https://phabricator.wikimedia.org/T348883) [10:52:46] (03PS2) 10Jbond: wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) [10:52:59] (03PS17) 10Jbond: compile_redirects: port compile_redirects to new API [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [10:53:07] (03PS18) 10Jbond: compile_redirects: port compile_redirects to new API [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [10:53:11] 10SRE, 10Traffic, 10Patch-For-Review: HAProxy should use a single backend for Vanish - https://phabricator.wikimedia.org/T349287 (10Fabfur) [10:56:57] (03CR) 10CI reject: [V: 04-1] compile_redirects: move function to wmflib and namespace [puppet] - 10https://gerrit.wikimedia.org/r/967424 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [10:57:13] (03CR) 10CI reject: [V: 04-1] 
wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [10:57:21] (03Abandoned) 10Jbond: compile_redirects: move function to wmflib and namespace [puppet] - 10https://gerrit.wikimedia.org/r/967424 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [10:59:05] (03PS1) 10Elukey: profile::thanos: follow up on Istio SLO recording rules [puppet] - 10https://gerrit.wikimedia.org/r/967426 (https://phabricator.wikimedia.org/T349072) [11:00:13] (03PS2) 10Elukey: profile::thanos: follow up on Istio SLO recording rules [puppet] - 10https://gerrit.wikimedia.org/r/967426 (https://phabricator.wikimedia.org/T349072) [11:01:07] (03CR) 10jenkins-bot: wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [11:04:21] (03PS3) 10Jbond: wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) [11:04:23] (03PS19) 10Jbond: compile_redirects: port compile_redirects to new API [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [11:07:17] (03CR) 10CI reject: [V: 04-1] wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [11:08:38] (03PS4) 10Jbond: wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) [11:08:40] (03PS20) 10Jbond: compile_redirects: port compile_redirects to new API [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [11:11:11] (03CR) 10Michael Große: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/967429 (https://phabricator.wikimedia.org/T348923) (owner: 10Michael Große) [11:11:48] (03PS14) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [11:12:31] (03CR) 10CI reject: [V: 04-1] wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [11:13:17] (03PS5) 10Jbond: wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) [11:13:19] (03PS21) 10Jbond: compile_redirects: port compile_redirects to new API [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [11:17:16] (03CR) 10CI reject: [V: 04-1] wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [11:18:09] (03CR) 10Jbond: "thanks response inline" [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [11:18:28] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Apparently the new property is a strict superset:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967429 (https://phabricator.wikimedia.org/T348923) (owner: 10Michael Große) [11:22:04] (03PS6) 10Jbond: wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) [11:22:06] (03PS22) 10Jbond: compile_redirects: port compile_redirects to new API [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [11:24:06] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [11:24:53] (03CR) 10Jbond: [V: 
03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/108/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [11:30:01] (03PS7) 10Jbond: wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) [11:31:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [11:34:01] (03PS8) 10Jbond: wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) [11:34:15] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [11:35:27] (03PS1) 10Hnowlan: Revert "restbase: disable per-host icinga checks" [puppet] - 10https://gerrit.wikimedia.org/r/967202 [11:36:03] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [11:36:36] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host moss-be1003.eqiad.wmnet with OS bullseye [11:37:08] (03Abandoned) 10Jbond: compile_redirects: port compile_redirects to new API [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [11:37:51] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bullseye [11:37:58] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye [11:38:00] (03CR) 10CI reject: [V: 04-1] Revert "restbase: disable per-host icinga checks" [puppet] - 10https://gerrit.wikimedia.org/r/967202 (owner: 10Hnowlan) [11:40:02] (03PS9) 10Jbond: wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) [11:41:10] (03CR) 10Fabfur: haproxy: enable healthcheck-dedicated backend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [11:42:03] !log refactoring tables @ db1164[bbackups] T349360 [11:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:09] T349360: Clean up dbbackups.backup_files table - https://phabricator.wikimedia.org/T349360 [11:42:20] (03PS2) 10Hnowlan: Revert "restbase: disable per-host icinga checks" [puppet] - 10https://gerrit.wikimedia.org/r/967202 [11:43:05] (03CR) 10CI reject: [V: 04-1] wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [11:51:22] (03PS1) 10Btullis: Fix issues with multiple spark shufflers specific to version 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) [11:51:48] (03CR) 10CI reject: [V: 04-1] Fix issues with multiple spark shufflers specific to version 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [11:51:53] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/113/co" [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [11:52:06] !log jclark@cumin1001 START - Cookbook 
sre.hosts.downtime for 2:00:00 on moss-be1003.eqiad.wmnet with reason: host reimage [11:55:18] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1003.eqiad.wmnet with reason: host reimage [11:58:37] (03PS15) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [11:58:50] (03PS10) 10Jbond: wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) [11:59:33] (03PS11) 10Jbond: wmflib::compile_redirects: convert to modern API [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) [11:59:39] (03CR) 10Fabfur: [C: 04-1] "Sample hieradata switch on cp4037 to test with PCC, not ready for merge" [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [12:03:54] (03PS16) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [12:03:59] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [12:05:24] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/116/con" [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [12:07:15] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:11:43] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be1003.eqiad.wmnet with OS bullseye [12:11:49] 10SRE, 
10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host moss-be1003.eqiad.wmnet with OS bullseye completed: - moss-... [12:18:31] (03PS2) 10Jbond: wmflib: update secret to only use binary on puppet6+ [puppet] - 10https://gerrit.wikimedia.org/r/967421 (https://phabricator.wikimedia.org/T349291) [12:18:36] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/967422 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [12:18:55] (03PS2) 10Jbond: cassandra::instance: update to use wmflib::secret [puppet] - 10https://gerrit.wikimedia.org/r/967422 (https://phabricator.wikimedia.org/T349291) [12:20:19] (03PS1) 10Hnowlan: media-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/967438 (https://phabricator.wikimedia.org/T347899) [12:20:59] (03CR) 10Klausman: [V: 03+2 C: 03+2] images: Add Go 1.21 images, based on bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967228 (owner: 10Klausman) [12:21:35] (03CR) 10CI reject: [V: 04-1] wmflib: update secret to only use binary on puppet6+ [puppet] - 10https://gerrit.wikimedia.org/r/967421 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [12:22:23] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/967422 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [12:23:21] (03PS3) 10Bartosz Dziewoński: Stop writing to $wgCentralAuthCookieDomain in 'EnterMobileMode' hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967302 [12:23:57] (03CR) 10Bartosz Dziewoński: Stop writing to $wgCentralAuthCookieDomain in 'EnterMobileMode' hook (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967302 (owner: 10Bartosz Dziewoński) [12:26:55] (03PS3) 10Jbond: wmflib: update secret to only use binary on puppet6+ 
[puppet] - 10https://gerrit.wikimedia.org/r/967421 (https://phabricator.wikimedia.org/T349291) [12:29:00] 10SRE, 10All-and-every-Wiktionary, 10Product-Analytics, 10WMF-Communications, 10SEO: Google displays “Wikipedia” as site title for some Wiktionary pages - https://phabricator.wikimedia.org/T348203 (10Pamputt) [12:29:58] (03CR) 10CI reject: [V: 04-1] wmflib: update secret to only use binary on puppet6+ [puppet] - 10https://gerrit.wikimedia.org/r/967421 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [12:33:36] (03PS2) 10Hashar: Add a json representation of the build [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 [12:33:38] (03PS1) 10Hashar: Remove `defaultbranch=master` from .gitreview [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967440 (https://phabricator.wikimedia.org/T146293) [12:33:40] (03PS1) 10Hashar: tox: add HTML and branch coverage to pytest [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967441 [12:34:50] (03CR) 10Hashar: "The rationale is in the task, specially T146293#2658764" [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967440 (https://phabricator.wikimedia.org/T146293) (owner: 10Hashar) [12:35:58] (03CR) 10Hashar: "I find it convenient to run pytest then refresh the generated HTML coverage report to keep track of my progress toward covering code. 
Bra" [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967441 (owner: 10Hashar) [12:37:01] (03CR) 10CI reject: [V: 04-1] Add a json representation of the build [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 (owner: 10Hashar) [12:40:03] (03CR) 10Hashar: Add a json representation of the build (033 comments) [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 (owner: 10Hashar) [12:44:21] (03PS3) 10Jbond: cassandra::instance: update to use wmflib::secret [puppet] - 10https://gerrit.wikimedia.org/r/967422 (https://phabricator.wikimedia.org/T349291) [12:49:33] (03PS1) 10Elukey: ml-services: set OMP_NUM_THREADS in all revertrisk isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/967442 (https://phabricator.wikimedia.org/T347550) [12:55:18] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:55:33] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond) To debug this issue i did the following ` --- /usr/lib/ruby/vendor_ruby/puppet/network/format_support.rb 2023-10... 
[12:55:56] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: update reload endpoint to reflect updated web prefix [puppet] - 10https://gerrit.wikimedia.org/r/967291 (https://phabricator.wikimedia.org/T349102) (owner: 10Herron) [12:56:47] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::thanos: follow up on Istio SLO recording rules [puppet] - 10https://gerrit.wikimedia.org/r/967426 (https://phabricator.wikimedia.org/T349072) (owner: 10Elukey) [12:57:19] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: set OMP_NUM_THREADS in all revertrisk isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/967442 (https://phabricator.wikimedia.org/T347550) (owner: 10Elukey) [12:58:36] (03PS4) 10Jbond: cassandra::instance: update to use wmflib::secret [puppet] - 10https://gerrit.wikimedia.org/r/967422 (https://phabricator.wikimedia.org/T349291) [12:59:48] 10SRE, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [13:00:35] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/967422 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [13:01:19] (03CR) 10Btullis: "Looks good. I agree that this should be data-engineering team instead of sre team, but this all falls under the T345698 epic anyway." 
[alerts] - 10https://gerrit.wikimedia.org/r/967409 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [13:03:34] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:35] (03PS2) 10Btullis: Fix issues with multiple spark shufflers specific to version 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) [13:05:13] (03PS4) 10Jbond: wmflib: update secret to only use binary on puppet6+ [puppet] - 10https://gerrit.wikimedia.org/r/967421 (https://phabricator.wikimedia.org/T349291) [13:05:15] (03PS5) 10Jbond: cassandra::instance: update to use wmflib::secret [puppet] - 10https://gerrit.wikimedia.org/r/967422 (https://phabricator.wikimedia.org/T349291) [13:05:33] (03CR) 10Vgutierrez: "looking good, I guess common.yaml data needs to be completed for the whole set of lvs" [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [13:06:13] (03PS1) 10JMeybohm: CI: Properly detect changes to link targets in helmfile.d/*services [deployment-charts] - 10https://gerrit.wikimedia.org/r/967445 (https://phabricator.wikimedia.org/T300033) [13:07:25] (03CR) 10CI reject: [V: 04-1] CI: Properly detect changes to link targets in helmfile.d/*services [deployment-charts] - 10https://gerrit.wikimedia.org/r/967445 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:07:54] (03PS1) 10Jbond: add bin file [labs/private] - 10https://gerrit.wikimedia.org/r/967446 [13:08:34] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:08:52] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 28): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/117/console" [puppet] - 10https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:09:21] (03CR) 10CI reject: [V: 04-1] wmflib: update secret to only use binary on puppet6+ [puppet] - 10https://gerrit.wikimedia.org/r/967421 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [13:12:06] (03CR) 10Bartosz Dziewoński: "Scheduled: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231023T1300" [extensions/CentralAuth] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967195 (https://phabricator.wikimedia.org/T348852) (owner: 10Bartosz Dziewoński) [13:12:26] (03CR) 10Bartosz Dziewoński: "Scheduled: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231023T1300" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966915 (owner: 10Bartosz Dziewoński) [13:12:32] (03CR) 10Bartosz Dziewoński: "Scheduled: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231023T1300" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966919 (https://phabricator.wikimedia.org/T131183) (owner: 10Bartosz Dziewoński) [13:12:38] (03CR) 10Bartosz Dziewoński: "Scheduled: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231023T1300" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967302 (owner: 10Bartosz Dziewoński) [13:13:15] (03PS2) 10JMeybohm: CI: Properly detect changes to link targets in helmfile.d/*services [deployment-charts] - 10https://gerrit.wikimedia.org/r/967445 (https://phabricator.wikimedia.org/T300033) [13:13:17] (03PS1) 10JMeybohm: Remove ores listener from mediawiki fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/967447 (https://phabricator.wikimedia.org/T347278) [13:13:19] (03PS3) 10Hashar: Add a json representation of the build [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 [13:14:19] (03PS1) 10Btullis: 
Enable the multiple spark shufflers on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) [13:14:59] (03CR) 10Bartosz Dziewoński: "This needs to be backported, because the data it reads and writes is shared among all wikis, and when we have both wmf.1 and wmf.2 in prod" [extensions/CentralAuth] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967195 (https://phabricator.wikimedia.org/T348852) (owner: 10Bartosz Dziewoński) [13:16:00] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10Jclark-ctr) 05Open→03Resolved @MatthewVernon reimaged with instructions from papaul If you need anything else let me know [13:16:26] (03CR) 10JMeybohm: "Welcome back 😋" [deployment-charts] - 10https://gerrit.wikimedia.org/r/967445 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:19:05] !log Disabling BGP from ssw1-e1-eqiad to cr1-eqiad to move BGP peers to new group T349125 [13:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:12] T349125: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 [13:19:27] (03CR) 10Elukey: [C: 03+2] profile::thanos: follow up on Istio SLO recording rules [puppet] - 10https://gerrit.wikimedia.org/r/967426 (https://phabricator.wikimedia.org/T349072) (owner: 10Elukey) [13:19:54] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 20 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:21:57] (03CR) 10Elukey: [C: 03+1] Remove ores listener from mediawiki fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/967447 (https://phabricator.wikimedia.org/T347278) (owner: 10JMeybohm) 
[13:23:14] (03PS3) 10Btullis: Fix issues with multiple spark shufflers specific to version 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) [13:23:16] (03PS2) 10Btullis: Enable the multiple spark shufflers on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) [13:23:45] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10MatthewVernon) Great, thank you :-) [13:25:23] (03CR) 10David Caro: [C: 03+1] "We should merge this before upgrading tools (and remove the cherry-pick from toolsbeta)" [puppet] - 10https://gerrit.wikimedia.org/r/966874 (https://phabricator.wikimedia.org/T346241) (owner: 10Slavina Stefanova) [13:26:14] (03PS5) 10Jbond: wmflib: update secret to only use binary on puppet6+ [puppet] - 10https://gerrit.wikimedia.org/r/967421 (https://phabricator.wikimedia.org/T349291) [13:26:16] (03PS6) 10Jbond: cassandra::instance: update to use wmflib::secret [puppet] - 10https://gerrit.wikimedia.org/r/967422 (https://phabricator.wikimedia.org/T349291) [13:27:52] (03CR) 10Elukey: [C: 03+2] ml-services: set OMP_NUM_THREADS in all revertrisk isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/967442 (https://phabricator.wikimedia.org/T347550) (owner: 10Elukey) [13:28:16] (03CR) 10David Caro: [C: 03+1] "LGTM, is there any way to get the output of the command?" 
[puppet] - 10https://gerrit.wikimedia.org/r/967219 (https://phabricator.wikimedia.org/T347683) (owner: 10Majavah) [13:28:45] (03CR) 10David Caro: [C: 03+1] kubeadm: drop version upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/966864 (https://phabricator.wikimedia.org/T343869) (owner: 10Majavah) [13:28:47] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 8 DIFF 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:29:07] (03PS1) 10Jbond: sretest: Add file to test binary content [puppet] - 10https://gerrit.wikimedia.org/r/967450 (https://phabricator.wikimedia.org/T349291) [13:29:10] (03CR) 10David Caro: [C: 03+1] kubeadm: drop default [puppet] - 10https://gerrit.wikimedia.org/r/966863 (owner: 10Majavah) [13:29:22] (03CR) 10Elukey: ml-services: deploy new Bullseye version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966199 (https://phabricator.wikimedia.org/T348647) (owner: 10Ilias Sarantopoulos) [13:29:33] (03CR) 10CI reject: [V: 04-1] sretest: Add file to test binary content [puppet] - 10https://gerrit.wikimedia.org/r/967450 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [13:29:40] (03CR) 10Jbond: [C: 03+2] sretest: Add file to test binary content [puppet] - 10https://gerrit.wikimedia.org/r/967450 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [13:30:33] (03CR) 10David Caro: [C: 03+1] P:toolforge: provision root sudo policy via here [puppet] - 10https://gerrit.wikimedia.org/r/965400 (owner: 10Majavah) [13:30:44] (03PS2) 10Jbond: sretest: Add file to test binary content [puppet] - 10https://gerrit.wikimedia.org/r/967450 (https://phabricator.wikimedia.org/T349291) [13:30:58] (03PS4) 10David Caro: P:cloudceph: cleanup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/961783 (owner: 10Majavah) [13:31:38] (03CR) 10David Caro: 
[C: 03+1] "I think I forgot to +1 xd" [puppet] - 10https://gerrit.wikimedia.org/r/961783 (owner: 10Majavah) [13:32:58] (03CR) 10JMeybohm: [C: 03+2] Remove ores listener from mediawiki fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/967447 (https://phabricator.wikimedia.org/T347278) (owner: 10JMeybohm) [13:33:30] PROBLEM - BGP status on ssw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:33:44] (03CR) 10Jbond: [C: 03+2] sretest: Add file to test binary content [puppet] - 10https://gerrit.wikimedia.org/r/967450 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [13:33:56] (03CR) 10Majavah: [V: 03+1 C: 03+2] kubeadm: drop default [puppet] - 10https://gerrit.wikimedia.org/r/966863 (owner: 10Majavah) [13:34:14] (03Merged) 10jenkins-bot: Remove ores listener from mediawiki fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/967447 (https://phabricator.wikimedia.org/T347278) (owner: 10JMeybohm) [13:34:17] (03CR) 10Majavah: [C: 03+2] kubeadm: drop version upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/966864 (https://phabricator.wikimedia.org/T343869) (owner: 10Majavah) [13:34:29] taavi: happy for m to merge your kubeadm cr [13:34:35] jbond: please do [13:35:04] done [13:36:02] (03CR) 10Slavina Stefanova: harbor: upgrade from 2.5 to 2.9 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966874 (https://phabricator.wikimedia.org/T346241) (owner: 10Slavina Stefanova) [13:38:13] (03PS1) 10Jbond: sretest: drop user group [puppet] - 10https://gerrit.wikimedia.org/r/967452 [13:38:27] (03CR) 10Jbond: [C: 03+2] sretest: drop user group [puppet] - 10https://gerrit.wikimedia.org/r/967452 (owner: 10Jbond) [13:39:51] (03PS2) 10Jforrester: [wikifunctions] Alter site to General Availability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966570 (https://phabricator.wikimedia.org/T349054) [13:46:11] (03Abandoned) 10Ssingh: 
hiera: add host override for cp1076 [puppet] - 10https://gerrit.wikimedia.org/r/967239 (https://phabricator.wikimedia.org/T349244) (owner: 10Ssingh) [13:47:37] (03PS5) 10Brouberol: Monitor the expiration date of the skein x509 certificates [alerts] - 10https://gerrit.wikimedia.org/r/967409 (https://phabricator.wikimedia.org/T329398) [13:47:58] (03PS1) 10Ssingh: hiera: add host override for cp1077 [puppet] - 10https://gerrit.wikimedia.org/r/967457 (https://phabricator.wikimedia.org/T349244) [13:48:22] (03CR) 10Elukey: "This change is ready for review." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967456 (owner: 10Klausman) [13:49:06] (03CR) 10Elukey: [C: 03+1] "LGTM, let's also wait for Janis' +1" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967456 (owner: 10Klausman) [13:49:13] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/120/con" [puppet] - 10https://gerrit.wikimedia.org/r/967457 (https://phabricator.wikimedia.org/T349244) (owner: 10Ssingh) [13:50:01] (03CR) 10Ssingh: [C: 04-2] "PCC test, do not merge." [puppet] - 10https://gerrit.wikimedia.org/r/967457 (https://phabricator.wikimedia.org/T349244) (owner: 10Ssingh) [13:50:03] (03CR) 10JMeybohm: [C: 03+1] "Totally missed that. LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967456 (owner: 10Klausman) [13:50:23] (03CR) 10Klausman: [V: 03+2 C: 03+2] images: Fix missing -1 tag on golang 1.21 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967456 (owner: 10Klausman) [13:50:35] (03CR) 10Brouberol: "I have addressed your suggestions!" 
[alerts] - 10https://gerrit.wikimedia.org/r/967409 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [13:50:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10cmooney) @Jclark-ctr these hosts are for a new proof-of-concept cloud openstack deployment. As such the [[ https://wikitech.wi... [13:51:32] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T348706 (10VRiley-WMF) Adjusted power connections to free up load on PDU. PDU should be stable. [13:52:04] (03Abandoned) 10Ssingh: hiera: add host override for cp1077 [puppet] - 10https://gerrit.wikimedia.org/r/967457 (https://phabricator.wikimedia.org/T349244) (owner: 10Ssingh) [13:52:13] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond) It seems that this issues is more an issue with pcc which uses json but its not an issue for actually catalogue ru... [13:52:17] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T348706 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF [13:53:50] (03CR) 10Ssingh: [C: 03+1] "Verified this with existing hosts in text and upload to ensure that the single backend configuration is correctly applied." 
[puppet] - 10https://gerrit.wikimedia.org/r/967235 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur) [13:55:18] (03PS4) 10Btullis: Fix issues with multiple spark shufflers specific to version 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) [13:55:20] (03PS3) 10Btullis: Enable the multiple spark shufflers on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) [13:57:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10cmooney) >>! In T342455#9268504, @cmooney wrote: > @Jclark-ctr these hosts are for a new proof-of-concept cloud openstack deplo... [13:58:02] (03CR) 10Ssingh: [C: 03+1] hiera: enable dual disk storage for new cp hosts in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967235 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur) [14:00:12] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:00:43] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 20 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [14:00:45] (03CR) 10David Caro: [C: 03+1] "yep, once we moved out of hardware this was not needed." 
[puppet] - 10https://gerrit.wikimedia.org/r/961787 (owner: 10Majavah) [14:01:07] (03CR) 10Majavah: [C: 03+2] toolsdb_replica_cnf: Remove firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/961787 (owner: 10Majavah) [14:01:27] (03CR) 10David Caro: [C: 03+1] P:toolforge: docker: enable --delete for the registry rsync [puppet] - 10https://gerrit.wikimedia.org/r/937454 (owner: 10Majavah) [14:02:59] (03CR) 10David Caro: [C: 04-1] "We moved this to gitlab, I think I missed it when I did the move, can you migrate it if needed? thanks!" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/929009 (owner: 10Majavah) [14:04:25] (03CR) 10Klausman: "This change is ready for review." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (owner: 10Klausman) [14:04:49] (03PS5) 10Btullis: Fix issues with multiple spark shufflers specific to version 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) [14:04:51] (03PS4) 10Btullis: Enable the multiple spark shufflers on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) [14:04:56] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:toolforge: docker: enable --delete for the registry rsync [puppet] - 10https://gerrit.wikimedia.org/r/937454 (owner: 10Majavah) [14:07:31] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (10dcaro) This unblocked this issue and made tox pass: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/967244... 
[14:10:09] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 8 DIFF 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [14:13:24] (03CR) 10JHathaway: [C: 03+1] "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/967425 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [14:13:26] RECOVERY - BGP status on ssw1-e1-eqiad.mgmt is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:15:44] (03PS6) 10Btullis: Fix issues with multiple spark shufflers specific to version 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) [14:15:46] (03PS5) 10Btullis: Enable the multiple spark shufflers on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) [14:18:16] 10SRE, 10ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (10VRiley-WMF) a:03VRiley-WMF [14:21:05] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 20 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [14:21:26] (03PS7) 10Btullis: Fix issues with multiple spark shufflers specific to version 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) [14:21:28] (03PS6) 10Btullis: Enable the multiple spark shufflers on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) [14:23:08] (03PS1) 10Jbond: puppet_compiler: add hack to fall back to pson [puppet] - 10https://gerrit.wikimedia.org/r/967464 (https://phabricator.wikimedia.org/T349291) [14:23:55] (03CR) 10Elukey: "All 
the other images in the kserve dir are missing, let's do all of them to make sure everything builds fine rather than one at the time." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (owner: 10Klausman) [14:24:21] (03CR) 10Jbond: [C: 03+2] puppet_compiler: add hack to fall back to pson [puppet] - 10https://gerrit.wikimedia.org/r/967464 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [14:24:26] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:26:32] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 20 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [14:27:44] (03PS1) 10Jbond: Revert "sretest: Add file to test binary content" [puppet] - 10https://gerrit.wikimedia.org/r/967203 [14:27:59] (03CR) 10CI reject: [V: 04-1] Revert "sretest: Add file to test binary content" [puppet] - 10https://gerrit.wikimedia.org/r/967203 (owner: 10Jbond) [14:28:24] (03PS1) 10Jbond: puppet_compiler: add missing file [puppet] - 10https://gerrit.wikimedia.org/r/967465 [14:28:54] (03CR) 10CI reject: [V: 04-1] puppet_compiler: add missing file [puppet] - 10https://gerrit.wikimedia.org/r/967465 (owner: 10Jbond) [14:30:11] (03PS2) 10Jbond: Revert "sretest: Add file to test binary content" [puppet] - 10https://gerrit.wikimedia.org/r/967203 [14:30:43] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet_compiler: add missing file [puppet] - 10https://gerrit.wikimedia.org/r/967465 (owner: 10Jbond) [14:31:14] (03CR) 10Jbond: [C: 03+2] Revert "sretest: Add file to test binary content" [puppet] - 10https://gerrit.wikimedia.org/r/967203 (owner: 10Jbond) 
[14:32:04] (03PS1) 10Hnowlan: thumbor: pin image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/967466 (https://phabricator.wikimedia.org/T348856) [14:32:15] (03PS8) 10Btullis: Fix issues with multiple spark shufflers specific to version 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) [14:32:17] (03PS7) 10Btullis: Enable the multiple spark shufflers on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) [14:32:41] (03CR) 10CI reject: [V: 04-1] Revert "sretest: Add file to test binary content" [puppet] - 10https://gerrit.wikimedia.org/r/967203 (owner: 10Jbond) [14:33:24] !log Disabling BGP from ssw1-f1-eqiad to cr2-eqiad to move BGP peers to new group T349125 [14:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:35] T349125: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 [14:37:12] PROBLEM - BGP status on ssw1-f1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:37:25] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (10fnegri) Unfortunately, I think this specific bug still exists, because there's no Python 3.11 wheel in PyPI: https:... [14:38:07] 10SRE, 10ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (10VRiley-WMF) @ayounsi Racked a test server (cloudvirt1024) into E8 as requested. 
[14:38:36] RECOVERY - BGP status on ssw1-f1-eqiad.mgmt is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:38:38] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:24] (03PS1) 10Jbond: sretest: remove old test [puppet] - 10https://gerrit.wikimedia.org/r/967467 [14:41:26] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 20 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [14:42:06] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/967467 (owner: 10Jbond) [14:42:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/126/console" [puppet] - 10https://gerrit.wikimedia.org/r/967467 (owner: 10Jbond) [14:43:35] (03PS3) 10Ilias Sarantopoulos: ml-services: deploy new Bullseye version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966199 (https://phabricator.wikimedia.org/T348647) [14:44:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/127/con" [puppet] - 10https://gerrit.wikimedia.org/r/967467 (owner: 10Jbond) [14:48:21] (03CR) 10Elukey: "Checked all docker images versions, they look good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966199 (https://phabricator.wikimedia.org/T348647) (owner: 10Ilias Sarantopoulos) [14:53:38] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:57:32] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (10dcaro) yep, I tried a few combinations of elasticsearch-curator/pyyaml and such... it turns out that elasticseach-c... [14:57:42] !log Disabling BGP from asw1-b12-drmrs to cr1-drmrs to move BGP peers to new group T349125 [14:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:46] T349125: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 [14:58:46] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS4265006001/IPv4: Idle - asw1-b12-drmrs, AS4265006001/IPv6: Idle - asw1-b12-drmrs https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:01:23] ^^^ sry, this was me, its back up will clear now [15:01:29] about to do the same for cr2 [15:02:54] topranks: I figured :) [on-call so...] [15:03:17] sukhe: sorry! did those page? [15:03:34] not at all [15:04:01] I meant I am keeping track because of that and also because I have a highlight for BGP CRITICAL :P [15:05:24] I'll downtime anyway [15:05:39] nice touch with the highlight btw :) [15:07:24] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond) [15:08:36] !ack 4136 [15:08:37] 4136 (ACKED) kafka-jumbo1001/Kafka Broker Server (paged) [15:08:40] Here. 
[15:08:41] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 12 hosts with reason: changing bgp config on drmrs switches [15:08:52] here [15:09:01] !log Disabling BGP from asw1-b12-drmrs to cr2-drmrs to move BGP peers to new group T349125 [15:09:04] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 12 hosts with reason: changing bgp config on drmrs switches [15:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:06] T349125: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 [15:09:07] did someone ACK it? I didn't get any page [15:09:10] this might be from yesterday? [15:09:17] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b09e42f6-6ad2-4453-abab-27f0a3934508) set by... [15:09:18] !incidents [15:09:19] 4137 (ACKED) kafka-jumbo1002/Kafka Broker Server (paged) [15:09:19] 4138 (ACKED) kafka-jumbo1003/Kafka Broker Server (paged) [15:09:19] 4139 (RESOLVED) kafka-jumbo1004/Kafka Broker Server (paged) [15:09:25] sukhe: Yes, I've already ACK'd it. [15:09:30] denisse: sorry just saw [15:09:33] I think this is from yesterday [15:09:36] I just got p.aged re 4136 again [15:09:46] Yes, it looks like it's from yesterday. 
🤔 [15:10:14] !ack 4137 [15:10:15] 4137 (ACKED) kafka-jumbo1002/Kafka Broker Server (paged) [15:10:32] marking as resolved [15:10:35] yeah, looks like yesterday's p.age acks are all about to expire [15:11:00] marked all as resolved [15:11:02] !incidents [15:11:03] 4138 (RESOLVED) kafka-jumbo1003/Kafka Broker Server (paged) [15:11:03] 4139 (RESOLVED) kafka-jumbo1004/Kafka Broker Server (paged) [15:11:23] (03PS1) 10Jforrester: [Staging only] wikifunctions: Also over-ride the Python evaluator to WASM and bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/967471 [15:11:27] Is this all because the nodes weren't downtimed before yesterday's decommissioning? [15:11:33] Emperor: yeah [15:11:46] and we ACKed the page but never actually resolved them [15:12:02] le sigh [15:12:08] the hosts are still downtimed though [15:12:46] Emperor: could be worse, could be an actual page :P [15:13:08] !log Disabling BGP from asw1-b13-drmrs to cr2-drmrs to move BGP peers to new group T349125 [15:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:29] (03CR) 10Jforrester: [C: 03+2] [Staging only] wikifunctions: Also over-ride the Python evaluator to WASM and bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/967471 (owner: 10Jforrester) [15:13:47] lol, fair enough [15:14:24] (03Merged) 10jenkins-bot: [Staging only] wikifunctions: Also over-ride the Python evaluator to WASM and bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/967471 (owner: 10Jforrester) [15:14:59] I notice there are some more kafka silences that expire in 6 minutes [15:15:23] checking [15:15:28] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond) [15:15:33] oh uh right [15:15:37] brouberol: ^ [15:15:41] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:15:42] (I find the 
alerts.wm.org UI quite confusing for this) [15:16:06] brouberol: should we extend the downtime? it might page again otherwise [15:16:33] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:17:12] er wait [15:17:22] sorry, it's been downtimed till November 2 [15:17:24] we are good here! [15:18:19] Emperor: I checked the downtiming on Icinga [15:18:40] yes please [15:18:51] I'm going to decom the underlying nodes on monday [15:18:58] !log Disabling BGP from asw1-b13-drmrs to cr1-drmrs to move BGP peers to new group T349125 [15:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:07] T349125: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 [15:19:07] and set a 14d icinga downtime [15:19:58] thanks, that was a mistake on my part. it is indeed a 14d downtime [15:21:22] I think I may not have entirely grokked the icinga / alertmanager interplay yet [15:25:42] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 12 hosts with reason: changing bgp config on esams switches [15:26:04] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 12 hosts with reason: changing bgp config on esams switches [15:26:13] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6731cf5b-8a4f-4391-98fa-2900d5500bf5) set by... 
[15:36:02] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issues - https://phabricator.wikimedia.org/T349291 (10jbond) > graphite & cloudmetrics I have tracked this down to the `configparser_format` function [15:36:30] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@fd88cfa]: Update kafka hosts mjolnir communicates with [15:36:58] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@fd88cfa]: Update kafka hosts mjolnir communicates with (duration: 00m 27s) [15:39:44] !log Disabling BGP from asw1-bw27-esams to cr1-esams to move BGP peers to new group T349125 [15:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:49] T349125: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 [15:41:09] (03CR) 10Xcollazo: [C: 03+1] Enable the multiple spark shufflers on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [15:42:03] (03PS20) 10Brouberol: Define environment variables to ease the use of prometheus-metricsfetcher [puppet] - 10https://gerrit.wikimedia.org/r/967134 (https://phabricator.wikimedia.org/T349393) [15:45:16] (03CR) 10JMeybohm: [C: 04-1] thumbor: pin image versions (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967466 (https://phabricator.wikimedia.org/T348856) (owner: 10Hnowlan) [15:45:21] (03CR) 10Xcollazo: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [15:47:54] !log Disabling BGP from asw1-bw27-esams to cr2-esams to move BGP peers to new group T349125 [15:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:59] T349125: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 [15:48:34] (03PS1) 10JMeybohm: Update similar-users to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967473 (https://phabricator.wikimedia.org/T300033) [15:51:09] (03PS1) 10Jbond: graphite: migrate configparse to new puppet API [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) [15:55:51] !log Disabling BGP from asw1-by27-esams to cr2-esams to move BGP peers to new group T349125 [15:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:56] T349125: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 [15:56:18] (03PS2) 10Jbond: graphite: migrate configparse to new puppet API [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) [15:56:57] (03CR) 10CI reject: [V: 04-1] graphite: migrate configparse to new puppet API [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [15:59:32] !log Disabling BGP from asw1-by27-esams to cr1-esams to move BGP peers to new group T349125 [15:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:32] (03PS5) 10Cathal Mooney: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) [16:02:46] (03PS3) 10Jbond: graphite: migrate configparse to new puppet API [puppet] - 
10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) [16:03:29] (03CR) 10CI reject: [V: 04-1] graphite: migrate configparse to new puppet API [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [16:03:36] (03PS4) 10Jbond: graphite: migrate configparse to new puppet API [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) [16:03:39] (03PS7) 10Cathal Mooney: Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) [16:04:13] (03CR) 10CI reject: [V: 04-1] graphite: migrate configparse to new puppet API [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [16:06:07] (03PS5) 10Jbond: graphite: migrate configparse to new puppet API [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) [16:06:24] (03PS8) 10Cathal Mooney: Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) [16:06:44] (03CR) 10CI reject: [V: 04-1] graphite: migrate configparse to new puppet API [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [16:07:16] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:07:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/133/console" [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [16:09:05] (03PS6) 10Jbond: graphite: migrate 
configparse to new puppet API [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) [16:09:20] has anyone looked at the puppet agent run failure on lists1003? [16:09:24] https://puppetboard.wikimedia.org/node/lists1003.wikimedia.org [16:09:41] (03CR) 10CI reject: [V: 04-1] graphite: migrate configparse to new puppet API [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [16:10:19] > Can't connect to MySQL server on 'm5-master.eqiad.wmnet [16:11:17] (03PS7) 10Jbond: graphite: migrate configparse to new puppet API [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) [16:11:55] (03CR) 10CI reject: [V: 04-1] graphite: migrate configparse to new puppet API [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [16:18:09] (03PS8) 10Jbond: graphite: migrate configparse to new puppet API [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) [16:18:46] (03CR) 10CI reject: [V: 04-1] graphite: migrate configparse to new puppet API [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [16:19:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/136/console" [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [16:20:54] (03PS9) 10Jbond: graphite: migrate configparse to new puppet API [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) [16:22:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/137/console" [puppet] - 10https://gerrit.wikimedia.org/r/967474 
(https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [16:23:54] (03CR) 10Jbond: [V: 03+1] "unfortunately git lost the diff's however i have not changed the rmegre or format functions (other then to rename the later)" [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [16:38:35] (03CR) 10Subramanya Sastry: [C: 03+1] "I have left a comment headline nitpick for consideration. But, in any case, this is likely going to be deployed next week during a backpor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) (owner: 10C. Scott Ananian) [16:42:22] (03PS1) 10Vgutierrez: acme_chief: Disable proxy buffering on nginx [puppet] - 10https://gerrit.wikimedia.org/r/967477 (https://phabricator.wikimedia.org/T349384) [16:47:04] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10Unstewarded-production-error, 10Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007 (10aaron) That sounds consistent with the operations failing in one data-center... [16:54:13] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10netops, and 2 others: [Maintenance] Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10Ahoelzl) [16:55:18] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:57:28] (03PS1) 10Hashar: Add a json representation for each host [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967479 [17:09:14] (03CR) 10Santiago Faci: [C: 03+1] "It looks good!, media-analytics just passed the QA test so it's ready to be redeployed." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/967438 (https://phabricator.wikimedia.org/T347899) (owner: 10Hnowlan) [17:25:05] (03CR) 10Ssingh: [C: 03+1] acme_chief: Disable proxy buffering on nginx [puppet] - 10https://gerrit.wikimedia.org/r/967477 (https://phabricator.wikimedia.org/T349384) (owner: 10Vgutierrez) [17:28:23] (03PS1) 10Ebernhardson: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/967482 (https://phabricator.wikimedia.org/T347075) [17:28:50] (03PS4) 10Hashar: Add a json representation of the build [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 [17:28:52] (03PS2) 10Hashar: Add a json representation for each host [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967479 [17:28:54] (03PS1) 10Hashar: tox: flake8 exclude build and venv directories [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967483 [17:29:58] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/967482 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [17:30:52] (03Merged) 10jenkins-bot: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/967482 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [17:36:04] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:36:17] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:59:03] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (10Dwisehaupt) [17:59:20] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (10Dwisehaupt) [18:14:49] (03PS1) 10Bking: rdf-streaming-updater: use 
docker img flink-1.16.1-rdf-0.3.136 [deployment-charts] - 10https://gerrit.wikimedia.org/r/967485 (https://phabricator.wikimedia.org/T349147) [18:17:55] (03CR) 10Ebernhardson: [C: 03+1] rdf-streaming-updater: use docker img flink-1.16.1-rdf-0.3.136 [deployment-charts] - 10https://gerrit.wikimedia.org/r/967485 (https://phabricator.wikimedia.org/T349147) (owner: 10Bking) [18:25:31] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: use docker img flink-1.16.1-rdf-0.3.136 [deployment-charts] - 10https://gerrit.wikimedia.org/r/967485 (https://phabricator.wikimedia.org/T349147) (owner: 10Bking) [18:26:08] (03PS1) 10Hashar: Fix HTML index title and make titles concises [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967506 [18:37:01] (03PS1) 10Ebernhardson: cirrus updater: Ensure all output topics exist [deployment-charts] - 10https://gerrit.wikimedia.org/r/967507 (https://phabricator.wikimedia.org/T347075) [18:40:18] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Ensure all output topics exist [deployment-charts] - 10https://gerrit.wikimedia.org/r/967507 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [18:41:05] (03Merged) 10jenkins-bot: cirrus updater: Ensure all output topics exist [deployment-charts] - 10https://gerrit.wikimedia.org/r/967507 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [18:41:51] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [18:41:55] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:42:40] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [18:43:06] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:45:52] (03PS1) 10Ebernhardson: cirrus updater: Revert staging to expected output topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/967508 [18:46:55] 
(03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Revert staging to expected output topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/967508 (owner: 10Ebernhardson) [18:47:34] (03Merged) 10jenkins-bot: cirrus updater: Revert staging to expected output topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/967508 (owner: 10Ebernhardson) [18:49:21] (03PS1) 10Ebernhardson: cirrus updater: Output topic must be suffixed with version number [deployment-charts] - 10https://gerrit.wikimedia.org/r/967510 [18:53:11] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Output topic must be suffixed with version number [deployment-charts] - 10https://gerrit.wikimedia.org/r/967510 (owner: 10Ebernhardson) [18:53:38] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:53:58] (03Merged) 10jenkins-bot: cirrus updater: Output topic must be suffixed with version number [deployment-charts] - 10https://gerrit.wikimedia.org/r/967510 (owner: 10Ebernhardson) [18:56:46] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [18:57:05] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:00:15] (03PS1) 10Ebernhardson: cirrus updater: Create the codfw update topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/967513 [19:02:48] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Create the codfw update topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/967513 (owner: 10Ebernhardson) [19:03:31] (03Merged) 10jenkins-bot: cirrus updater: Create the codfw update topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/967513 (owner: 10Ebernhardson) [19:03:34] (03PS1) 10Jforrester: [Staging only] wikifunctions: Bump WASM 
evaluators again [deployment-charts] - 10https://gerrit.wikimedia.org/r/967514 [19:05:06] (03CR) 10Jforrester: [C: 03+2] [Staging only] wikifunctions: Bump WASM evaluators again [deployment-charts] - 10https://gerrit.wikimedia.org/r/967514 (owner: 10Jforrester) [19:05:06] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:05:12] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [19:05:18] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:05:50] (03PS1) 10Ebernhardson: cirrus updater: Repoint staging back to expected output topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/967515 [19:05:57] (03Merged) 10jenkins-bot: [Staging only] wikifunctions: Bump WASM evaluators again [deployment-charts] - 10https://gerrit.wikimedia.org/r/967514 (owner: 10Jforrester) [19:06:26] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [19:06:58] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:07:35] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:07:39] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:07:55] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [19:12:38] (03PS8) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [19:15:08] (03CR) 10CI reject: [V: 04-1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [19:20:21] (03PS9) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 
(https://phabricator.wikimedia.org/T343486) [19:22:53] (03CR) 10CI reject: [V: 04-1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [19:22:59] (03PS1) 10Dwisehaupt: Add dummy db password for community_civicrm [labs/private] - 10https://gerrit.wikimedia.org/r/967519 (https://phabricator.wikimedia.org/T343486) [19:28:51] (03PS2) 10Ebernhardson: cirrus updater: Repoint staging back to expected output topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/967515 [19:28:53] (03PS1) 10Ebernhardson: cirrus updater: Provide elasticsearch routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/967520 (https://phabricator.wikimedia.org/T347075) [19:29:45] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Repoint staging back to expected output topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/967515 (owner: 10Ebernhardson) [19:30:43] (03Merged) 10jenkins-bot: cirrus updater: Repoint staging back to expected output topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/967515 (owner: 10Ebernhardson) [19:32:20] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Provide elasticsearch routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/967520 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [19:33:07] (03Merged) 10jenkins-bot: cirrus updater: Provide elasticsearch routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/967520 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [19:33:49] (03PS10) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [19:35:01] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:35:37] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:36:24] 
(03CR) 10CI reject: [V: 04-1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [19:39:51] (03PS11) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [19:40:30] (03CR) 10Cwhite: "Based on my understanding, this will work against both ElasticSearch (7.x) and OpenSearch (1.x). See inline for note about OpenSearch 2.x" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [19:40:54] (03PS1) 10Ebernhardson: cirrus updater: http routes must not be suffixed with / [deployment-charts] - 10https://gerrit.wikimedia.org/r/967522 [19:41:28] (03PS1) 10Bking: relforge: Allow traffic from staging wikikube [puppet] - 10https://gerrit.wikimedia.org/r/967523 (https://phabricator.wikimedia.org/T347075) [19:42:02] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: http routes must not be suffixed with / [deployment-charts] - 10https://gerrit.wikimedia.org/r/967522 (owner: 10Ebernhardson) [19:42:38] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/967523 (https://phabricator.wikimedia.org/T347075) (owner: 10Bking) [19:42:44] (03Merged) 10jenkins-bot: cirrus updater: http routes must not be suffixed with / [deployment-charts] - 10https://gerrit.wikimedia.org/r/967522 (owner: 10Ebernhardson) [19:44:47] (03CR) 10Ebernhardson: [C: 03+1] relforge: Allow traffic from staging wikikube [puppet] - 10https://gerrit.wikimedia.org/r/967523 (https://phabricator.wikimedia.org/T347075) (owner: 10Bking) [19:44:58] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:45:09] (03CR) 10Bking: [C: 03+2] relforge: Allow traffic from staging wikikube [puppet] - 10https://gerrit.wikimedia.org/r/967523 
(https://phabricator.wikimedia.org/T347075) (owner: 10Bking) [19:45:11] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:02:32] (03PS1) 10Ebernhardson: cirrus updater: Disable the all-matching http route in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/967526 (https://phabricator.wikimedia.org/T347075) [20:04:16] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Disable the all-matching http route in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/967526 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [20:04:58] (03Merged) 10jenkins-bot: cirrus updater: Disable the all-matching http route in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/967526 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [20:05:14] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:05:33] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:07:16] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:18:29] (03PS1) 10Jforrester: [Staging only] wikifunctions: Bump WASM Py evaluator again [deployment-charts] - 10https://gerrit.wikimedia.org/r/967527 [20:18:48] (03CR) 10Jforrester: [C: 03+2] [Staging only] wikifunctions: Bump WASM Py evaluator again [deployment-charts] - 10https://gerrit.wikimedia.org/r/967527 (owner: 10Jforrester) [20:19:43] (03Merged) 10jenkins-bot: [Staging only] wikifunctions: Bump WASM Py evaluator again [deployment-charts] - 10https://gerrit.wikimedia.org/r/967527 (owner: 10Jforrester) [20:19:51] !log brion running requeueTranscodes.php on mwmaint2002 for audio and video transcode backfill, will use some jobqueue cpu but should be nicely 
throttled [20:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:12] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:20:51] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:21:28] PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:50] RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:37:40] (03PS1) 10Brion VIBBER: "Soft-launch" iOS-compatible HLS video transcodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967531 (https://phabricator.wikimedia.org/T68722) [20:55:18] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:56:44] (03CR) 10Brion VIBBER: "adding some pals as reviewers ;) no rush on this, friday is not ideal time :D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967531 (https://phabricator.wikimedia.org/T68722) (owner: 10Brion VIBBER) [21:03:45] (03PS1) 10Jforrester: [Staging only] wikifunctions: Switch Py WASM image to one with cache disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/967532 [21:04:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:28] (03CR) 10Jforrester: [C: 03+2] [Staging only] wikifunctions: Switch Py WASM image to one with cache disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/967532 (owner: 
10Jforrester) [21:05:16] (03Merged) 10jenkins-bot: [Staging only] wikifunctions: Switch Py WASM image to one with cache disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/967532 (owner: 10Jforrester) [21:05:18] PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:18] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:06:57] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:08:04] RECOVERY - Check systemd state on gitlab-runner2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:09:02] PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:11:52] RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:15:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:19:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:44] (03PS1) 10Ebernhardson: cirrus updater: point all three staging elastic clusters at relforge [deployment-charts] - 10https://gerrit.wikimedia.org/r/967535 [21:34:35] (03CR) 10Ebernhardson: [C: 03+2] 
cirrus updater: point all three staging elastic clusters at relforge [deployment-charts] - 10https://gerrit.wikimedia.org/r/967535 (owner: 10Ebernhardson) [21:35:26] (03Merged) 10jenkins-bot: cirrus updater: point all three staging elastic clusters at relforge [deployment-charts] - 10https://gerrit.wikimedia.org/r/967535 (owner: 10Ebernhardson) [21:38:55] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [21:39:26] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:06:23] (03PS2) 10Gergő Tisza: CentralAuth: Use second-level domain for cookies for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) [22:07:01] (03CR) 10Gergő Tisza: CentralAuth: Use second-level domain for cookies for Wikifunctions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [22:10:28] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (10Don-vip) Today's the first day I don't notice any error on large imports (~1000 files) so the iss... 
[22:44:22] (03CR) 10Gergő Tisza: Generalize Meta/Commons exceptions for CentralAuth cookie handling (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [22:53:38] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:20:03] 10SRE, 10Wikimedia-Site-requests, 10Wikimedia-maintenance-script-run, 10Wiktionary-fr: Run "refreshLinks.php --dfn-only" on all wikis periodically - https://phabricator.wikimedia.org/T18112 (10Pppery) [23:51:48] (03PS3) 10Krinkle: logging: Remove redundant setTimezone() call for UTC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963662 (https://phabricator.wikimedia.org/T99581) [23:51:51] (03PS9) 10Krinkle: logging: Remove useMicrosecondTimestamps(false) calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580096 (https://phabricator.wikimedia.org/T116550) [23:52:20] * Krinkle testing on mwdebug2002 [23:57:57] (03CR) 10Krinkle: [C: 03+2] logging: Remove redundant setTimezone() call for UTC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963662 (https://phabricator.wikimedia.org/T99581) (owner: 10Krinkle) [23:57:59] (03CR) 10Krinkle: [C: 03+2] logging: Remove useMicrosecondTimestamps(false) calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580096 (https://phabricator.wikimedia.org/T116550) (owner: 10Krinkle) [23:58:39] (03Merged) 10jenkins-bot: logging: Remove redundant setTimezone() call for UTC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963662 (https://phabricator.wikimedia.org/T99581) (owner: 10Krinkle) [23:58:43] (03Merged) 10jenkins-bot: logging: Remove useMicrosecondTimestamps(false) calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580096 
(https://phabricator.wikimedia.org/T116550) (owner: 10Krinkle)