[00:01:48] (PS1) Andrea Denisse: netmon: Put the netmon2002 as passive server [puppet] - https://gerrit.wikimedia.org/r/854625 [00:18:01] (PS2) Andrea Denisse: netmon: Put the netmon2002 as passive server [puppet] - https://gerrit.wikimedia.org/r/854625 [00:23:59] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [00:25:49] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [00:34:06] (PS3) Andrea Denisse: netmon: Put the netmon2002 as passive server [puppet] - https://gerrit.wikimedia.org/r/854625 [00:42:05] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:17:15] PROBLEM - SSH on mw1338.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:38:52] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:48:52] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:53:52] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:53:57] PROBLEM - High average POST 
latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [01:55:55] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [02:01:49] PROBLEM - SSH on mw1330.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:06:46] (PS5) Gergő Tisza: Add GrowthExperiments periodic maintenance scripts for user impact [puppet] - https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) [02:07:11] (PS6) Gergő Tisza: [WIP] Add GrowthExperiments periodic maintenance scripts for user impact [puppet] - https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) [02:08:52] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:45] PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:58:35] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: 
cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [03:01:01] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [03:02:03] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [03:18:33] RECOVERY - SSH on mw1338.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:19:13] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [03:20:47] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [03:27:49] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [03:43:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:46:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:52:48] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [04:02:40] RECOVERY - SSH on mw1330.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:15:48] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [04:17:44] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [04:34:32] RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:39:18] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:43:24] (Abandoned) Andrea Denisse: netmon: Add the netmon role to netmon2002 [puppet] - https://gerrit.wikimedia.org/r/854624 (owner: Andrea Denisse) [05:40:08] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:12:06] (PS4) Andrea Denisse: netmon: Put the netmon2002 as passive server [puppet] - https://gerrit.wikimedia.org/r/854625 [06:17:30] (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38034/console" [puppet] - https://gerrit.wikimedia.org/r/854625 (owner: Andrea Denisse) [06:22:44] (CR) Andrea Denisse: "Hello team, here are the PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/38034/" [puppet] - https://gerrit.wikimedia.org/r/854625 (owner: Andrea Denisse) [06:33:24] PROBLEM - Ganeti memory on ganeti1013 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (14505) = 25.6% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [06:41:47] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 11404 [06:42:20] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 11404 [06:44:43] (PS1) David Caro: labs: Add header and footer comment to avoid git conflicts [labs/private] - https://gerrit.wikimedia.org/r/854870 [06:53:52] (CR) Ayounsi: P:netbox::host: create a motd for the status (2 comments) [puppet] - https://gerrit.wikimedia.org/r/849508 (https://phabricator.wikimedia.org/T320696) (owner: Jbond) [06:55:36] (CR) Ayounsi: [C: +1] dns: silence log for decommissioned devices [software/netbox-extras] - https://gerrit.wikimedia.org/r/852806 (https://phabricator.wikimedia.org/T320721) (owner: Volans) [06:56:49] (CR) Ayounsi: [C: +2] Rename Telia to Arelion [homer/public] - https://gerrit.wikimedia.org/r/829558 (owner: Ayounsi) [06:56:53] (PS4) Abijeet Patro: Enable logging for UpdateMessageBundleJob [mediawiki-config] - https://gerrit.wikimedia.org/r/853357 (https://phabricator.wikimedia.org/T322430) [06:58:20] (PS5) Abijeet Patro: Add channel for MessageBundle feature of Translate extension [mediawiki-config] - https://gerrit.wikimedia.org/r/853357 (https://phabricator.wikimedia.org/T322430) [06:58:31] (CR) Abijeet Patro: Add channel for MessageBundle feature of Translate extension (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/853357 (https://phabricator.wikimedia.org/T322430) (owner: Abijeet Patro) [07:02:55] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 6774 [07:03:30] (PS1) Andrea Denisse: netmon: Remove netmon1002 from alertmanager API rw [puppet] - https://gerrit.wikimedia.org/r/854874 (https://phabricator.wikimedia.org/T322321) [07:04:13] !log ayounsi@cumin1001 END 
(PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6774 [07:04:21] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8220 [07:05:06] (PS1) Majavah: P:openstack: explicit rules for haproxy backend traffic POC [puppet] - https://gerrit.wikimedia.org/r/854875 [07:05:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8220 [07:06:02] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 61955 [07:06:53] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61955 [07:07:25] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 30990 [07:07:45] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38035/console" [puppet] - https://gerrit.wikimedia.org/r/854875 (owner: Majavah) [07:08:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 30990 [07:08:14] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 37613 [07:08:24] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37613 [07:08:25] (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38036/console" [puppet] - https://gerrit.wikimedia.org/r/854874 (https://phabricator.wikimedia.org/T322321) (owner: Andrea Denisse) [07:08:46] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 29169 [07:09:14] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 29169 [07:09:21] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15412 [07:10:14] !log 
ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15412 [07:10:25] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 3225 [07:10:42] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3225 [07:11:15] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 23889 [07:11:21] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 23889 [07:11:33] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 23889 [07:11:39] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 23889 [07:14:15] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 23889 [07:14:44] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 23889 [07:15:40] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 37693 [07:16:18] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37693 [07:16:25] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 54994 [07:17:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 54994 [07:18:51] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8309 [07:19:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8309 [07:19:42] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 29608 [07:20:18] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 29608 [07:20:35] !log ayounsi@cumin1001 START - Cookbook 
sre.network.peering with action 'email' for AS: 37662 [07:20:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37662 [07:21:13] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 37271 [07:21:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37271 [07:22:05] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8218 [07:22:33] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8218 [07:22:50] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 6461 [07:24:08] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6461 [07:26:06] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:27:34] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:33:33] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:36:19] <_joe_> uhm [07:36:24] <_joe_> lists down again? [07:41:11] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:41:57] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:53:40] (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/38036/" [puppet] - https://gerrit.wikimedia.org/r/854874 (https://phabricator.wikimedia.org/T322321) (owner: Andrea Denisse) [07:56:07] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48975 bytes in 0.760 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:58:48] (Abandoned) Hashar: python-build: reuse previously built wheels [docker-images/production-images] - https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) (owner: Hashar) [08:00:05] Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221109T0800). [08:00:05] phuedx and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:18] Hello o/ [08:00:18] (PS1) Elukey: toil::ganeti_ifupdown: fix systemctl path [puppet] - https://gerrit.wikimedia.org/r/854936 (https://phabricator.wikimedia.org/T273026) [08:01:25] (CR) Elukey: "ah no wait it is only on buster that systemctl is on /bin, sigh.. fixing" [puppet] - https://gerrit.wikimedia.org/r/854936 (https://phabricator.wikimedia.org/T273026) (owner: Elukey) [08:02:13] * kart_ is here [08:03:18] (CR) Muehlenhoff: "On Buster and later /bin is a symlink to /usr/bin, so the net effect should be the same. Or did you run into an actual issue here?" 
[puppet] - https://gerrit.wikimedia.org/r/854936 (https://phabricator.wikimedia.org/T273026) (owner: Elukey) [08:03:46] (PS1) Giuseppe Lavagetto: CI: only patch helmfile charts path if no version is pinned [deployment-charts] - https://gerrit.wikimedia.org/r/854943 [08:04:32] (PS5) Andrea Denisse: netmon: Put the netmon2002 as passive server [puppet] - https://gerrit.wikimedia.org/r/854625 (https://phabricator.wikimedia.org/T315523) [08:04:37] Anyone else around to deploy or should I go ahead with phuedx's patches? [08:05:18] phuedx: Can you rebase the first patch meanwhile? I'm not familiar with that config, so need to make sure rebase is correct :) [08:05:41] (CR) Elukey: toil::ganeti_ifupdown: fix systemctl path (1 comment) [puppet] - https://gerrit.wikimedia.org/r/854936 (https://phabricator.wikimedia.org/T273026) (owner: Elukey) [08:06:07] kart_: It would help if I linked to the correct patch :/ One sec! [08:06:13] (PS3) Phuedx: EditAttemptStep sampling rate to 1 for group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/854475 (https://phabricator.wikimedia.org/T312016) [08:06:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1018.eqiad.wmnet with OS bullseye [08:06:36] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1018.eqiad.wmnet with OS bullseye [08:06:43] (CR) CI reject: [V: -1] CI: only patch helmfile charts path if no version is pinned [deployment-charts] - https://gerrit.wikimedia.org/r/854943 (owner: Giuseppe Lavagetto) [08:07:09] kart_: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/854475 is rebased [08:07:15] I've updated the Deployments page on wt [08:07:21] jouncebot refresh [08:07:21] I refreshed my knowledge about deployments. 
[08:07:55] (PS6) Elukey: Upgrade to 1.15.3 [debs/istio] - https://gerrit.wikimedia.org/r/853936 (https://phabricator.wikimedia.org/T322193) [08:08:22] (CR) Elukey: Upgrade to 1.15.3 (1 comment) [debs/istio] - https://gerrit.wikimedia.org/r/853936 (https://phabricator.wikimedia.org/T322193) (owner: Elukey) [08:09:08] phuedx: OK. Checking.. [08:09:14] (CR) Elukey: [V: +2 C: +2] Upgrade to 1.15.3 [debs/istio] - https://gerrit.wikimedia.org/r/853936 (https://phabricator.wikimedia.org/T322193) (owner: Elukey) [08:10:27] phuedx: Deploying. Will ping once it is available to test on mwdebug. [08:10:54] (CR) Arturo Borrero Gonzalez: [C: +1] "if that's not enough, we can try later with more lines of #" [labs/private] - https://gerrit.wikimedia.org/r/854870 (owner: David Caro) [08:11:35] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/854475 (https://phabricator.wikimedia.org/T312016) (owner: Phuedx) [08:12:35] (CR) Filippo Giunchedi: [C: +1] "I'm ok to move this to /bin if it fixes serpens and seaborgium, which I'm assuming we dist-upgraded in place ? 
Hence the reason why /bin i" [puppet] - https://gerrit.wikimedia.org/r/854936 (https://phabricator.wikimedia.org/T273026) (owner: Elukey) [08:12:39] (Merged) jenkins-bot: EditAttemptStep sampling rate to 1 for group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/854475 (https://phabricator.wikimedia.org/T312016) (owner: Phuedx) [08:12:56] !log kartik@deploy1002 Started scap: Backport for [[gerrit:854475|EditAttemptStep sampling rate to 1 for group1 wikis (T312016)]] [08:13:00] T312016: Increase EditAttemptStep sampling rate(s) to 100% - https://phabricator.wikimedia.org/T312016 [08:13:18] !log kartik@deploy1002 kartik and phuedx: Backport for [[gerrit:854475|EditAttemptStep sampling rate to 1 for group1 wikis (T312016)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [08:14:00] phuedx: Available to test on mwdebug1002/2002/1001/2001 Please test. [08:14:10] (CR) Muehlenhoff: [C: +1] "Looks good. That is in fact a corner case we need to consider: So all buster systems which were installed with buster use the merged usr s" [puppet] - https://gerrit.wikimedia.org/r/854936 (https://phabricator.wikimedia.org/T273026) (owner: Elukey) [08:15:09] (CR) Filippo Giunchedi: [C: +1] grafana: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/854573 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [08:16:37] (CR) Muehlenhoff: [C: +2] grafana: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/854573 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [08:16:48] kart_: LGTM. Verified on testwiki, hewiki, and enwiki that the sampling rates are 1, 1, and 0 respectively. Thanks! [08:17:01] phuedx: cool. deploying.. 
[08:17:11] (CR) Filippo Giunchedi: [C: +1] netmon: Remove netmon1002 from alertmanager API rw [puppet] - https://gerrit.wikimedia.org/r/854874 (https://phabricator.wikimedia.org/T322321) (owner: Andrea Denisse) [08:18:02] PROBLEM - MariaDB Replica SQL: s2 #page on db1182 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table geo_tags is corrupt: try to repair it on query. Default database: zhwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:18:18] hello hello [08:18:55] o/ [08:18:58] (PS6) Phuedx: Update Metrics Platform streams [mediawiki-config] - https://gerrit.wikimedia.org/r/852838 (https://phabricator.wikimedia.org/T322277) [08:19:16] I'm tempted to run the depool replica runbook here [08:19:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:19:25] XioNoX: I was about to suggest [08:19:33] (CR) Filippo Giunchedi: "My understanding is that we should apply the netmon role to netmon2002 first, or e.g. profile::rancing will fail to rsync to the new passi" [puppet] - https://gerrit.wikimedia.org/r/854625 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [08:19:35] go for it [08:19:40] is it related to the current work kart_ ? [08:20:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:20:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:20:18] index for table geo_tags is corrupt: try to repair it on query. 
[08:20:22] doubtful [08:20:45] !log ayounsi@cumin1001 dbctl commit (dc=all): 'Depool db1182', diff saved to https://phabricator.wikimedia.org/P38772 and previous config saved to /var/cache/conftool/dbconfig/20221109-082045-ayounsi.json [08:20:46] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1018.eqiad.wmnet with reason: host reimage [08:21:07] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:854475|EditAttemptStep sampling rate to 1 for group1 wikis (T312016)]] (duration: 08m 10s) [08:21:11] T312016: Increase EditAttemptStep sampling rate(s) to 100% - https://phabricator.wikimedia.org/T312016 [08:21:12] pinging the dba in case they're around Amir1, marostegui [08:21:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:21:36] alright it's depooled, now what? [08:22:33] * Emperor got emailed, is here [08:22:40] phuedx: Over to next patch.. [08:23:10] (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for prometheus-ganeti-exporter [puppet] - https://gerrit.wikimedia.org/r/854578 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [08:23:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1018.eqiad.wmnet with reason: host reimage [08:23:36] <_joe_> uh I just got paged [08:23:40] <_joe_> !incidents [08:23:40] 3147 (ACKED) db1182 (paged)/MariaDB Replica SQL: s2 (paged) [08:23:41] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/852838 (https://phabricator.wikimedia.org/T322277) (owner: Phuedx) [08:23:42] (CR) Elukey: [C: +2] toil::ganeti_ifupdown: fix systemctl path [puppet] - https://gerrit.wikimedia.org/r/854936 (https://phabricator.wikimedia.org/T273026) (owner: Elukey) [08:23:43] .me too [08:23:44] XioNoX: I'm off today and not next to my computer [08:23:45] yeah I forgot to ack [08:23:50] you 
can just depooo it [08:23:55] and create a task [08:23:59] forgot to ack [08:23:59] marostegui: thx, go back to vacations [08:24:03] <_joe_> depoo is the best neologism [08:24:10] already depooled [08:24:12] XioNoX: I think once depooled, oh, what maros.tegui said quicker than me [08:24:15] <_joe_> depoo that server please [08:24:35] just downtime it for like 2 days and create a task [08:24:39] Emperor: thx! will the page resolve or we keep it as ACKed? [08:24:39] (Merged) jenkins-bot: Update Metrics Platform streams [mediawiki-config] - https://gerrit.wikimedia.org/r/852838 (https://phabricator.wikimedia.org/T322277) (owner: Phuedx) [08:24:39] so we can get to it [08:24:44] sounds good! [08:24:52] !log kartik@deploy1002 Started scap: Backport for [[gerrit:852838|Update Metrics Platform streams (T322277)]] [08:24:56] T322277: Generate Edit Attempt test data - https://phabricator.wikimedia.org/T322277 [08:24:56] XioNoX: good question :-/ [08:24:58] thx! [08:25:07] thanks XioNoX [08:25:12] !log kartik@deploy1002 kartik and phuedx: Backport for [[gerrit:852838|Update Metrics Platform streams (T322277)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [08:25:24] XioNoX: I think add downtime for a couple of days to make sure it doesn't p.age again [08:25:36] phuedx: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/852838 available to test on mwdebug [08:25:43] marostegui: enough communicate, you can go back to aviate and navigate [08:25:53] Emperor: yep, on it! 
thx [08:26:06] kart_: Looking [08:26:15] Amir.1 is I think in today, so I suspect he'll have a look when it reaches the right timezone :) [08:27:17] (Restored) Andrea Denisse: netmon: Add the netmon role to netmon2002 [puppet] - https://gerrit.wikimedia.org/r/854624 (owner: Andrea Denisse) [08:27:32] XioNoX: just looking at this and lacking scroll back but if the question was about Splunk I’d resolve it there, otherwise I think itll refire in 24h [08:28:16] kart_: LGTM. I double-checked configurations on testwiki and hewiki [08:29:14] phuedx: cool. Deploying.. [08:30:35] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1182.eqiad.wmnet with reason: paged then depooled [08:30:37] (PS1) Volans: netbox: restore 1D TTL on the dyna CNAME [dns] - https://gerrit.wikimedia.org/r/854945 (https://phabricator.wikimedia.org/T322700) [08:30:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1182.eqiad.wmnet with reason: paged then depooled [08:31:05] (PS1) Andrea Denisse: netmon: Add regex to match all the netmon instances. Bug: T315523 Change-Id: I4d1e56b42486bbaafc96248fd0d4871555e64d2d [puppet] - https://gerrit.wikimedia.org/r/854946 (https://phabricator.wikimedia.org/T315523) [08:31:15] (PS1) Elukey: aptrepo: add new component for istio 1.15.3 in bullseye-wikimedia [puppet] - https://gerrit.wikimedia.org/r/854947 (https://phabricator.wikimedia.org/T322193) [08:31:16] sobanski: thx, resolved! [08:31:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:31:31] https://phabricator.wikimedia.org/T322720 is ready for DBAs [08:31:40] (CR) CI reject: [V: -1] netmon: Add regex to match all the netmon instances. 
Bug: T315523 Change-Id: I4d1e56b42486bbaafc96248fd0d4871555e64d2d [puppet] - https://gerrit.wikimedia.org/r/854946 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [08:32:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:32:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:32:44] XioNoX: I just woke up with the page [08:32:49] (CR) Elukey: [C: +1] calico: Align formatting with k8s module and profiles [puppet] - https://gerrit.wikimedia.org/r/854562 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [08:33:09] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:852838|Update Metrics Platform streams (T322277)]] (duration: 08m 17s) [08:33:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:33:11] Amir1: good morning, apologies from letting it escalate to batphone [08:33:16] T322277: Generate Edit Attempt test data - https://phabricator.wikimedia.org/T322277 [08:33:16] I get to it soon. If it's depooled then we are good for now. Thanks [08:33:23] Oh no worries [08:33:24] phuedx: Done. [08:33:30] kart_: TYVM [08:33:32] * Emperor passes Amir.1 some coffee [08:34:00] (PS3) Andrea Denisse: netmon: Add regex to match all the netmon instances. Bug: T315523 [puppet] - https://gerrit.wikimedia.org/r/854624 (https://phabricator.wikimedia.org/T315523) [08:34:15] I have a full Chemex if you need any more coffee [08:34:30] abijeet: Our patch is next.. [08:34:35] (CR) CI reject: [V: -1] netmon: Add regex to match all the netmon instances. Bug: T315523 [puppet] - https://gerrit.wikimedia.org/r/854624 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [08:35:06] (Abandoned) Andrea Denisse: netmon: Add regex to match all the netmon instances. 
Bug: T315523 Change-Id: I4d1e56b42486bbaafc96248fd0d4871555e64d2d [puppet] - 10https://gerrit.wikimedia.org/r/854946 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [08:35:59] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hibashaath - https://phabricator.wikimedia.org/T322146 (10fgiunchedi) I have sent the temporary credentials via email following https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a_principa... [08:36:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10fgiunchedi) I have sent the temporary credentials via email following https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a_principal_fo... [08:36:13] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10fgiunchedi) 05Open→03Resolved [08:36:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10fgiunchedi) 05Open→03Resolved I have sent the temporary credentials via email following https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerb... 
[08:37:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853357 (https://phabricator.wikimedia.org/T322430) (owner: 10Abijeet Patro) [08:37:38] (03PS6) 10Abijeet Patro: Add channel for MessageBundle feature of Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853357 (https://phabricator.wikimedia.org/T322430) [08:38:24] Emperor: <3 [08:38:52] (03CR) 10TrainBranchBot: "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853357 (https://phabricator.wikimedia.org/T322430) (owner: 10Abijeet Patro) [08:39:24] (03PS4) 10Ayounsi: Add Peering News to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/849114 [08:39:36] (03Merged) 10jenkins-bot: Add channel for MessageBundle feature of Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853357 (https://phabricator.wikimedia.org/T322430) (owner: 10Abijeet Patro) [08:39:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1018.eqiad.wmnet with OS bullseye [08:39:48] !log kartik@deploy1002 Started scap: Backport for [[gerrit:853357|Add channel for MessageBundle feature of Translate extension (T322430)]] [08:39:48] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1018.eqiad.wmnet with OS bullseye completed: - ganeti1018 (**PASS**) - Downtimed on... 
[08:39:51] T322430: Message bundle groups not created for pages with translate-messagebundle page content model - https://phabricator.wikimedia.org/T322430 [08:40:08] !log kartik@deploy1002 kartik and abi: Backport for [[gerrit:853357|Add channel for MessageBundle feature of Translate extension (T322430)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [08:40:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1024.eqiad.wmnet with OS bullseye [08:40:34] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bullseye [08:40:41] (03PS4) 10Andrea Denisse: netmon: Add regex to match all the netmon instances. [puppet] - 10https://gerrit.wikimedia.org/r/854624 (https://phabricator.wikimedia.org/T315523) [08:41:21] (03CR) 10Ayounsi: Add Peering News to Puppet (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/849114 (owner: 10Ayounsi) [08:42:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [08:42:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [08:42:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [08:42:43] (03CR) 10Andrea Denisse: "PCC results: https://gerrit.wikimedia.org/r/#/c/854624/" [puppet] - 10https://gerrit.wikimedia.org/r/854624 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [08:42:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [08:42:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling 
db2105 (T322618)', diff saved to https://phabricator.wikimedia.org/P38773 and previous config saved to /var/cache/conftool/dbconfig/20221109-084254-ladsgroup.json [08:42:58] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [08:43:08] RECOVERY - Check systemd state on serpens is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:43:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:44:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:44:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:44:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [08:45:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [08:45:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:45:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:45:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:45:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T322618)', diff saved to https://phabricator.wikimedia.org/P38774 and previous config saved to /var/cache/conftool/dbconfig/20221109-084525-ladsgroup.json [08:48:22] (03CR) 10David Caro: [C: 03+2] labs: Add header and footer comment to avoid git conflicts (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/854870 (owner: 10David Caro) [08:48:40] (03CR) 10David Caro: [V: 03+2 C: 03+2] labs: Add header and 
footer comment to avoid git conflicts [labs/private] - 10https://gerrit.wikimedia.org/r/854870 (owner: 10David Caro) [08:48:49] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-usersfor David.pujol - https://phabricator.wikimedia.org/T322670 (10fgiunchedi) Hello @David.pujol ! I'll be processing this request. Overall looks good to me, though note that as a contractor we'll be adding you to `nda` group not `wmf`.... [08:49:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T322618)', diff saved to https://phabricator.wikimedia.org/P38775 and previous config saved to /var/cache/conftool/dbconfig/20221109-084934-ladsgroup.json [08:49:39] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [08:49:44] 10SRE, 10SRE-Access-Requests: Add new SSH key for Sam Smith - https://phabricator.wikimedia.org/T322723 (10phuedx) [08:50:02] 10SRE, 10SRE-Access-Requests: Add new SSH key for Sam Smith - https://phabricator.wikimedia.org/T322723 (10phuedx) [08:51:08] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:853357|Add channel for MessageBundle feature of Translate extension (T322430)]] (duration: 11m 19s) [08:51:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T322618)', diff saved to https://phabricator.wikimedia.org/P38776 and previous config saved to /var/cache/conftool/dbconfig/20221109-085109-ladsgroup.json [08:51:11] T322430: Message bundle groups not created for pages with translate-messagebundle page content model - https://phabricator.wikimedia.org/T322430 [08:51:59] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-usersfor David.pujol - https://phabricator.wikimedia.org/T322670 (10fgiunchedi) [08:52:44] RECOVERY - Check systemd state on seaborgium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[08:54:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [08:54:56] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1024.eqiad.wmnet with reason: host reimage [08:55:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance [08:55:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [08:55:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance [08:55:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:55:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:55:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T318605)', diff saved to https://phabricator.wikimedia.org/P38777 and previous config saved to /var/cache/conftool/dbconfig/20221109-085542-ladsgroup.json [08:55:47] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [08:56:08] (03CR) 10Ayounsi: [C: 04-1] "1st pass, not tested but seems fine overall, mostly naming changes required." 
[homer/public] - 10https://gerrit.wikimedia.org/r/854110 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [08:57:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1024.eqiad.wmnet with reason: host reimage [08:59:09] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 54994 [09:00:25] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] calico: Align formatting with k8s module and profiles [puppet] - 10https://gerrit.wikimedia.org/r/854562 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:01:12] (03PS1) 10Andrea Denisse: netmon: Open LibreNMS port for netmon2002. [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) [09:01:37] (03PS1) 10Filippo Giunchedi: admin: add dpujol to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/854952 (https://phabricator.wikimedia.org/T322670) [09:03:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 54994 [09:03:09] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38038/console" [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [09:03:09] PROBLEM - Ganeti memory on ganeti1013 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (14505) = 25.6% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [09:03:24] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-usersfor David.pujol - https://phabricator.wikimedia.org/T322670 (10fgiunchedi) I have copy/pasted the expiry dates from other @tmlt.io folks, please @Htriedman confirm I got that right on https://gerrit.wikimedia.or... 
[09:04:07] (03CR) 10Volans: "couple of questions inline" [puppet] - 10https://gerrit.wikimedia.org/r/849114 (owner: 10Ayounsi) [09:04:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P38778 and previous config saved to /var/cache/conftool/dbconfig/20221109-090441-ladsgroup.json [09:04:44] 10Puppet, 10Infrastructure-Foundations: Puppet failure on deploy-1004.devtools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T319681 (10hashar) 05Open→03Resolved a:03hashar https://gerrit.wikimedia.org/r/c/operations/puppet/+/844515 added `profile::mediawiki::scap_client::is_master: true` to... [09:06:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P38779 and previous config saved to /var/cache/conftool/dbconfig/20221109-090616-ladsgroup.json [09:07:18] RECOVERY - MariaDB Replica SQL: s2 #page on db1182 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:07:19] !log installing nodejs security updates [09:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:31] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Priority Backlog 📥): The python-build images regenerate wheels even when matching ones are already available - https://phabricator.wikimedia.org/T259611 (10hashar) 05Open→03Declined Most dependencies on Pypi... [09:12:33] (03CR) 10Vgutierrez: [C: 03+2] "Thanks John!" 
[puppet] - 10https://gerrit.wikimedia.org/r/854571 (owner: 10Jbond) [09:13:25] (03PS1) 10Filippo Giunchedi: admin: update ssh key for Sam Smith [puppet] - 10https://gerrit.wikimedia.org/r/854954 (https://phabricator.wikimedia.org/T322723) [09:13:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1024.eqiad.wmnet with OS bullseye [09:14:01] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bullseye completed: - ganeti1024 (**PASS**) - Downtimed on... [09:14:56] (03CR) 10Filippo Giunchedi: "Request confirmed on Meet" [puppet] - 10https://gerrit.wikimedia.org/r/854954 (https://phabricator.wikimedia.org/T322723) (owner: 10Filippo Giunchedi) [09:15:10] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1001/38038/" [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [09:18:13] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [09:19:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, I think applying netmon role to just-provisioned hosts is safe" [puppet] - 10https://gerrit.wikimedia.org/r/854624 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [09:19:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P38780 and previous config saved to /var/cache/conftool/dbconfig/20221109-091947-ladsgroup.json [09:21:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P38781 and previous config saved to 
/var/cache/conftool/dbconfig/20221109-092122-ladsgroup.json [09:21:40] (03CR) 10Vgutierrez: [C: 03+1] prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [09:22:27] (03PS2) 10Vgutierrez: trafficserver: 9.x upgrade: separate metric current_client_connections [puppet] - 10https://gerrit.wikimedia.org/r/803285 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [09:22:38] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add new SSH key for Sam Smith - https://phabricator.wikimedia.org/T322723 (10fgiunchedi) p:05Triage→03Medium [09:22:49] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-usersfor David.pujol - https://phabricator.wikimedia.org/T322670 (10fgiunchedi) p:05Triage→03Medium [09:23:42] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: 9.x upgrade: separate metric current_client_connections [puppet] - 10https://gerrit.wikimedia.org/r/803285 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [09:27:17] (03CR) 10Ayounsi: Add Peering News to Puppet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/849114 (owner: 10Ayounsi) [09:28:23] (03CR) 10Phuedx: [C: 03+1] "Thanks for getting to my request so quickly, Filippo!" [puppet] - 10https://gerrit.wikimedia.org/r/854954 (https://phabricator.wikimedia.org/T322723) (owner: 10Filippo Giunchedi) [09:30:48] (03CR) 10Filippo Giunchedi: [C: 03+2] "Sure np!" [puppet] - 10https://gerrit.wikimedia.org/r/854954 (https://phabricator.wikimedia.org/T322723) (owner: 10Filippo Giunchedi) [09:32:03] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add new SSH key for Sam Smith - https://phabricator.wikimedia.org/T322723 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi I've confirmed the request on Meet and patch is merged, new access will be live in the next 30 min. I'm resolving the task though f... 
[09:33:12] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] mwdebug: Final cleanup [puppet] - 10https://gerrit.wikimedia.org/r/854559 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [09:34:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T322618)', diff saved to https://phabricator.wikimedia.org/P38782 and previous config saved to /var/cache/conftool/dbconfig/20221109-093454-ladsgroup.json [09:34:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [09:34:59] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [09:35:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [09:35:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1018.eqiad.wmnet [09:36:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T322618)', diff saved to https://phabricator.wikimedia.org/P38783 and previous config saved to /var/cache/conftool/dbconfig/20221109-093629-ladsgroup.json [09:36:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [09:36:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [09:36:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T322618)', diff saved to https://phabricator.wikimedia.org/P38784 and previous config saved to /var/cache/conftool/dbconfig/20221109-093650-ladsgroup.json [09:37:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [09:37:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: 
Maintenance [09:37:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T322618)', diff saved to https://phabricator.wikimedia.org/P38785 and previous config saved to /var/cache/conftool/dbconfig/20221109-093751-ladsgroup.json [09:38:09] (03PS3) 10Muehlenhoff: hadoop: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840144 (https://phabricator.wikimedia.org/T308013) [09:40:26] 10SRE, 10User-Elukey: Investigate janitor, maintenance emails parser - https://phabricator.wikimedia.org/T230835 (10ayounsi) Another relevant presentation: [[ https://www.youtube.com/watch?v=2EekU76VMG4 | NANOG: Lifecycle Of Vendor Maintenances At Meta's Backbone ]] [09:43:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1018.eqiad.wmnet [09:45:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T322618)', diff saved to https://phabricator.wikimedia.org/P38786 and previous config saved to /var/cache/conftool/dbconfig/20221109-094506-ladsgroup.json [09:45:10] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [09:46:15] (03CR) 10Volans: "replies inline" [puppet] - 10https://gerrit.wikimedia.org/r/849114 (owner: 10Ayounsi) [09:46:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet [09:49:14] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Add regex to match all the netmon instances. 
[puppet] - 10https://gerrit.wikimedia.org/r/854624 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [09:49:56] (03PS3) 10Hashar: gerrit: remove gerrit-theme.js [puppet] - 10https://gerrit.wikimedia.org/r/853061 (https://phabricator.wikimedia.org/T319378) [09:53:35] (03PS1) 10Volans: sre.hosts.reimage: set Netbox to active [cookbooks] - 10https://gerrit.wikimedia.org/r/854961 (https://phabricator.wikimedia.org/T320696) [09:54:11] (03CR) 10Volans: "This can be merged anytime, as it doesn't depend on the migration of the status in Netbox" [cookbooks] - 10https://gerrit.wikimedia.org/r/854961 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans) [09:55:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet [09:58:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [10:00:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P38787 and previous config saved to /var/cache/conftool/dbconfig/20221109-100013-ladsgroup.json [10:01:33] (03CR) 10Giuseppe Lavagetto: [C: 03+1] eventgate: Fix canary release routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/854507 (owner: 10Clément Goubert) [10:02:09] !log set Netbox status to Active for 299 devices with role=server, tenant=none, status=staged - T320696 [10:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:13] T320696: Reduce the count of Netbox devices with incorrect status - https://phabricator.wikimedia.org/T320696 [10:02:59] PROBLEM - Ganeti memory on ganeti1013 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (14505) = 25.6% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [10:03:51] ^ these warnings are expected and will go away soonish, nodes are being reimaged and when re-added the instance count gets reshuffled [10:04:41] PROBLEM - librenms.wikimedia.org requires authentication on netmon2002 is CRITICAL: connect to address 208.80.153.9 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:06:18] (03PS2) 10JMeybohm: CI: only patch helmfile charts path if no version is pinned [deployment-charts] - 10https://gerrit.wikimedia.org/r/854943 (owner: 10Giuseppe Lavagetto) [10:06:20] (03PS1) 10JMeybohm: cfssl-issuer-crds: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/854966 [10:06:22] (03PS1) 10JMeybohm: admin_ng: Bump wmf-stable/cfssl-issuer-crds pinning [deployment-charts] - 10https://gerrit.wikimedia.org/r/854967 [10:06:24] (03PS1) 10JMeybohm: CI: Ensure the wmf-stable helm repo exists [deployment-charts] - 10https://gerrit.wikimedia.org/r/854968 [10:07:09] PROBLEM - librenms.wikimedia.org tls expiry on netmon2002 is CRITICAL: connect to address 208.80.153.9 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:07:16] (03PS1) 10Volans: Remove obsolete tool script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854969 [10:07:18] (03PS1) 10Volans: Netbox statuses: no more servers in staged [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854970 (https://phabricator.wikimedia.org/T320696) [10:07:52] (03PS11) 10Elukey: Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981) [10:07:54] (03PS13) 10Elukey: centrallog: add first prototype of webrequest-live with Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) [10:11:37] (03CR) 10CI reject: [V: 04-1] CI: only patch helmfile 
charts path if no version is pinned [deployment-charts] - 10https://gerrit.wikimedia.org/r/854943 (owner: 10Giuseppe Lavagetto) [10:11:42] (03PS2) 10JMeybohm: cfssl-issuer-crds: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/854966 (https://phabricator.wikimedia.org/T306165) [10:11:44] (03PS2) 10JMeybohm: admin_ng: Bump wmf-stable/cfssl-issuer-crds pinning [deployment-charts] - 10https://gerrit.wikimedia.org/r/854967 [10:11:46] (03PS2) 10JMeybohm: CI: Ensure the wmf-stable helm repo exists [deployment-charts] - 10https://gerrit.wikimedia.org/r/854968 [10:11:48] (03PS3) 10JMeybohm: CI: only patch helmfile charts path if no version is pinned [deployment-charts] - 10https://gerrit.wikimedia.org/r/854943 (owner: 10Giuseppe Lavagetto) [10:12:15] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38040/console" [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [10:14:29] RECOVERY - librenms.wikimedia.org requires authentication on netmon2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 661 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:14:53] RECOVERY - librenms.wikimedia.org tls expiry on netmon2002 is OK: OK - Certificate librenms.wikimedia.org will expire on Sun 29 Jan 2023 01:19:54 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:15:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P38788 and previous config saved to /var/cache/conftool/dbconfig/20221109-101519-ladsgroup.json [10:15:57] (03PS12) 10Elukey: Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981) [10:15:59] (03PS14) 10Elukey: centrallog: add first prototype of webrequest-live with Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) [10:16:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1018.eqiad.wmnet to cluster eqiad and group B [10:16:29] (03CR) 10jenkins-bot: CI: only patch helmfile charts path if no version is pinned [deployment-charts] - 10https://gerrit.wikimedia.org/r/854943 (owner: 10Giuseppe Lavagetto) [10:17:12] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38042/console" [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [10:17:33] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1001.eqiad.wmnet [10:17:42] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1018.eqiad.wmnet to cluster eqiad and group B [10:22:00] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1001.eqiad.wmnet [10:23:51] (03CR) 10Clément Goubert: eventgate: Fix canary release routing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/854507 (owner: 10Clément Goubert) [10:24:01] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Remove netmon1002 from alertmanager API rw [puppet] - 10https://gerrit.wikimedia.org/r/854874 
(https://phabricator.wikimedia.org/T322321) (owner: 10Andrea Denisse) [10:26:27] PROBLEM - Ganeti memory on ganeti1013 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (14505) = 25.6% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [10:30:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T322618)', diff saved to https://phabricator.wikimedia.org/P38791 and previous config saved to /var/cache/conftool/dbconfig/20221109-103026-ladsgroup.json [10:30:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [10:30:31] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [10:30:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [10:30:43] (03PS1) 10Andrea Denisse: netmon: Add netmon2002 to the alertmanager rw api [puppet] - 10https://gerrit.wikimedia.org/r/854974 (https://phabricator.wikimedia.org/T315523) [10:32:54] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38043/console" [puppet] - 10https://gerrit.wikimedia.org/r/854974 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [10:34:01] (03CR) 10Andrea Denisse: [V: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/38043/" [puppet] - 10https://gerrit.wikimedia.org/r/854974 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [10:35:01] (03CR) 10JMeybohm: [C: 03+2] cfssl-issuer-crds: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/854966 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [10:35:25] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Bump wmf-stable/cfssl-issuer-crds pinning [deployment-charts] - 10https://gerrit.wikimedia.org/r/854967 (owner: 
10JMeybohm) [10:37:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [10:37:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [10:37:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T322618)', diff saved to https://phabricator.wikimedia.org/P38792 and previous config saved to /var/cache/conftool/dbconfig/20221109-103722-ladsgroup.json [10:37:26] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [10:37:54] (03PS1) 10Giuseppe Lavagetto: deploy-mwdebug: also update releases files for all other deployments [puppet] - 10https://gerrit.wikimedia.org/r/854975 [10:38:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T322618)', diff saved to https://phabricator.wikimedia.org/P38793 and previous config saved to /var/cache/conftool/dbconfig/20221109-103806-ladsgroup.json [10:38:28] (03Merged) 10jenkins-bot: cfssl-issuer-crds: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/854966 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [10:39:02] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [10:39:08] (03Merged) 10jenkins-bot: admin_ng: Bump wmf-stable/cfssl-issuer-crds pinning [deployment-charts] - 10https://gerrit.wikimedia.org/r/854967 (owner: 10JMeybohm) [10:40:02] (03CR) 10CI reject: [V: 04-1] deploy-mwdebug: also update releases files for all other 
deployments [puppet] - 10https://gerrit.wikimedia.org/r/854975 (owner: 10Giuseppe Lavagetto) [10:40:12] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [10:45:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T322618)', diff saved to https://phabricator.wikimedia.org/P38794 and previous config saved to /var/cache/conftool/dbconfig/20221109-104548-ladsgroup.json [10:45:53] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [10:52:39] (03CR) 10Filippo Giunchedi: "This patch removes 2001 too, I _think_ that's fine as long as 2002 is put in service while 2001 is the inactive server, IMHO best to add n" [puppet] - 10https://gerrit.wikimedia.org/r/854974 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [10:53:09] (03CR) 10Filippo Giunchedi: [C: 03+2] gerrit: remove gerrit-theme.js [puppet] - 10https://gerrit.wikimedia.org/r/853061 (https://phabricator.wikimedia.org/T319378) (owner: 10Hashar) [10:53:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P38796 and previous config saved to /var/cache/conftool/dbconfig/20221109-105313-ladsgroup.json [10:54:14] (03PS2) 10Hnowlan: Make swift connector aware of cacert file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/854563 (https://phabricator.wikimedia.org/T312104) [10:56:44] RECOVERY - Ganeti memory on ganeti1013 is OK: OK Memory 89% used https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [10:58:03] 10SRE, 10Ganeti, 10Infrastructure-Foundations: 
Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [10:58:12] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/854943 (owner: 10Giuseppe Lavagetto) [10:58:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1024.eqiad.wmnet to cluster eqiad and group C [10:59:52] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1024.eqiad.wmnet to cluster eqiad and group C [11:00:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P38797 and previous config saved to /var/cache/conftool/dbconfig/20221109-110055-ladsgroup.json [11:04:16] (03CR) 10JMeybohm: [C: 03+2] CI: only patch helmfile charts path if no version is pinned [deployment-charts] - 10https://gerrit.wikimedia.org/r/854943 (owner: 10Giuseppe Lavagetto) [11:04:19] (03CR) 10JMeybohm: [C: 03+2] CI: Ensure the wmf-stable helm repo exists [deployment-charts] - 10https://gerrit.wikimedia.org/r/854968 (owner: 10JMeybohm) [11:04:40] (03CR) 10Hashar: [C: 03+2] "The gerrit-theme.js file is gone from Puppet ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/853061 ) and would have to be manually" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/853052 (https://phabricator.wikimedia.org/T319378) (owner: 10Hashar) [11:04:46] (03CR) 10Hashar: [C: 03+2] Move test result table to a standalone plugin [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/853056 (https://phabricator.wikimedia.org/T319378) (owner: 10Hashar) [11:04:53] (03CR) 10Hashar: [C: 03+2] Move custom CSS style to a standalone plugin [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/853057 (https://phabricator.wikimedia.org/T319378) (owner: 10Hashar) [11:05:01] (03CR) 10Hashar: [C: 03+2] Move custom links to a standalone plugin [software/gerrit] 
(deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/853058 (https://phabricator.wikimedia.org/T319378) (owner: 10Hashar) [11:05:32] (03Merged) 10jenkins-bot: Import gerrit-theme.js history from Puppet [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/853052 (https://phabricator.wikimedia.org/T319378) (owner: 10Hashar) [11:05:51] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-codfw [11:05:52] (03Merged) 10jenkins-bot: Move test result table to a standalone plugin [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/853056 (https://phabricator.wikimedia.org/T319378) (owner: 10Hashar) [11:05:58] (03Merged) 10jenkins-bot: Move custom CSS style to a standalone plugin [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/853057 (https://phabricator.wikimedia.org/T319378) (owner: 10Hashar) [11:06:04] (03Merged) 10jenkins-bot: Move custom links to a standalone plugin [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/853058 (https://phabricator.wikimedia.org/T319378) (owner: 10Hashar) [11:06:08] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [11:07:23] (03CR) 10Btullis: [C: 03+1] eventgate: Fix canary release routing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/854507 (owner: 10Clément Goubert) [11:08:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P38798 and previous config saved to /var/cache/conftool/dbconfig/20221109-110819-ladsgroup.json [11:08:36] (03Merged) 10jenkins-bot: CI: Ensure the wmf-stable helm repo exists [deployment-charts] - 10https://gerrit.wikimedia.org/r/854968 (owner: 10JMeybohm) [11:09:03] (03CR) 10Gmodena: eventgate: Fix canary release routing (031 comment) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/854507 (owner: 10Clément Goubert) [11:09:11] (03CR) 10Hashar: "recheck after https://gerrit.wikimedia.org/r/c/integration/config/+/853305" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/853306 (https://phabricator.wikimedia.org/T319378) (owner: 10Hashar) [11:09:20] (03Merged) 10jenkins-bot: CI: only patch helmfile charts path if no version is pinned [deployment-charts] - 10https://gerrit.wikimedia.org/r/854943 (owner: 10Giuseppe Lavagetto) [11:09:56] (03CR) 10Clément Goubert: eventgate: Fix canary release routing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/854507 (owner: 10Clément Goubert) [11:15:36] (03CR) 10Hnowlan: [C: 03+2] Make swift connector aware of cacert file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/854563 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [11:16:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P38799 and previous config saved to /var/cache/conftool/dbconfig/20221109-111601-ladsgroup.json [11:16:31] (03CR) 10Hashar: "Some previous change has moved our Gerrit UI customizations to several ./plugins/*.js file and I felt like I could use eslint for them." 
[software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/853306 (https://phabricator.wikimedia.org/T319378) (owner: 10Hashar) [11:17:08] (03PS2) 10JMeybohm: cfssl-issuer: Move from single to multiple files for CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/838135 (https://phabricator.wikimedia.org/T310486) [11:17:10] (03PS2) 10JMeybohm: cfssl-issuer: Bump CRD chart version for cfssl-issuer update [deployment-charts] - 10https://gerrit.wikimedia.org/r/838136 [11:17:12] (03PS2) 10JMeybohm: cfssl-issuer: Bump version and fix dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/838137 (https://phabricator.wikimedia.org/T310486) [11:18:01] (03PS4) 10JMeybohm: calico: More calico 3.23.3 additions. [deployment-charts] - 10https://gerrit.wikimedia.org/r/854520 (https://phabricator.wikimedia.org/T307943) [11:19:39] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [11:20:27] (03Merged) 10jenkins-bot: Make swift connector aware of cacert file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/854563 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [11:21:32] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1002.eqiad.wmnet [11:23:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T322618)', diff saved to https://phabricator.wikimedia.org/P38800 and previous config saved to /var/cache/conftool/dbconfig/20221109-112326-ladsgroup.json [11:23:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:23:30] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [11:23:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: 
Maintenance [11:23:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T322618)', diff saved to https://phabricator.wikimedia.org/P38801 and previous config saved to /var/cache/conftool/dbconfig/20221109-112347-ladsgroup.json [11:25:58] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1002.eqiad.wmnet [11:28:08] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-etcd1001.eqiad.wmnet [11:28:22] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host dse-k8s-etcd1001.eqiad.wmnet [11:28:59] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-etcd1001.eqiad.wmnet [11:29:51] (03Abandoned) 10Hnowlan: postgres: add option to enable replication slots [puppet] - 10https://gerrit.wikimedia.org/r/755959 (https://phabricator.wikimedia.org/T290149) (owner: 10Hnowlan) [11:31:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T322618)', diff saved to https://phabricator.wikimedia.org/P38802 and previous config saved to /var/cache/conftool/dbconfig/20221109-113108-ladsgroup.json [11:31:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [11:31:13] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [11:31:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [11:31:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [11:31:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [11:31:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
db2094.codfw.wmnet with reason: Maintenance [11:31:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T322618)', diff saved to https://phabricator.wikimedia.org/P38803 and previous config saved to /var/cache/conftool/dbconfig/20221109-113144-ladsgroup.json [11:32:40] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-etcd1001.eqiad.wmnet [11:33:24] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-eqiad [11:34:16] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1003.eqiad.wmnet [11:34:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:36:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [11:38:41] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1003.eqiad.wmnet [11:39:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T322618)', diff saved to https://phabricator.wikimedia.org/P38804 and previous config saved to /var/cache/conftool/dbconfig/20221109-113948-ladsgroup.json [11:39:52] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [11:40:13] (03PS1) 10Alexandros Kosiaris: arclamp: Add role contact information [puppet] - 10https://gerrit.wikimedia.org/r/854985 [11:43:46] (03CR) 10Muehlenhoff: arclamp: Add role contact information (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854985 (owner: 10Alexandros Kosiaris) [11:44:31] (03PS2) 10Clément Goubert: deploy-mwdebug: also update releases files for all other deployments 
[puppet] - 10https://gerrit.wikimedia.org/r/854975 (owner: 10Giuseppe Lavagetto) [11:44:44] (03PS1) 10Muehlenhoff: sre.maps.roll-restart: Also restart nginx [cookbooks] - 10https://gerrit.wikimedia.org/r/854987 [11:46:58] (03CR) 10Filippo Giunchedi: centrallog: add first prototype of webrequest-live with Benthos (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [11:48:16] (03CR) 10Filippo Giunchedi: "See inline. Also no file here has a newline at the end, which it should" [puppet] - 10https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [11:48:52] (03CR) 10Clément Goubert: [C: 03+2] eventgate: Fix canary release routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/854507 (owner: 10Clément Goubert) [11:49:12] (03CR) 10Filippo Giunchedi: "Also I forgot to add: there's no newline at the end of the files, though there should be I think" [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [11:50:59] (03CR) 10Filippo Giunchedi: centrallog: add first prototype of webrequest-live with Benthos (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [11:51:18] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:53:44] (03Merged) 10jenkins-bot: eventgate: Fix canary release routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/854507 (owner: 10Clément Goubert) [11:54:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to 
https://phabricator.wikimedia.org/P38805 and previous config saved to /var/cache/conftool/dbconfig/20221109-115454-ladsgroup.json [11:56:03] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [11:56:45] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [11:57:54] (03CR) 10Hnowlan: [C: 03+1] sre.maps.roll-restart: Also restart nginx [cookbooks] - 10https://gerrit.wikimedia.org/r/854987 (owner: 10Muehlenhoff) [12:00:07] (03CR) 10Muehlenhoff: [C: 03+2] sre.maps.roll-restart: Also restart nginx [cookbooks] - 10https://gerrit.wikimedia.org/r/854987 (owner: 10Muehlenhoff) [12:01:12] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [12:03:04] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [12:03:50] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [12:05:02] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [12:07:00] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [12:10:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P38806 and previous config saved to /var/cache/conftool/dbconfig/20221109-121001-ladsgroup.json [12:10:36] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [12:11:11] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [12:14:04] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) The install of the updated trafficserver package for bullseye fails on a bullseye host with the message: ` The following packages have unmet dependencies: trafficserver : Depends:... 
[12:14:18] (03PS1) 10Hnowlan: thumbor: version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/854990 [12:16:34] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/854991 (https://phabricator.wikimedia.org/T135991) [12:16:35] !log hashar@deploy1002 Started deploy [gerrit/gerrit@b83625a]: gerrit2002: Gerrit JavaScript plugins as standalone files # T319378 [12:16:38] (03PS1) 10Alexandros Kosiaris: DNM: utils: Add a role_team_stats.py script [puppet] - 10https://gerrit.wikimedia.org/r/854992 [12:16:39] T319378: Move Gerrit Javascript plugins from gerrit-theme.js to standalone files in the deploy repository - https://phabricator.wikimedia.org/T319378 [12:16:45] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@b83625a]: gerrit2002: Gerrit JavaScript plugins as standalone files # T319378 (duration: 00m 10s) [12:17:27] (03CR) 10CI reject: [V: 04-1] DNM: utils: Add a role_team_stats.py script [puppet] - 10https://gerrit.wikimedia.org/r/854992 (owner: 10Alexandros Kosiaris) [12:17:45] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [12:18:17] (03PS2) 10Alexandros Kosiaris: arclamp: Add role contact information [puppet] - 10https://gerrit.wikimedia.org/r/854985 [12:18:18] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [12:20:19] (03CR) 10Alexandros Kosiaris: arclamp: Add role contact information (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854985 (owner: 10Alexandros Kosiaris) [12:22:21] (03CR) 10Hnowlan: [C: 03+2] thumbor: version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/854990 (owner: 10Hnowlan) [12:23:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/854555 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [12:24:04] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, 
WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:24:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T322618)', diff saved to https://phabricator.wikimedia.org/P38807 and previous config saved to /var/cache/conftool/dbconfig/20221109-122403-ladsgroup.json [12:24:08] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [12:24:19] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/854575 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:24:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good! I think the same case can be made for Xenon (role::webperf::processors_and_site), but could also be a separate patch." [puppet] - 10https://gerrit.wikimedia.org/r/854985 (owner: 10Alexandros Kosiaris) [12:24:53] jouncebot: now [12:24:53] No deployments scheduled for the next 1 hour(s) and 35 minute(s) [12:25:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T322618)', diff saved to https://phabricator.wikimedia.org/P38808 and previous config saved to /var/cache/conftool/dbconfig/20221109-122507-ladsgroup.json [12:25:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [12:25:10] I am going to deploy a change to Gerrit to move our JS plugins to standalone files [12:25:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [12:25:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T322618)', diff saved to https://phabricator.wikimedia.org/P38809 and previous config saved to /var/cache/conftool/dbconfig/20221109-122528-ladsgroup.json [12:25:52] !log hashar@deploy1002 Started deploy [gerrit/gerrit@b83625a]: gerrit1001: Gerrit JavaScript plugins as standalone files # T319378 [12:25:56] T319378: 
Move Gerrit Javascript plugins from gerrit-theme.js to standalone files in the deploy repository - https://phabricator.wikimedia.org/T319378 [12:26:00] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [12:26:01] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@b83625a]: gerrit1001: Gerrit JavaScript plugins as standalone files # T319378 (duration: 00m 09s) [12:26:24] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [12:26:47] (03Merged) 10jenkins-bot: thumbor: version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/854990 (owner: 10Hnowlan) [12:26:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/854961 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans) [12:28:02] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:28:22] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854969 (owner: 10Volans) [12:28:58] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:29:28] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854970 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans) [12:30:49] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:31:04] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:31:27] (03PS2) 10Muehlenhoff: Add SPDX headers to various IF profiles [puppet] - 10https://gerrit.wikimedia.org/r/854575 (https://phabricator.wikimedia.org/T308013) [12:33:17] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [12:33:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T322618)', diff saved to https://phabricator.wikimedia.org/P38810 and previous config saved to 
/var/cache/conftool/dbconfig/20221109-123344-ladsgroup.json [12:33:48] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [12:33:52] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [12:39:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P38811 and previous config saved to /var/cache/conftool/dbconfig/20221109-123910-ladsgroup.json [12:39:26] (03CR) 10Volans: "Some inline comments to simplify the code and make it a bit more modern ;)" [puppet] - 10https://gerrit.wikimedia.org/r/854992 (owner: 10Alexandros Kosiaris) [12:42:05] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10MoritzMuehlenhoff) >>! In T321309#8382745, @ssingh wrote: > I am not aware of the reasons why we build with `BACKPORT=yes` but just to confirm that there are no other differences: Probably... 
[12:42:39] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [12:43:20] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [12:45:06] (03CR) 10Ayounsi: [C: 03+1] sre.hosts.reimage: set Netbox to active [cookbooks] - 10https://gerrit.wikimedia.org/r/854961 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans) [12:45:22] (03CR) 10Ayounsi: [C: 03+1] Remove obsolete tool script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854969 (owner: 10Volans) [12:48:31] (03CR) 10Ayounsi: [C: 03+1] Netbox statuses: no more servers in staged [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854970 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans) [12:48:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P38812 and previous config saved to /var/cache/conftool/dbconfig/20221109-124850-ladsgroup.json [12:48:55] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 29169 [12:49:40] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) >>! In T321309#8382869, @MoritzMuehlenhoff wrote: >>>! In T321309#8382745, @ssingh wrote: >> I am not aware of the reasons why we build with `BACKPORT=yes` but just to confirm that t... 
[12:49:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 29169 [12:50:32] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [12:50:51] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 63199 [12:51:13] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [12:54:14] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 63199 [12:54:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P38813 and previous config saved to /var/cache/conftool/dbconfig/20221109-125416-ladsgroup.json [12:55:23] (03CR) 10Muehlenhoff: [C: 03+2] Add SPDX headers to various IF profiles [puppet] - 10https://gerrit.wikimedia.org/r/854575 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:55:27] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [12:57:28] (03CR) 10Clément Goubert: [C: 03+2] "All deployed, your canary releases now get traffic." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/854507 (owner: 10Clément Goubert) [13:03:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P38814 and previous config saved to /var/cache/conftool/dbconfig/20221109-130357-ladsgroup.json [13:05:54] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 169 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:07:52] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:09:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T322618)', diff saved to https://phabricator.wikimedia.org/P38815 and previous config saved to /var/cache/conftool/dbconfig/20221109-130923-ladsgroup.json [13:09:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:09:28] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [13:09:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:09:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T322618)', diff saved to https://phabricator.wikimedia.org/P38816 and previous config saved to /var/cache/conftool/dbconfig/20221109-130944-ladsgroup.json [13:13:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1004.eqiad.wmnet [13:13:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1175 (T322618)', diff saved to https://phabricator.wikimedia.org/P38817 and previous config saved to /var/cache/conftool/dbconfig/20221109-131351-ladsgroup.json [13:17:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1004.eqiad.wmnet [13:18:48] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:19:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T322618)', diff saved to https://phabricator.wikimedia.org/P38818 and previous config saved to /var/cache/conftool/dbconfig/20221109-131903-ladsgroup.json [13:19:08] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [13:22:06] PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:24:01] !log drain ganeti1013 for eventual reimage to bullseye T311687 [13:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:05] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [13:28:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P38820 and previous config saved to /var/cache/conftool/dbconfig/20221109-132858-ladsgroup.json [13:44:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P38821 and previous config saved to /var/cache/conftool/dbconfig/20221109-134404-ladsgroup.json [13:50:09] 
(03PS1) 10Filippo Giunchedi: clinic-duty: update Lumen regex and tests [software] - 10https://gerrit.wikimedia.org/r/854996 [13:51:40] if anyone is up for an easy one and/or has used ops-maint-gcal.js [13:56:59] (03PS1) 10Muehlenhoff: Sync to 6.6.2 of the CAS overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/854998 [13:58:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:59:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T322618)', diff saved to https://phabricator.wikimedia.org/P38822 and previous config saved to /var/cache/conftool/dbconfig/20221109-135911-ladsgroup.json [13:59:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [13:59:15] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [13:59:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [13:59:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T322618)', diff saved to https://phabricator.wikimedia.org/P38823 and previous config saved to /var/cache/conftool/dbconfig/20221109-135943-ladsgroup.json [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221109T1400). [14:00:04] WMDE-Fisch: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:23] o/ [14:01:32] I'm in a meeting. 
So if anyone can do the deployment for me would be nice :-) [14:01:47] o/ [14:01:52] I can deploy, I think [14:02:23] WMDE-Fisch: would you be able to test your change on mwdebug? [14:02:30] Yes [14:02:34] ok [14:02:42] (03PS2) 10Lucas Werkmeister (WMDE): Enable show nearby feature for ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854513 (https://phabricator.wikimedia.org/T321548) (owner: 10Svantje Lilienthal) [14:02:45] Prepared a tab for it ;-) [14:03:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854513 (https://phabricator.wikimedia.org/T321548) (owner: 10Svantje Lilienthal) [14:03:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T322618)', diff saved to https://phabricator.wikimedia.org/P38824 and previous config saved to /var/cache/conftool/dbconfig/20221109-140351-ladsgroup.json [14:03:52] (03Merged) 10jenkins-bot: Enable show nearby feature for ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854513 (https://phabricator.wikimedia.org/T321548) (owner: 10Svantje Lilienthal) [14:04:00] PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:04:06] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:854513|Enable show nearby feature for ruwiki (T321548)]] [14:04:09] T321548: Deploy Show Nearby feature to ruwiki - https://phabricator.wikimedia.org/T321548 [14:04:26] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and lilients: Backport for [[gerrit:854513|Enable show nearby feature for ruwiki (T321548)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [14:04:36] WMDE-Fisch: ^ [14:05:39] Lucas_WMDE: Works like a charm. Thanks. Go on! [14:05:45] ok! 
[14:06:20] (PS2) Muehlenhoff: Sync to 6.6.2 of the CAS overlay [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/854998 [14:08:18] (PS13) Elukey: Add a basic puppetization for Benthos [puppet] - https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981) [14:08:20] (PS15) Elukey: centrallog: add first prototype of webrequest-live with Benthos [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) [14:08:24] (CR) Elukey: Add a basic puppetization for Benthos (6 comments) [puppet] - https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981) (owner: Elukey) [14:08:28] (CR) Elukey: centrallog: add first prototype of webrequest-live with Benthos (5 comments) [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: Elukey) [14:09:49] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:854513|Enable show nearby feature for ruwiki (T321548)]] (duration: 05m 42s) [14:09:53] T321548: Deploy Show Nearby feature to ruwiki - https://phabricator.wikimedia.org/T321548 [14:09:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:10:07] anything else to deploy? [14:11:04] !log UTC afternoon backport+config window done [14:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:12:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:13:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:16:03] (CR) Elukey: [C: +1] calico: More calico 3.23.3 additions. 
[deployment-charts] - https://gerrit.wikimedia.org/r/854520 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [14:18:32] (CR) JMeybohm: [C: +2] calico: More calico 3.23.3 additions. [deployment-charts] - https://gerrit.wikimedia.org/r/854520 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [14:18:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P38825 and previous config saved to /var/cache/conftool/dbconfig/20221109-141858-ladsgroup.json [14:21:19] (CR) Giuseppe Lavagetto: [C: +2] vopsbot: always restart the service via systemd [puppet] - https://gerrit.wikimedia.org/r/853939 (owner: Giuseppe Lavagetto) [14:22:26] PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_smokeping.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:38] RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:23:21] (Merged) jenkins-bot: calico: More calico 3.23.3 additions. 
[deployment-charts] - https://gerrit.wikimedia.org/r/854520 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [14:23:30] (PS2) Elukey: aptrepo: add new component for istio 1.15.3 in bullseye-wikimedia [puppet] - https://gerrit.wikimedia.org/r/854947 (https://phabricator.wikimedia.org/T322193) [14:24:24] (CR) JMeybohm: [C: +1] aptrepo: add new component for istio 1.15.3 in bullseye-wikimedia [puppet] - https://gerrit.wikimedia.org/r/854947 (https://phabricator.wikimedia.org/T322193) (owner: Elukey) [14:25:52] (PS1) Bking: elastic: finish decom of elastic2049 [puppet] - https://gerrit.wikimedia.org/r/855002 (https://phabricator.wikimedia.org/T313842) [14:30:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [14:30:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [14:30:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T318605)', diff saved to https://phabricator.wikimedia.org/P38826 and previous config saved to /var/cache/conftool/dbconfig/20221109-143050-ladsgroup.json [14:30:55] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:31:57] (CR) Andrea Denisse: [C: +1] "LGTM!" 
[software] - https://gerrit.wikimedia.org/r/854996 (owner: Filippo Giunchedi) [14:32:39] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:32:44] (CR) Filippo Giunchedi: [C: +2] clinic-duty: update Lumen regex and tests [software] - https://gerrit.wikimedia.org/r/854996 (owner: Filippo Giunchedi) [14:33:23] (PS3) Giuseppe Lavagetto: deploy-mwdebug: also update releases files for all other deployments [puppet] - https://gerrit.wikimedia.org/r/854975 [14:33:25] (PS1) Giuseppe Lavagetto: kubernetes::deployment_server: add new mw releases [puppet] - https://gerrit.wikimedia.org/r/855003 [14:34:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P38827 and previous config saved to /var/cache/conftool/dbconfig/20221109-143404-ladsgroup.json [14:35:40] (CR) Giuseppe Lavagetto: [C: +2] deploy-mwdebug: also update releases files for all other deployments [puppet] - https://gerrit.wikimedia.org/r/854975 (owner: Giuseppe Lavagetto) [14:36:21] (CR) Giuseppe Lavagetto: [C: +2] kubernetes::deployment_server: add new mw releases [puppet] - https://gerrit.wikimedia.org/r/855003 (owner: Giuseppe Lavagetto) [14:40:35] !log installing libxml2 security updates [14:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:25] !log reprepro remove bullseye-wikimedia trafficserver: T321309 [14:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:29] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [14:44:01] RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:26] (CR) Elukey: [C: +2] aptrepo: add new component for istio 
1.15.3 in bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/854947 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [14:46:08] !log Run `time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php --wiki=frwiki` in a tmux at mwmaint1002 (locally applied shorter MentorStore::INACTIVITY_THRESHOLD; T318457) [14:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:12] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457 [14:46:38] (03PS6) 10Eevans: Bootstrap new AQS Cassandra nodes (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/812426 (https://phabricator.wikimedia.org/T307802) [14:49:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T322618)', diff saved to https://phabricator.wikimedia.org/P38828 and previous config saved to /var/cache/conftool/dbconfig/20221109-144912-ladsgroup.json [14:49:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [14:49:17] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [14:49:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [14:49:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T322618)', diff saved to https://phabricator.wikimedia.org/P38829 and previous config saved to /var/cache/conftool/dbconfig/20221109-144933-ladsgroup.json [14:50:33] !log rolling restart of mw canaries to pick up libxml security update [14:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:05] (03CR) 10Btullis: [C: 03+1] "Looks good to me. 
I wonder why the pcc run didn't happen with check_experimental" [puppet] - 10https://gerrit.wikimedia.org/r/812426 (https://phabricator.wikimedia.org/T307802) (owner: 10Eevans) [14:53:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T322618)', diff saved to https://phabricator.wikimedia.org/P38830 and previous config saved to /var/cache/conftool/dbconfig/20221109-145341-ladsgroup.json [14:56:15] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:52] (03CR) 10Eevans: [C: 03+2] Bootstrap new AQS Cassandra nodes (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/812426 (https://phabricator.wikimedia.org/T307802) (owner: 10Eevans) [14:57:21] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:57:22] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1002.eqiad.wmnet [14:58:03] (03Abandoned) 10Bking: elastic: finish decom of elastic2049 [puppet] - 10https://gerrit.wikimedia.org/r/855002 (https://phabricator.wikimedia.org/T313842) (owner: 10Bking) [14:59:24] (03CR) 10Ssingh: [V: 03+1] sslcert: refactor update-ocsp.py to Python 3 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/854608 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:00:26] (03PS1) 10Bking: elastic: finish decom of elastic2049 [puppet] - 10https://gerrit.wikimedia.org/r/855004 (https://phabricator.wikimedia.org/T313842) [15:02:04] !log installing pixman security updates on buster [15:02:28] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1101.eqiad.wmnet [15:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:02:42] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1101.eqiad.wmnet [15:03:31] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1101.eqiad.wmnet [15:03:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T318605)', diff saved to https://phabricator.wikimedia.org/P38831 and previous config saved to /var/cache/conftool/dbconfig/20221109-150351-ladsgroup.json [15:03:57] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [15:04:19] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1002.eqiad.wmnet [15:04:41] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:04:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:07:15] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:08:34] uhm [15:08:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P38832 and previous config saved to /var/cache/conftool/dbconfig/20221109-150848-ladsgroup.json [15:10:28] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch2001.codfw.wmnet [15:10:42] !log stevemunene@cumin1001 END 
(FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kafka-stretch2001.codfw.wmnet [15:11:01] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch2001.codfw.wmnet [15:12:43] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1101.eqiad.wmnet [15:15:03] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:15:18] (ProbeDown) firing: (2) Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:15:18] (ProbeDown) firing: (2) Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:15:39] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:16:12] hello [15:16:26] hmm port 80 is varnish [15:16:27] looking [15:16:41] so the earlier failure on cp1075 must be related [15:16:44] I am ACKing it at least [15:16:49] and v6 only [15:18:35] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch2001.codfw.wmnet [15:18:52] seems to be recovering? 
[15:18:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P38833 and previous config saved to /var/cache/conftool/dbconfig/20221109-151858-ladsgroup.json [15:18:59] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [15:19:08] different host [15:19:15] yep [15:19:17] earlier it was 1075 [15:20:18] (ProbeDown) resolved: (2) Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:18] (ProbeDown) resolved: (2) Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:22] hmm let's check if pybal is having issues with those hosts too [15:20:49] Nov 9 15:20:39 lvs1017 pybal[15357]: [testlb_80 IdleConnection] WARN: cp1079.eqiad.wmnet (enabled/down/not pooled): Connection to 10.64.16.22:80 failed [15:20:50] here too [15:20:50] yeah [15:20:54] varnish is failing for some reason [15:21:09] but it is text on port 80 [15:21:19] jynus: and? [15:21:45] I would expect port 443 to fail earlier? [15:22:15] different request path [15:22:43] 443 is failing too BTW [15:22:49] and that's expected as soon as varnish crashes :) [15:23:03] not sure if relevant at this point but there is a small spike of traffic to port 80 in eqiad https://w.wiki/5wG5 [15:23:25] o/ here as well [15:23:36] here as well [15:23:49] so, this is just port 80 ? 
[15:23:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P38834 and previous config saved to /var/cache/conftool/dbconfig/20221109-152354-ladsgroup.json [15:23:57] so only the redirect to 443 ? [15:24:04] https://grafana.wikimedia.org/goto/9RGWNJvVk?orgId=1 https://grafana.wikimedia.org/goto/a7aZHJvVz?orgId=1 [15:24:07] ah, no just read backlog [15:24:14] nope, not just port 443 [15:24:19] *80 [15:24:20] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1100.eqiad.wmnet [15:24:59] pybal is also flagging port 443 in IPv4 && IPv6 down [15:25:03] Nov 09 15:24:27 lvs1017 pybal[15357]: [textlb_443 ProxyFetch] WARN: cp1079.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed (https://healthcheck.wikimedia.org/varnish-fe), 5.001 s [15:25:04] Nov 09 15:24:32 lvs1017 pybal[15357]: [textlb6_443 ProxyFetch] WARN: cp1079.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed (https://healthcheck.wikimedia.org/varnish-fe), 5.001 s [15:25:06] cp1079 seems not getting traffic back [15:25:28] cp1075 however got back to normal bandwidth [15:29:33] PROBLEM - Check systemd state on aqs1016 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:21] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1100.eqiad.wmnet [15:32:32] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1099.eqiad.wmnet [15:33:16] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:33:25] PROBLEM - cassandra-b CQL 10.64.131.15:9042 on aqs1020 is CRITICAL: connect to address 10.64.131.15 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 
[15:33:25] PROBLEM - cassandra-a CQL 10.64.32.22:9042 on aqs1018 is CRITICAL: connect to address 10.64.32.22 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:33:26] PROBLEM - cassandra-a SSL 10.64.48.119:7001 on aqs1019 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:33:53] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch2002.codfw.wmnet [15:34:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P38835 and previous config saved to /var/cache/conftool/dbconfig/20221109-153405-ladsgroup.json [15:34:30] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [15:34:35] RECOVERY - Check systemd state on aqs1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:47] PROBLEM - cassandra-a SSL 10.64.32.22:7001 on aqs1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:35:47] PROBLEM - cassandra-b SSL 10.64.131.15:7001 on aqs1020 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:38:23] PROBLEM - cassandra-b CQL 10.64.48.122:9042 on aqs1019 is CRITICAL: connect to address 10.64.48.122 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:39:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T322618)', diff saved to https://phabricator.wikimedia.org/P38836 and previous config saved to /var/cache/conftool/dbconfig/20221109-153901-ladsgroup.json 
[15:39:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [15:39:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [15:39:21] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [15:39:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T322618)', diff saved to https://phabricator.wikimedia.org/P38837 and previous config saved to /var/cache/conftool/dbconfig/20221109-153922-ladsgroup.json [15:40:50] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch2002.codfw.wmnet [15:40:59] PROBLEM - AQS root url on aqs1021 is CRITICAL: connect to address 10.64.135.7 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [15:40:59] PROBLEM - cassandra-a CQL 10.64.16.74:9042 on aqs1017 is CRITICAL: connect to address 10.64.16.74 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:40:59] PROBLEM - cassandra-b CQL 10.64.32.31:9042 on aqs1018 is CRITICAL: connect to address 10.64.32.31 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:40:59] PROBLEM - cassandra-b SSL 10.64.48.122:7001 on aqs1019 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:41:29] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1099.eqiad.wmnet [15:42:03] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1001.eqiad.wmnet [15:42:09] PROBLEM - cassandra-b service on aqs1020 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:42:15] vgutierrez: is everything alright from your end ? [15:43:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T322618)', diff saved to https://phabricator.wikimedia.org/P38838 and previous config saved to /var/cache/conftool/dbconfig/20221109-154330-ladsgroup.json [15:43:35] PROBLEM - cassandra-b SSL 10.64.32.31:7001 on aqs1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:43:35] PROBLEM - cassandra-a SSL 10.64.16.74:7001 on aqs1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:43:47] (03CR) 10Filippo Giunchedi: [C: 03+1] "I figured out the prometheus metric prefix, the rest LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [15:43:51] (03CR) 10Filippo Giunchedi: [C: 03+1] Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [15:44:40] <_joe_> btullis: any idea what's going on with cassandra on aqs? 
[15:44:44] <_joe_> or urandom [15:44:54] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [15:45:06] (03PS1) 10JMeybohm: calico: Allow calico-cni access to ipreservations [deployment-charts] - 10https://gerrit.wikimedia.org/r/855011 (https://phabricator.wikimedia.org/T307943) [15:45:11] _joe_ I think that Eric is bootstrapping new nodes [15:45:16] (03CR) 10JMeybohm: [C: 03+2] cfssl-issuer: Move from single to multiple files for CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/838135 (https://phabricator.wikimedia.org/T310486) (owner: 10JMeybohm) [15:45:18] (03CR) 10JMeybohm: [C: 03+2] cfssl-issuer: Bump CRD chart version for cfssl-issuer update [deployment-charts] - 10https://gerrit.wikimedia.org/r/838136 (owner: 10JMeybohm) [15:45:22] (03CR) 10JMeybohm: [C: 03+2] cfssl-issuer: Bump version and fix dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/838137 (https://phabricator.wikimedia.org/T310486) (owner: 10JMeybohm) [15:45:26] <_joe_> ah ok [15:46:05] PROBLEM - cassandra-a CQL 10.64.0.199:9042 on aqs1016 is CRITICAL: connect to address 10.64.0.199 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:46:05] PROBLEM - AQS root url on aqs1020 is CRITICAL: connect to address 10.64.131.7 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [15:46:06] PROBLEM - cassandra-a service on aqs1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:47:07] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[15:47:15] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:47:22] nvm [15:47:25] _joe_: note-to-self, downtime hosts when bootstrapping new nodes [15:47:29] (sorry...) [15:47:46] (03PS1) 10JMeybohm: calico: Allow different versions, drop pre bullseye support [puppet] - 10https://gerrit.wikimedia.org/r/855012 (https://phabricator.wikimedia.org/T307943) [15:47:51] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1001.eqiad.wmnet [15:48:43] PROBLEM - cassandra-b CQL 10.64.16.78:9042 on aqs1017 is CRITICAL: connect to address 10.64.16.78 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:49:06] PROBLEM - cassandra-a service on aqs1019 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:49:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T318605)', diff saved to https://phabricator.wikimedia.org/P38839 and previous config saved to /var/cache/conftool/dbconfig/20221109-154911-ladsgroup.json [15:49:13] PROBLEM - cassandra-b service on aqs1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:49:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [15:49:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [15:49:29] PROBLEM - cassandra-a service on aqs1018 is CRITICAL: CRITICAL - 
Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:49:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T318605)', diff saved to https://phabricator.wikimedia.org/P38840 and previous config saved to /var/cache/conftool/dbconfig/20221109-154933-ladsgroup.json [15:49:35] PROBLEM - cassandra-b service on aqs1019 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:50:15] (03Merged) 10jenkins-bot: cfssl-issuer: Move from single to multiple files for CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/838135 (https://phabricator.wikimedia.org/T310486) (owner: 10JMeybohm) [15:50:17] (03Merged) 10jenkins-bot: cfssl-issuer: Bump CRD chart version for cfssl-issuer update [deployment-charts] - 10https://gerrit.wikimedia.org/r/838136 (owner: 10JMeybohm) [15:50:36] (03Merged) 10jenkins-bot: cfssl-issuer: Bump version and fix dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/838137 (https://phabricator.wikimedia.org/T310486) (owner: 10JMeybohm) [15:51:17] PROBLEM - AQS root url on aqs1019 is CRITICAL: connect to address 10.64.48.147 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [15:51:17] PROBLEM - cassandra-a CQL 10.64.135.14:9042 on aqs1021 is CRITICAL: connect to address 10.64.135.14 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:51:17] PROBLEM - cassandra-b SSL 10.64.16.78:7001 on aqs1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:51:34] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [15:52:19] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:49] PROBLEM - cassandra-b CQL 10.64.0.213:9042 on aqs1016 is CRITICAL: connect to address 10.64.0.213 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:53:51] PROBLEM - cassandra-b service on aqs1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:53:51] PROBLEM - cassandra-a SSL 10.64.135.14:7001 on aqs1021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:55:55] (03CR) 10JMeybohm: [C: 03+2] calico: Allow calico-cni access to ipreservations [deployment-charts] - 10https://gerrit.wikimedia.org/r/855011 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [15:56:21] PROBLEM - AQS root url on aqs1018 is CRITICAL: connect to address 10.64.32.185 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [15:56:21] PROBLEM - cassandra-b SSL 10.64.0.213:7001 on aqs1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:56:21] PROBLEM - cassandra-a service on aqs1021 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:57:32] (03CR) 10Ahmon Dancy: "Amir, I'm looking for a +1 from you." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: 10Ahmon Dancy) [15:58:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P38841 and previous config saved to /var/cache/conftool/dbconfig/20221109-155836-ladsgroup.json [15:58:55] PROBLEM - cassandra-b service on aqs1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:58:56] PROBLEM - cassandra-a CQL 10.64.131.14:9042 on aqs1020 is CRITICAL: connect to address 10.64.131.14 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:58:56] PROBLEM - cassandra-b CQL 10.64.135.15:9042 on aqs1021 is CRITICAL: connect to address 10.64.135.15 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:00:56] (03Merged) 10jenkins-bot: calico: Allow calico-cni access to ipreservations [deployment-charts] - 10https://gerrit.wikimedia.org/r/855011 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:01:26] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:01:27] PROBLEM - AQS root url on aqs1017 is CRITICAL: connect to address 10.64.16.75 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [16:01:27] PROBLEM - cassandra-b SSL 10.64.135.15:7001 on aqs1021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:01:27] PROBLEM - cassandra-a SSL 10.64.131.14:7001 on aqs1020 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused 
https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:03:55] PROBLEM - cassandra-a CQL 10.64.48.119:9042 on aqs1019 is CRITICAL: connect to address 10.64.48.119 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:03:56] PROBLEM - cassandra-a service on aqs1020 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:03:56] PROBLEM - cassandra-b service on aqs1021 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:05:25] RECOVERY - AQS root url on aqs1017 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [16:05:55] RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:08:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (10RobH) [16:08:04] (03CR) 10Volans: [C: 03+2] Remove obsolete tool script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854969 (owner: 10Volans) [16:08:07] RECOVERY - AQS root url on aqs1018 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [16:08:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (10RobH) [16:08:15] (03PS2) 10Phuedx: EditAttemptStep sampling rate to 1 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854570 (https://phabricator.wikimedia.org/T312016) [16:09:11] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 
physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:09:14] (03Merged) 10jenkins-bot: Remove obsolete tool script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854969 (owner: 10Volans) [16:10:17] RECOVERY - AQS root url on aqs1019 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [16:11:43] RECOVERY - AQS root url on aqs1020 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [16:12:17] RECOVERY - AQS root url on aqs1021 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [16:13:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P38842 and previous config saved to /var/cache/conftool/dbconfig/20221109-161343-ladsgroup.json [16:13:59] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.199:9042 on aqs1016 is CRITICAL: connect to address 10.64.0.199 and port 9042: Connection refused eevans Bootstrapping new Cassandra nodes (T307802). https://phabricator.wikimedia.org/T93886 [16:13:59] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.0.213:9042 on aqs1016 is CRITICAL: connect to address 10.64.0.213 and port 9042: Connection refused eevans Bootstrapping new Cassandra nodes (T307802). https://phabricator.wikimedia.org/T93886 [16:13:59] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.0.213:7001 on aqs1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Bootstrapping new Cassandra nodes (T307802). 
https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:13:59] ACKNOWLEDGEMENT - cassandra-b service on aqs1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive eevans Bootstrapping new Cassandra nodes (T307802). https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:13:59] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.16.74:9042 on aqs1017 is CRITICAL: connect to address 10.64.16.74 and port 9042: Connection refused eevans Bootstrapping new Cassandra nodes (T307802). https://phabricator.wikimedia.org/T93886 [16:14:00] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.16.74:7001 on aqs1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Bootstrapping new Cassandra nodes (T307802). https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:14:00] ACKNOWLEDGEMENT - cassandra-a service on aqs1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive eevans Bootstrapping new Cassandra nodes (T307802). https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:15:52] (03CR) 10Ladsgroup: [C: 03+1] Only Enable LBFactory config callback in CLI in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: 10Ahmon Dancy) [16:16:20] (03CR) 10Ahmon Dancy: "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: 10Ahmon Dancy) [16:19:02] (03PS1) 10Jbond: puppet_compiler: bump to 4.2.2 [puppet] - 10https://gerrit.wikimedia.org/r/855020 [16:19:34] (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump to 4.2.2 [puppet] - 10https://gerrit.wikimedia.org/r/855020 (owner: 10Jbond) [16:24:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38048/console" [puppet] - 10https://gerrit.wikimedia.org/r/855012 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:28:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T322618)', diff saved to https://phabricator.wikimedia.org/P38843 and previous config saved to /var/cache/conftool/dbconfig/20221109-162849-ladsgroup.json [16:28:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:29:00] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [16:29:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:34:11] (03PS1) 10Hnowlan: thumbor: log according to the configured level [deployment-charts] - 10https://gerrit.wikimedia.org/r/855024 (https://phabricator.wikimedia.org/T233196) [16:35:03] (03CR) 10CI reject: [V: 04-1] thumbor: log according to the configured level [deployment-charts] - 10https://gerrit.wikimedia.org/r/855024 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [16:37:32] (03PS1) 10Jbond: puppet_compiler: allow for storing pson files with gzip [puppet] - 10https://gerrit.wikimedia.org/r/855025 [16:37:34] (03PS2) 10Hnowlan: thumbor: log according to the configured level [deployment-charts] - 10https://gerrit.wikimedia.org/r/855024 
(https://phabricator.wikimedia.org/T233196) [16:38:32] (03CR) 10Jbond: [C: 03+2] puppet_compiler: allow for storing pson files with gzip [puppet] - 10https://gerrit.wikimedia.org/r/855025 (owner: 10Jbond) [16:40:11] (03PS1) 10Volans: doc: removed STAGED status from Netbox diagram [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/855026 (https://phabricator.wikimedia.org/T320696) [16:41:27] (03PS2) 10Volans: sre.hosts.reimage: set Netbox to active [cookbooks] - 10https://gerrit.wikimedia.org/r/854961 (https://phabricator.wikimedia.org/T320696) [16:42:19] (03CR) 10Volans: [C: 03+2] Netbox statuses: no more servers in staged [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854970 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans) [16:43:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET serverlessservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:44:06] (03CR) 10Volans: [V: 03+2 C: 03+2] "Image uploaded to wikitech, self-merging." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/855026 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans) [16:44:09] (03Merged) 10jenkins-bot: Netbox statuses: no more servers in staged [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854970 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans) [16:44:48] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-usersfor David.pujol - https://phabricator.wikimedia.org/T322670 (10Htriedman) @fgiunchedi the expiry dates from other @tmlt.io folks are correct! With regards to NDA and final approval: I don't have access to the N... 
[16:48:07] (03CR) 10Vlad.shapik: [C: 03+1] thumbor: log according to the configured level [deployment-charts] - 10https://gerrit.wikimedia.org/r/855024 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [16:48:50] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (10Htriedman) @Dzahn @KFrancis Yes, that is correct, @dasm is a Tumult Labs contractor working with us on differential privacy. With regards to final approval, @Jcross is the app... [16:48:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:50:09] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: set Netbox to active [cookbooks] - 10https://gerrit.wikimedia.org/r/854961 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans) [16:52:36] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 8218 [16:52:57] (03CR) 10Ayounsi: [C: 03+1] netbox: restore 1D TTL on the dyna CNAME [dns] - 10https://gerrit.wikimedia.org/r/854945 (https://phabricator.wikimedia.org/T322700) (owner: 10Volans) [16:53:17] (03CR) 10Volans: [C: 03+2] netbox: restore 1D TTL on the dyna CNAME [dns] - 10https://gerrit.wikimedia.org/r/854945 (https://phabricator.wikimedia.org/T322700) (owner: 10Volans) [16:54:35] (03CR) 10Hnowlan: [C: 03+2] thumbor: log according to the configured level [deployment-charts] - 10https://gerrit.wikimedia.org/r/855024 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [16:54:45] (03CR) 10Volans: [C: 03+2] dns: silence log for decommissioned devices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/852806 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans) [16:55:34] !log ayounsi@cumin1001 END (PASS) 
- Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8218 [16:59:09] (03Merged) 10jenkins-bot: sre.hosts.reimage: set Netbox to active [cookbooks] - 10https://gerrit.wikimedia.org/r/854961 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans) [16:59:11] (03Merged) 10jenkins-bot: dns: silence log for decommissioned devices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/852806 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans) [16:59:45] (03PS1) 10Daniel Kinzler: mediawiki.org: set VE to new direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855029 [17:00:51] (03Merged) 10jenkins-bot: thumbor: log according to the configured level [deployment-charts] - 10https://gerrit.wikimedia.org/r/855024 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [17:03:10] (03PS2) 10Vgutierrez: ncredir: Add wikimediaenterprise.com rewrite rule [puppet] - 10https://gerrit.wikimedia.org/r/852202 (https://phabricator.wikimedia.org/T321804) [17:04:01] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:04:26] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Add wikimediaenterprise.com rewrite rule [puppet] - 10https://gerrit.wikimedia.org/r/852202 (https://phabricator.wikimedia.org/T321804) (owner: 10Vgutierrez) [17:08:12] (03PS2) 10JHathaway: aux-k8s: add BGP config for calico [homer/public] - 10https://gerrit.wikimedia.org/r/854110 (https://phabricator.wikimedia.org/T321120) [17:09:00] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [17:09:15] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [17:09:16] (03CR) 10Herron: dispatch: sync user 
role and info from LDAP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/852992 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [17:10:39] 10SRE, 10Traffic, 10Patch-For-Review: Enterprise redirect for wikimediaenterprise.com to enterprise.wikimedia.com - https://phabricator.wikimedia.org/T321804 (10Vgutierrez) 05Stalled→03Resolved a:03Vgutierrez `vgutierrez@ncredir6001:~$ curl -L -I http://wikimediaenterprise.com HTTP/1.1 301 Moved Perma... [17:10:47] (03CR) 10JHathaway: "pushed a new patch, thanks" [homer/public] - 10https://gerrit.wikimedia.org/r/854110 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [17:12:25] (03CR) 10JHathaway: aux-k8s: add BGP config for calico (035 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/854110 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [17:14:51] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10Vgutierrez) >>! In T188561#8381427, @EWilfong_WMF wrote: > Thanks for the feedback and requirements documentation, @Vgutierrez. Acoustic,... 
[17:16:43] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [17:17:02] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [17:19:00] (03PS1) 10Hnowlan: Decode poolcounter messages [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) [17:28:53] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [17:29:10] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [17:31:23] (03PS1) 10Arturo Borrero Gonzalez: cloudgw2002-dev: move to a single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/855034 (https://phabricator.wikimedia.org/T319184) [17:34:59] (03PS2) 10Hnowlan: Decode poolcounter messages, fix 429 error [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) [17:36:55] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:37:32] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Converted existing STAGED hosts to ACTIVE - volans@cumin1001 - T320696" [17:38:07] T320696: Reduce the count of Netbox devices with incorrect status - https://phabricator.wikimedia.org/T320696 [17:38:16] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "hey @Cathal could you please update the switch port setup for cloudgw2002-dev (https://netbox.wikimedia.org/dcim/devices/3026/interfaces/)" [puppet] - 10https://gerrit.wikimedia.org/r/855034 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [17:39:44] 10SRE, 10API Platform, 10serviceops: Block non-browser requests that use generic user agent (UA) headers - https://phabricator.wikimedia.org/T319423 (10daniel) [17:40:13] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox 
hiera data: "Converted existing STAGED hosts to ACTIVE - volans@cumin1001 - T320696" [17:41:27] PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:43:29] RECOVERY - Check systemd state on wcqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:43:35] hmmm [17:44:16] 10SRE, 10ops-esams, 10DC-Ops, 10Infrastructure-Foundations, 10decommission-hardware: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10Volans) I think this is still pending, and triggers a warning in the sre.dns.netbox cookbook because has IPs with DNS Names but the host is in decom... [17:45:15] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [17:46:44] (03CR) 10Vlad.shapik: [C: 03+1] "You are completely right." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [17:47:37] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [17:48:39] (03PS1) 10JHathaway: aux-k8s: fix typo in comment [puppet] - 10https://gerrit.wikimedia.org/r/855038 [17:48:41] (03PS1) 10JHathaway: aux-k8s: use default cni-config [puppet] - 10https://gerrit.wikimedia.org/r/855039 (https://phabricator.wikimedia.org/T321120) [17:49:27] PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:42] (03PS1) 10CDanis: Add 'block80' terms [homer/public] - 10https://gerrit.wikimedia.org/r/855040 (https://phabricator.wikimedia.org/T322774) [17:51:26] RECOVERY - Check systemd state on wcqs1003 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:57] PROBLEM - Host labstore1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:52:19] (03PS2) 10CDanis: Add 'block80' terms [homer/public] - 10https://gerrit.wikimedia.org/r/855040 (https://phabricator.wikimedia.org/T322774) [17:52:21] (03PS1) 10Nray: Enable VectorVisualEnhancementsNext flag on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855041 (https://phabricator.wikimedia.org/T322673) [17:53:15] (03CR) 10JHathaway: [C: 03+2] aux-k8s: fix typo in comment [puppet] - 10https://gerrit.wikimedia.org/r/855038 (owner: 10JHathaway) [17:53:24] (03CR) 10JHathaway: [C: 03+2] aux-k8s: use default cni-config [puppet] - 10https://gerrit.wikimedia.org/r/855039 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [17:53:26] (03PS3) 10CDanis: Add 'block80' terms [homer/public] - 10https://gerrit.wikimedia.org/r/855040 (https://phabricator.wikimedia.org/T322774) [17:54:14] (03PS2) 10Nray: Enable VectorVisualEnhancementsNext flag on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855041 (https://phabricator.wikimedia.org/T322673) [17:55:39] 10SRE, 10API Platform, 10serviceops: Block non-browser requests that use generic user agent (UA) headers - https://phabricator.wikimedia.org/T319423 (10daniel) We have rate limits in place for some generic UA strings: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/m... 
[17:55:59] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wcqs1003.eqiad.wmnet with reason: data reload [17:56:03] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt2002-dev: move to a single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/855042 (https://phabricator.wikimedia.org/T319184) [17:56:05] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt2003-dev: move to a single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/855043 (https://phabricator.wikimedia.org/T319184) [17:56:07] (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: hiera: cleanup per-host network overrides [puppet] - 10https://gerrit.wikimedia.org/r/855044 [17:56:26] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wcqs1003.eqiad.wmnet with reason: data reload [17:57:05] (03CR) 10Nray: [C: 04-1] "blah typo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855041 (https://phabricator.wikimedia.org/T322673) (owner: 10Nray) [17:57:23] 10SRE, 10serviceops: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10Dzahn) per T316223#8381863 serviceops-core is taking this over [17:57:34] 10SRE, 10serviceops: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10Dzahn) 05Stalled→03Open [17:57:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Dzahn) [17:57:48] (03PS3) 10Nray: Enable VectorVisualEnhancementsNext flag on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855041 (https://phabricator.wikimedia.org/T322673) [17:57:52] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Dzahn) [17:57:58] 10SRE, 10serviceops: service implementation tracking: arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319429 (10Dzahn) 05Stalled→03Open per 
T316223#8381863 serviceops-core is taking this over [17:58:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:59:03] (03CR) 10Arturo Borrero Gonzalez: "hey @cathal, please review the switch config for this host at https://netbox.wikimedia.org/dcim/devices/2069/interfaces/ and +1 when done " [puppet] - 10https://gerrit.wikimedia.org/r/855042 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [17:59:55] (03PS1) 10CDanis: Add block80 [homer/mock-private] - 10https://gerrit.wikimedia.org/r/855047 (https://phabricator.wikimedia.org/T322774) [17:59:59] (03PS1) 10Volans: reports: Network ignore empty DNS names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/855048 (https://phabricator.wikimedia.org/T320721) [18:00:05] (03PS4) 10CDanis: Add 'block80' terms [homer/public] - 10https://gerrit.wikimedia.org/r/855040 (https://phabricator.wikimedia.org/T322774) [18:00:52] (03CR) 10Arturo Borrero Gonzalez: "hey @cathal, please configure this host for single NIC https://netbox.wikimedia.org/dcim/devices/2070/interfaces/ and then +1 this patch. 
" [puppet] - 10https://gerrit.wikimedia.org/r/855043 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [18:00:54] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/mock-private] - 10https://gerrit.wikimedia.org/r/855047 (https://phabricator.wikimedia.org/T322774) (owner: 10CDanis) [18:02:09] (03CR) 10CDanis: [C: 03+2] Add block80 [homer/mock-private] - 10https://gerrit.wikimedia.org/r/855047 (https://phabricator.wikimedia.org/T322774) (owner: 10CDanis) [18:02:41] (03Merged) 10jenkins-bot: Add block80 [homer/mock-private] - 10https://gerrit.wikimedia.org/r/855047 (https://phabricator.wikimedia.org/T322774) (owner: 10CDanis) [18:04:45] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (10Dzahn) a:05Htriedman→03Jcross @Htriedman Thank you:) confirmed and sounds good. Let me reassign the ticket accordingly. Best, Daniel [18:07:49] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-usersfor David.pujol - https://phabricator.wikimedia.org/T322670 (10Dzahn) @fgiunchedi I asked the same about NDA coverage for Tumult Labs on T322591 and Katie replied at T322591#8377758. Looks like this is covered.... 
[18:08:01] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:09:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.236 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:10:16] (03PS5) 10CDanis: Add 'block80' terms [homer/public] - 10https://gerrit.wikimedia.org/r/855040 (https://phabricator.wikimedia.org/T322774) [18:10:27] (03CR) 10BCornwall: [C: 03+2] prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [18:10:34] (03PS9) 10BCornwall: prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) [18:11:20] (03CR) 10Vlad.shapik: Decode poolcounter messages, fix 429 error [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [18:14:44] (03CR) 10Ayounsi: [C: 03+1] Add 'block80' terms [homer/public] - 10https://gerrit.wikimedia.org/r/855040 (https://phabricator.wikimedia.org/T322774) (owner: 10CDanis) [18:15:19] (03CR) 10CDanis: [C: 03+2] Add 'block80' terms [homer/public] - 10https://gerrit.wikimedia.org/r/855040 (https://phabricator.wikimedia.org/T322774) (owner: 10CDanis) [18:15:57] (03Merged) 10jenkins-bot: Add 'block80' terms [homer/public] - 10https://gerrit.wikimedia.org/r/855040 (https://phabricator.wikimedia.org/T322774) (owner: 10CDanis) [18:17:01] (03CR) 10BCornwall: [C: 03+2] prometheus: Add ats header/body size total metrics [puppet] - 10https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) (owner: 10BCornwall) [18:17:05] (03PS7) 10BCornwall: prometheus: Add ats header/body size total metrics [puppet] - 10https://gerrit.wikimedia.org/r/845688 
(https://phabricator.wikimedia.org/T284304) [18:18:48] (03CR) 10Volans: [C: 03+2] "Tested on netbox-next, merging to clear some errors in the report, I'll address any later comment in a subsequent patch." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/855048 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans) [18:19:42] (03Merged) 10jenkins-bot: reports: Network ignore empty DNS names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/855048 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans) [18:19:44] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1023.eqiad.wmnet with OS bullseye [18:20:20] (03PS1) 10CDanis: fix missing colon [homer/public] - 10https://gerrit.wikimedia.org/r/855051 [18:20:30] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [18:20:48] (03CR) 10CDanis: [C: 03+2] fix missing colon [homer/public] - 10https://gerrit.wikimedia.org/r/855051 (owner: 10CDanis) [18:21:24] (03Merged) 10jenkins-bot: fix missing colon [homer/public] - 10https://gerrit.wikimedia.org/r/855051 (owner: 10CDanis) [18:23:26] (03CR) 10Herron: [C: 03+1] netmon: Open LibreNMS port for netmon2002. [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [18:24:30] (03CR) 10Vlad.shapik: Decode poolcounter messages, fix 429 error (033 comments) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [18:24:38] (03CR) 10Vlad.shapik: [C: 04-1] Decode poolcounter messages, fix 429 error [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [18:25:37] (03PS1) 10CDanis: need to specify tcp protocol? 
[homer/public] - 10https://gerrit.wikimedia.org/r/855052 [18:25:51] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01186 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:26:10] (03CR) 10CDanis: [C: 03+2] need to specify tcp protocol? [homer/public] - 10https://gerrit.wikimedia.org/r/855052 (owner: 10CDanis) [18:26:46] (03Merged) 10jenkins-bot: need to specify tcp protocol? [homer/public] - 10https://gerrit.wikimedia.org/r/855052 (owner: 10CDanis) [18:28:26] (03CR) 10Vlad.shapik: [C: 04-1] Decode poolcounter messages, fix 429 error (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [18:32:11] (03PS1) 10BCornwall: Revert "prometheus: Rename ats_ metrics to trafficserver_" [puppet] - 10https://gerrit.wikimedia.org/r/854605 [18:33:13] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1023.eqiad.wmnet with OS bullseye [18:33:15] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [18:34:15] (03CR) 10CI reject: [V: 04-1] Revert "prometheus: Rename ats_ metrics to trafficserver_" [puppet] - 10https://gerrit.wikimedia.org/r/854605 (owner: 10BCornwall) [18:34:17] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:35:13] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [18:38:41] (03PS2) 10BCornwall: Revert "prometheus: Rename ats_ metrics to trafficserver_" [puppet] - 10https://gerrit.wikimedia.org/r/854605 [18:40:27] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage 
(exit_code=97) for host cloudvirt1023.eqiad.wmnet with OS bullseye [18:41:03] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [18:43:03] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [18:45:37] (03CR) 10BCornwall: [C: 03+2] Revert "prometheus: Rename ats_ metrics to trafficserver_" [puppet] - 10https://gerrit.wikimedia.org/r/854605 (owner: 10BCornwall) [18:49:23] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1023.eqiad.wmnet with OS bullseye [18:49:38] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [19:00:05] Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221109T1900) [19:00:53] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1023.eqiad.wmnet with OS bullseye [19:01:05] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [19:01:55] (03PS1) 10RLazarus: homer: Don't accept a commit with an empty message [puppet] - 10https://gerrit.wikimedia.org/r/855055 [19:05:53] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1023.eqiad.wmnet with OS bullseye [19:07:02] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [19:11:49] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005437 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:13:50] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1013.eqiad.wmnet with reason: Remove from 
cluster for eventual reimage [19:14:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1013.eqiad.wmnet with reason: Remove from cluster for eventual reimage [19:16:10] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the porting." [puppet] - 10https://gerrit.wikimedia.org/r/855055 (owner: 10RLazarus) [19:19:34] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (10ayounsi) The DHCP requests were making it to cloudsw1-c8 but not further. cloudsw1-c8 was not creating binding neither (so it was not processing them). I enabled traceoptions... [19:24:23] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:26:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10Jclark-ctr) @Cmjohnson ganeti1033 had a bad cable replaced. 
ganeti1034 is connected properly and has link for management [19:29:06] (03CR) 10Andrea Denisse: netmon: Put the netmon2002 as passive server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854625 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [19:35:12] !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [19:35:17] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye [19:35:19] !log root@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1023.eqiad.wmnet with OS bullseye [19:35:25] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by root@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye executed with errors: - cloudvirt1023 (**FAI... 
[19:37:56] !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [19:38:01] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye [19:46:21] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:47:05] (03PS1) 10Nray: Fix TOC misaligned when max width option is disable [skins/Vector] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855068 (https://phabricator.wikimedia.org/T322162) [19:47:27] (03CR) 10Dzahn: [C: 03+2] Enable profile::auto_restarts::service for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/854991 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [19:47:50] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1033 [19:47:52] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1033 [19:47:56] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1034 [19:47:58] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1034 [19:50:27] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1033.mgmt.eqiad.wmnet with reboot policy FORCED [19:50:55] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [19:52:11] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1033.mgmt.eqiad.wmnet with reboot policy FORCED 
[19:53:58] (03PS1) 10Andrew Bogott: netboot cloudvirts: only preserve /srv on cloudvirt1028 [puppet] - 10https://gerrit.wikimedia.org/r/855057 (https://phabricator.wikimedia.org/T319042) [19:54:59] (03PS1) 10Ottomata: Create platform-eng-deployers group for deploying airflow platform_eng [puppet] - 10https://gerrit.wikimedia.org/r/855059 (https://phabricator.wikimedia.org/T321925) [19:56:14] (03CR) 10Andrew Bogott: [C: 03+2] netboot cloudvirts: only preserve /srv on cloudvirt1028 [puppet] - 10https://gerrit.wikimedia.org/r/855057 (https://phabricator.wikimedia.org/T319042) (owner: 10Andrew Bogott) [19:57:13] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38054/console" [puppet] - 10https://gerrit.wikimedia.org/r/855059 (https://phabricator.wikimedia.org/T321925) (owner: 10Ottomata) [19:59:07] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1034.mgmt.eqiad.wmnet with reboot policy FORCED [19:59:28] (03CR) 10Ottomata: [V: 03+1] "This should do it, but I can't recall if we need a restart of the keyholder service." 
[puppet] - 10https://gerrit.wikimedia.org/r/855059 (https://phabricator.wikimedia.org/T321925) (owner: 10Ottomata) [20:00:25] (03CR) 10Ottomata: [V: 03+1] Create platform-eng-deployers group for deploying airflow platform_eng (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/855059 (https://phabricator.wikimedia.org/T321925) (owner: 10Ottomata) [20:01:03] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1034.mgmt.eqiad.wmnet with reboot policy FORCED [20:01:15] !log root@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1023.eqiad.wmnet with OS bullseye [20:01:21] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by root@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye executed with errors: - cloudvirt1023 (**FAI... [20:01:31] !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [20:01:37] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye [20:04:08] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1033.mgmt.eqiad.wmnet with reboot policy FORCED [20:05:54] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1033.mgmt.eqiad.wmnet with reboot policy FORCED [20:07:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH) [20:10:02] (03Abandoned) 10Nray: Enable VectorVisualEnhancementsNext flag on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855041 (https://phabricator.wikimedia.org/T322673) 
(owner: 10Nray) [20:11:07] (03CR) 10Dzahn: [C: 03+2] "lgtm, timer and service has been created on aphlict1001" [puppet] - 10https://gerrit.wikimedia.org/r/854991 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [20:12:15] (03CR) 10Dzahn: [C: 03+2] roles: add/update role contacts for aphlict,miscweb,planet,rt [puppet] - 10https://gerrit.wikimedia.org/r/853454 (owner: 10Dzahn) [20:12:50] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1034.mgmt.eqiad.wmnet with reboot policy FORCED [20:14:48] (03PS1) 10Dzahn: phabricator: set ServiceOps-Collab as role contacts [puppet] - 10https://gerrit.wikimedia.org/r/855062 [20:16:05] (03CR) 10Dzahn: [C: 03+2] phabricator: set ServiceOps-Collab as role contacts [puppet] - 10https://gerrit.wikimedia.org/r/855062 (owner: 10Dzahn) [20:18:33] (03CR) 10Dzahn: "yea, there was a reason though why this was enabled. it was a follow-up to performance issues in the past. it has been a long time though." [puppet] - 10https://gerrit.wikimedia.org/r/854514 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar) [20:19:29] (03CR) 10Dzahn: [C: 03+2] "Info: /Stage[main]/Profile::Contacts/Concat[/etc/wikimedia/contacts.yaml]" [puppet] - 10https://gerrit.wikimedia.org/r/855062 (owner: 10Dzahn) [20:21:57] (03CR) 10Dzahn: "@Chad remember this? 
https://phabricator.wikimedia.org/rOPUP4bf4122b85dcbfc2587c3a30f72eccd8d556ad8b" [puppet] - 10https://gerrit.wikimedia.org/r/854514 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar) [20:23:11] (03PS2) 10Volans: sre.hosts.provision: use default if in UEFI mode [cookbooks] - 10https://gerrit.wikimedia.org/r/854545 (https://phabricator.wikimedia.org/T321128) [20:23:35] (03PS4) 10Jon Harald Søby: Add no=>nb to wgInterlanguageLinkCodeMap for some multilingual wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854618 (https://phabricator.wikimedia.org/T322696) [20:23:58] !log root@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1023.eqiad.wmnet with OS bullseye [20:24:10] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by root@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye executed with errors: - cloudvirt1023 (**FAI... 
[20:24:17] (03CR) 10Hashar: gerrit: remove git gc aggressive (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854514 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar) [20:26:39] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1034.mgmt.eqiad.wmnet with reboot policy FORCED [20:26:43] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1033.mgmt.eqiad.wmnet with reboot policy FORCED [20:27:09] (03CR) 10Dzahn: [C: 03+2] gerrit: remove git gc aggressive [puppet] - 10https://gerrit.wikimedia.org/r/854514 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar) [20:30:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T318605)', diff saved to https://phabricator.wikimedia.org/P38848 and previous config saved to /var/cache/conftool/dbconfig/20221109-203031-ladsgroup.json [20:30:33] !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [20:30:36] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:30:39] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye [20:30:44] !log gerrit2002 (gerrit-replica) - restarting gerrit service [20:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:38] (03CR) 10Xcollazo: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/855059 (https://phabricator.wikimedia.org/T321925) (owner: 10Ottomata) [20:32:01] (03CR) 10Volans: [C: 03+2] "self-merging to unblock dcops" [cookbooks] - 10https://gerrit.wikimedia.org/r/854545 (https://phabricator.wikimedia.org/T321128) (owner: 10Volans) [20:33:51] (03CR) 10Dzahn: [C: 04-2] "can be abandoned after https://gerrit.wikimedia.org/r/c/operations/puppet/+/853061 ?" [puppet] - 10https://gerrit.wikimedia.org/r/824222 (https://phabricator.wikimedia.org/T315445) (owner: 10Hashar) [20:33:52] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (10Krinkle) [20:35:02] !log gerrit1001 (gerrit) - restarting gerrit service to disable aggressive garbage collection. gerrit:854514 - T237807 [20:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:08] T237807: gerrit: scoring/ores/editquality takes a long time to git gc - https://phabricator.wikimedia.org/T237807 [20:36:36] (03Merged) 10jenkins-bot: sre.hosts.provision: use default if in UEFI mode [cookbooks] - 10https://gerrit.wikimedia.org/r/854545 (https://phabricator.wikimedia.org/T321128) (owner: 10Volans) [20:38:09] (03CR) 10Dzahn: [C: 03+2] "ack, thanks Antoine. 
deployed and I also did the gerrit service restart on both servers, just now on prod" [puppet] - 10https://gerrit.wikimedia.org/r/854514 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar) [20:41:02] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1033.mgmt.eqiad.wmnet with reboot policy FORCED [20:41:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH) [20:42:26] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1023.eqiad.wmnet with reason: host reimage [20:45:02] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 4 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10greg) [20:45:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P38849 and previous config saved to /var/cache/conftool/dbconfig/20221109-204538-ladsgroup.json [20:45:49] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1023.eqiad.wmnet with reason: host reimage [20:46:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH) [20:51:14] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:51:33] (03PS11) 10Dzahn: dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260 [20:51:58] (03CR) 10Dzahn: dumps/distribution: add more data types to parameters (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [20:52:04] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 [20:52:09] (03CR) 10CI reject: [V: 04-1] dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [20:52:41] (03CR) 10CI reject: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 (owner: 10CDanis) [20:53:25] (03CR) 10Dzahn: "hrmm. still syntax error modules/wmflib/types/dumps/mirror.pp, line: 13" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [20:53:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH) [20:54:46] (03CR) 10Dzahn: "brackets! [{ not {[ :)" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [20:55:13] (03PS12) 10Dzahn: dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260 [20:55:50] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 [20:58:20] (03CR) 10CI reject: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 (owner: 10CDanis) [20:59:10] (03CR) 10Dzahn: "parameter 'rsync_mirrors' index 5 entry 'addeddate' expects a String[1] value, got String" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [20:59:30] (03PS3) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 [21:00:03] (03CR) 10Dzahn: "entry 'active' expects a match for Stdlib::Yes_no = Pattern[/\A(?i:(yes|no))\z/], got 'notrightnow' hahaha" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221109T2100). 
[21:00:05] dancy, nray, and Jhs: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:18] o/ im here [21:00:33] * TheresNoTime can deploy! [21:00:35] present [21:00:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P38850 and previous config saved to /var/cache/conftool/dbconfig/20221109-210044-ladsgroup.json [21:01:06] Jhs: I'll start with yours :) [21:01:11] cool :) [21:01:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854618 (https://phabricator.wikimedia.org/T322696) (owner: 10Jon Harald Søby) [21:01:32] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:01:38] (03CR) 10CI reject: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 (owner: 10CDanis) [21:01:59] (03PS4) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 [21:02:07] o/ [21:02:11] (03Merged) 10jenkins-bot: Add no=>nb to wgInterlanguageLinkCodeMap for some multilingual wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854618 (https://phabricator.wikimedia.org/T322696) (owner: 10Jon Harald Søby) [21:02:28] hey dancy, just doing the other config change (854618), can do yours next? [21:02:29] !log samtar@deploy1002 Started scap: Backport for [[gerrit:854618|Add no=>nb to wgInterlanguageLinkCodeMap for some multilingual wikis (T322696)]] [21:02:33] T322696: Interlanguage links to no.wikipedia.org on wikis that use Wikipedia as the interlanguage link target should use the language name "Norsk bokmål" instead of just "Norsk" - https://phabricator.wikimedia.org/T322696 [21:02:42] TheresNoTime: ok! 
[21:02:49] !log samtar@deploy1002 samtar and jhsoby: Backport for [[gerrit:854618|Add no=>nb to wgInterlanguageLinkCodeMap for some multilingual wikis (T322696)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:03:28] Jhs: that's live on mwdebug, can you test? :) [21:03:59] TheresNoTime, confirmed, looks correct on all affected wikis 👍 [21:04:12] (03CR) 10CI reject: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 (owner: 10CDanis) [21:04:19] great, syncing [21:05:10] (03PS3) 10Samtar: Only Enable LBFactory config callback in CLI in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: 10Ahmon Dancy) [21:05:24] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:06:56] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:07:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ryasmeen (superset access with no server access) - https://phabricator.wikimedia.org/T322795 (10Ryasmeen) [21:08:36] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:854618|Add no=>nb to wgInterlanguageLinkCodeMap for some multilingual wikis (T322696)]] (duration: 06m 06s) [21:08:40] Jhs: that should be live now :) [21:08:41] T322696: Interlanguage links to no.wikipedia.org on wikis that use Wikipedia as the interlanguage link target should use the language name "Norsk bokmål" instead of just "Norsk" - 
https://phabricator.wikimedia.org/T322696 [21:08:47] TheresNoTime, great, thank you! [21:08:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: 10Ahmon Dancy) [21:09:02] (03PS5) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 [21:09:14] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1023.eqiad.wmnet with OS bullseye [21:09:21] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by root@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye completed: - cloudvirt1023 (**WARN**) - Re... [21:09:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:09:37] (03Merged) 10jenkins-bot: Only Enable LBFactory config callback in CLI in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: 10Ahmon Dancy) [21:09:50] !log samtar@deploy1002 Started scap: Backport for [[gerrit:854090|Only Enable LBFactory config callback in CLI in production (T298485)]] [21:09:54] T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485 [21:10:10] !log samtar@deploy1002 samtar and dancy: Backport for [[gerrit:854090|Only Enable LBFactory config callback in CLI in production (T298485)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:10:17] (03CR) 10Samtar: [C: 03+2] "deploying — starting merge" [skins/Vector] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855068 (https://phabricator.wikimedia.org/T322162) (owner: 10Nray) [21:10:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply [21:10:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:10:23] (03PS1) 10JHathaway: aux-k8s: remove istio mesh values [deployment-charts] - 10https://gerrit.wikimedia.org/r/855092 (https://phabricator.wikimedia.org/T321120) [21:10:27] dancy: live on mwdebug :) [21:10:30] testing... [21:11:05] (03PS6) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 [21:11:07] (03CR) 10CI reject: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 (owner: 10CDanis) [21:11:14] (03PS2) 10Daniel Kinzler: mediawiki.org: set VE to new direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855029 [21:11:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:11:30] TheresNoTime: confirmed. [21:11:37] syncin' [21:12:20] nray: your patch is merging now FYI [21:12:32] TheresNoTime: thank you! [21:13:07] (03CR) 10CI reject: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 (owner: 10CDanis) [21:14:55] (03CR) 10RLazarus: [C: 03+2] homer: Don't accept a commit with an empty message [puppet] - 10https://gerrit.wikimedia.org/r/855055 (owner: 10RLazarus) [21:15:00] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (10Andrew) 05Open→03Resolved [21:15:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10Andrew) [21:15:32] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:854090|Only Enable LBFactory config callback in CLI in production (T298485)]] (duration: 05m 41s) [21:15:37] T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485 [21:15:44] (03PS7) 10CDanis: No-op change. 
Replace the idea of stickycounters with actions [puppet] - 10https://gerrit.wikimedia.org/r/855089 (https://phabricator.wikimedia.org/T306580) [21:15:46] Gracias! [21:15:46] dancy: that's live now :) [21:15:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T318605)', diff saved to https://phabricator.wikimedia.org/P38851 and previous config saved to /var/cache/conftool/dbconfig/20221109-211551-ladsgroup.json [21:15:53] yw! [21:15:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [21:15:56] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [21:16:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [21:16:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T318605)', diff saved to https://phabricator.wikimedia.org/P38852 and previous config saved to /var/cache/conftool/dbconfig/20221109-211613-ladsgroup.json [21:16:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:17:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:17:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:17:58] (03PS1) 10Dzahn: dumps/distribution: fix values that don't fit into data types [puppet] - 10https://gerrit.wikimedia.org/r/855096 [21:18:06] (03CR) 10CDanis: "PCC confirms no-op on both experimental and normal hosts https://puppet-compiler.wmflabs.org/pcc-worker1003/38062/" [puppet] - 10https://gerrit.wikimedia.org/r/855089 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [21:18:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:19:26] (03CR) 10Dzahn: "ArielGlenn, Hokwelum: now there would first be this more simple 
change to look at: https://gerrit.wikimedia.org/r/c/operations/puppet/+/85" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [21:19:38] (03CR) 10Andrew Bogott: [C: 03+2] Remove obsolete files for OpenStack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/851123 (owner: 10Andrew Bogott) [21:19:56] (03CR) 10CI reject: [V: 04-1] dumps/distribution: fix values that don't fit into data types [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn) [21:20:25] (03PS8) 10CDanis: No-op change. Replace the idea of stickycounters with actions [puppet] - 10https://gerrit.wikimedia.org/r/855089 (https://phabricator.wikimedia.org/T306580) [21:21:16] 855068 *almost* merged :) [21:21:44] 👍 [21:21:58] (03CR) 10CDanis: "updated pcc still lgtm https://puppet-compiler.wmflabs.org/pcc-worker1001/38063/" [puppet] - 10https://gerrit.wikimedia.org/r/855089 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [21:22:42] (03CR) 10JHathaway: [C: 03+2] aux-k8s: remove istio mesh values [deployment-charts] - 10https://gerrit.wikimedia.org/r/855092 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [21:23:07] (03CR) 10CI reject: [V: 04-1] No-op change. 
Replace the idea of stickycounters with actions [puppet] - 10https://gerrit.wikimedia.org/r/855089 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [21:23:44] (03PS1) 10Ahmon Dancy: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/855097 [21:25:56] (03Merged) 10jenkins-bot: Fix TOC misaligned when max width option is disable [skins/Vector] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855068 (https://phabricator.wikimedia.org/T322162) (owner: 10Nray) [21:26:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855068 (https://phabricator.wikimedia.org/T322162) (owner: 10Nray) [21:26:35] !log samtar@deploy1002 Started scap: Backport for [[gerrit:855068|Fix TOC misaligned when max width option is disable (T322162)]] [21:26:40] T322162: [M] Table of contents misaligned with max width disabled - https://phabricator.wikimedia.org/T322162 [21:26:55] !log samtar@deploy1002 samtar and nray: Backport for [[gerrit:855068|Fix TOC misaligned when max width option is disable (T322162)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:26:57] nray: that's live on mwdebug now, can you test? [21:27:28] (03PS2) 10Dzahn: dumps/distribution: fix values that don't fit into data types [puppet] - 10https://gerrit.wikimedia.org/r/855096 [21:27:36] TheresNoTime: thank you, which server is it on? [21:27:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH) When I attempt to pull up https://ganeti1033.mgmt.eqiad.wmnet I get 'Bad Request' ` Bad Request Your browser sent a request that this server could not unde... 
[21:27:40] (03CR) 10Ahmon Dancy: [C: 03+2] Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/855097 (owner: 10Ahmon Dancy) [21:28:02] nray: use `mwdebug1001` :) [21:28:03] (03CR) 10CI reject: [V: 04-1] dumps/distribution: fix values that don't fit into data types [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn) [21:28:08] k, checking [21:28:15] (03PS3) 10Dzahn: dumps/distribution: fix values that don't fit into data types [puppet] - 10https://gerrit.wikimedia.org/r/855096 [21:28:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:28:29] (03Merged) 10jenkins-bot: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/855097 (owner: 10Ahmon Dancy) [21:28:50] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/855098 [21:29:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:29:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:29:54] (03PS2) 10Andrew Bogott: Open magnum and heat apis to the greater internet [puppet] - 10https://gerrit.wikimedia.org/r/854092 (https://phabricator.wikimedia.org/T319312) [21:29:56] (03PS1) 10Andrew Bogott: Upgrade openstack libs on Bullseye VMs to version 'xena' [puppet] - 10https://gerrit.wikimedia.org/r/855099 [21:30:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:30:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T318605)', diff saved to https://phabricator.wikimedia.org/P38853 and previous config saved to /var/cache/conftool/dbconfig/20221109-213010-ladsgroup.json [21:30:14] T318605: Deploy new externallinks fields to production - 
https://phabricator.wikimedia.org/T318605 [21:31:08] TheresNoTime: things look good! You may proceed! [21:31:15] syncin'! [21:32:03] (03CR) 10CI reject: [V: 04-1] Open magnum and heat apis to the greater internet [puppet] - 10https://gerrit.wikimedia.org/r/854092 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [21:32:24] (03CR) 10Andrew Bogott: [C: 03+2] Upgrade openstack libs on Bullseye VMs to version 'xena' [puppet] - 10https://gerrit.wikimedia.org/r/855099 (owner: 10Andrew Bogott) [21:32:31] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/855098 [21:32:56] (03PS2) 10Andrew Bogott: Upgrade openstack libs on Bullseye VMs to version 'xena' [puppet] - 10https://gerrit.wikimedia.org/r/855099 [21:33:42] TheresNoTime: Thank you for your help! [21:33:48] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 38 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:34:12] nray: no worries! it'll take another few minutes to be live everywhere, and it's worth checking again on production proper :) [21:35:16] (03PS3) 10CDanis: haproxy: concurrency tracking as discussed [puppet] - 10https://gerrit.wikimedia.org/r/855098 (https://phabricator.wikimedia.org/T306580) [21:35:24] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:855068|Fix TOC misaligned when max width option is disable (T322162)]] (duration: 08m 48s) [21:35:29] T322162: [M] Table of contents misaligned with max width disabled - https://phabricator.wikimedia.org/T322162 [21:35:49] nray: that should be live now :) [21:35:57] looks good, thank you! 
[21:36:55] (03CR) 10CDanis: "PCC LGTM (matches my hand-crafted file from manual testing) https://puppet-compiler.wmflabs.org/pcc-worker1001/38065/" [puppet] - 10https://gerrit.wikimedia.org/r/855098 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [21:37:02] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/38066/clouddumps1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:37:03] :) [21:37:08] (03PS6) 10Dzahn: dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) [21:37:12] !log closing UTC late backport window [21:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:50] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 12 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:39:12] (03CR) 10CI reject: [V: 04-1] dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:42:49] (03PS4) 10CDanis: haproxy: concurrency tracking as discussed [puppet] - 10https://gerrit.wikimedia.org/r/855098 (https://phabricator.wikimedia.org/T306580) [21:43:43] (03PS1) 10RobH: site.pp update for ganeti103[34] [puppet] - 10https://gerrit.wikimedia.org/r/855100 (https://phabricator.wikimedia.org/T314303) [21:43:48] (03CR) 10Dzahn: [C: 04-1] "well..now it fails because it's not in cloud.yaml, whether we need it or not." 
[puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:44:33] (03CR) 10RobH: [C: 03+2] site.pp update for ganeti103[34] [puppet] - 10https://gerrit.wikimedia.org/r/855100 (https://phabricator.wikimedia.org/T314303) (owner: 10RobH) [21:45:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P38854 and previous config saved to /var/cache/conftool/dbconfig/20221109-214516-ladsgroup.json [21:45:25] (03CR) 10Dzahn: [C: 04-1] "rabbit hole alert over here :) maybe it's "Could not find declared class openstack::nova::common::victoria::buster"" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:46:41] (03CR) 10Dzahn: "not sure if this is transient but I just got this on an unrelated change: Could not find declared class openstack::nova::common::victoria:" [puppet] - 10https://gerrit.wikimedia.org/r/855099 (owner: 10Andrew Bogott) [21:47:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH) [21:47:28] (03PS7) 10Dzahn: dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) [21:47:46] (03CR) 10Dzahn: "rebasing after seeing https://gerrit.wikimedia.org/r/c/operations/puppet/+/855099/" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:48:12] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1034.eqiad.wmnet with OS bullseye [21:48:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye [21:49:31] (03CR) 10CI reject: [V: 04-1] dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:50:58] (03PS1) 10Eevans: Add component/gocql to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/855102 (https://phabricator.wikimedia.org/T283838) [21:55:21] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti1034.eqiad.wmnet with OS bullseye [21:55:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye executed w... [21:57:23] (03PS1) 10RobH: adding ganeti103[34] netboot [puppet] - 10https://gerrit.wikimedia.org/r/855105 (https://phabricator.wikimedia.org/T314303) [21:57:42] (03CR) 10RobH: [C: 03+2] adding ganeti103[34] netboot [puppet] - 10https://gerrit.wikimedia.org/r/855105 (https://phabricator.wikimedia.org/T314303) (owner: 10RobH) [21:58:53] (03CR) 10Dzahn: [C: 03+1] "compiling on C:rsync::server (because that works, unlike C:rsync or C:rsync::quickdatacopy which is a defined type)" [puppet] - 10https://gerrit.wikimedia.org/r/850171 (owner: 10Jbond) [22:00:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P38855 and previous config saved to /var/cache/conftool/dbconfig/20221109-220023-ladsgroup.json [22:01:12] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1034.eqiad.wmnet with OS bullseye [22:01:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye [22:06:20] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti1034.eqiad.wmnet with OS bullseye [22:06:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye executed w... [22:08:07] (03CR) 10Dzahn: "I noticed contint2002 in puppet board as failed and it's because "Could not find class ::role::insetup::unowned"" [puppet] - 10https://gerrit.wikimedia.org/r/852216 (owner: 10Muehlenhoff) [22:11:55] (03PS1) 10Dzahn: site: move contint2002 from insetup::unowned to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/855147 [22:13:39] (03PS2) 10Dzahn: site: move contint2002 from insetup::unowned to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/855147 [22:14:14] (03CR) 10CI reject: [V: 04-1] site: move contint2002 from insetup::unowned to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/855147 (owner: 10Dzahn) [22:14:51] (03CR) 10Dzahn: "CI: Expected one space after 'Bug:' Phabricator: I don't care, already added it :p" [puppet] - 10https://gerrit.wikimedia.org/r/855147 (owner: 10Dzahn) [22:15:15] (03PS3) 10Dzahn: site: move contint2002 from insetup::unowned to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/855147 (https://phabricator.wikimedia.org/T294276) [22:15:22] (03CR) 10Dzahn: [C: 03+2] site: move contint2002 from insetup::unowned to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/855147 (https://phabricator.wikimedia.org/T294276) (owner: 10Dzahn) [22:15:26] (03PS2) 10Andrea Denisse: netmon: Open LibreNMS port for netmon2002. 
[puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) [22:15:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T318605)', diff saved to https://phabricator.wikimedia.org/P38856 and previous config saved to /var/cache/conftool/dbconfig/20221109-221529-ladsgroup.json [22:15:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [22:15:35] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [22:15:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [22:15:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T318605)', diff saved to https://phabricator.wikimedia.org/P38857 and previous config saved to /var/cache/conftool/dbconfig/20221109-221551-ladsgroup.json [22:16:51] (03CR) 10Dzahn: "follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/855147" [puppet] - 10https://gerrit.wikimedia.org/r/852216 (owner: 10Muehlenhoff) [22:17:26] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38070/console" [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [22:17:42] (03CR) 10CI reject: [V: 04-1] netmon: Open LibreNMS port for netmon2002. [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [22:19:29] (03CR) 10Dzahn: [C: 03+2] "puppet runs on contint2002 again. 
fwiw a user Admin::Hashuser[stevemunene] was created which seems a new root user from https://phabricato" [puppet] - 10https://gerrit.wikimedia.org/r/855147 (https://phabricator.wikimedia.org/T294276) (owner: 10Dzahn) [22:23:30] (03CR) 10Dzahn: [C: 03+1] "it's interesting how the compiler lists a host under "Hosts that compile with differences" but when you click it it claims "no change" on " [puppet] - 10https://gerrit.wikimedia.org/r/850171 (owner: 10Jbond) [22:25:50] (03CR) 10Dzahn: [C: 03+2] R:rsync::manifests::server::module: add type validation [puppet] - 10https://gerrit.wikimedia.org/r/850171 (owner: 10Jbond) [22:28:40] (03CR) 10Dzahn: [C: 03+2] "noop confirmed on various random hosts (doc2001, contint2002, gitlab1003, mw1318, mirror1001).. will also watch puppetboard in a couple mi" [puppet] - 10https://gerrit.wikimedia.org/r/850171 (owner: 10Jbond) [22:32:42] (03CR) 10Dzahn: [C: 03+2] "also checked in cloud on gitlab-prod-1001.devtools.eqiad1.wikimedia.cloud" [puppet] - 10https://gerrit.wikimedia.org/r/850171 (owner: 10Jbond) [22:36:28] (03CR) 10Dzahn: rsync::quickdatacopy: Allow having multiple destination hosts (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715636 (owner: 10Legoktm) [22:36:45] (03PS3) 10Dzahn: rsync::quickdatacopy: Allow having multiple destination hosts [puppet] - 10https://gerrit.wikimedia.org/r/715636 (owner: 10Legoktm) [22:37:08] (03PS4) 10Dzahn: rsync::quickdatacopy: Allow having multiple destination hosts [puppet] - 10https://gerrit.wikimedia.org/r/715636 (owner: 10Legoktm) [22:39:30] (03CR) 10CI reject: [V: 04-1] rsync::quickdatacopy: Allow having multiple destination hosts [puppet] - 10https://gerrit.wikimedia.org/r/715636 (owner: 10Legoktm) [22:41:19] (03CR) 10Dzahn: "more of Could not find declared class openstack::nova::common::victoria::buster from unrelated cloud change afaict" [puppet] - 10https://gerrit.wikimedia.org/r/715636 (owner: 10Legoktm) [22:47:34] PROBLEM - Check systemd state on phab1004 is CRITICAL: 
CRITICAL - degraded: The following units failed: wmf_auto_restart_aphlict.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:51:59] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1034.eqiad.wmnet with OS bullseye [22:52:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye [23:00:50] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [23:03:57] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [23:04:11] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1034.eqiad.wmnet with reason: host reimage [23:07:34] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1034.eqiad.wmnet with reason: host reimage [23:17:03] !log removing 1 file for legal compliance [23:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:13] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1034.eqiad.wmnet with OS bullseye [23:22:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye executed with errors: - ganeti10... 
[23:43:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH) [23:43:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH) @MoritzMuehlenhoff I recall you stating the puppet run fails in the installer but then just re-run and it's fine? If so, ganeti1034 is ready for ya. [23:44:11] !log removing 2 files for legal compliance [23:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:10] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:57:00] !log removing 1 file for legal compliance [23:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log