[00:01:48] <wikibugs>	 (PS1) Andrea Denisse: netmon: Put the netmon2002 as passive server [puppet] - https://gerrit.wikimedia.org/r/854625
[00:18:01] <wikibugs>	 (PS2) Andrea Denisse: netmon: Put the netmon2002 as passive server [puppet] - https://gerrit.wikimedia.org/r/854625
[00:23:59] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[00:25:49] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[00:34:06] <wikibugs>	 (PS3) Andrea Denisse: netmon: Put the netmon2002 as passive server [puppet] - https://gerrit.wikimedia.org/r/854625
[00:42:05] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:17:15] <icinga-wm>	 PROBLEM - SSH on mw1338.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:38:52] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:52] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:53:52] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:53:57] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[01:55:55] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[02:01:49] <icinga-wm>	 PROBLEM - SSH on mw1330.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:06:46] <wikibugs>	 (PS5) Gergő Tisza: Add GrowthExperiments periodic maintenance scripts for user impact [puppet] - https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541)
[02:07:11] <wikibugs>	 (PS6) Gergő Tisza: [WIP] Add GrowthExperiments periodic maintenance scripts for user impact [puppet] - https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541)
[02:08:52] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:33:45] <icinga-wm>	 PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:58:35] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[03:01:01] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[03:02:03] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[03:18:33] <icinga-wm>	 RECOVERY - SSH on mw1338.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:19:13] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[03:20:47] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[03:27:49] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[03:43:36] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:32] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:52:48] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[04:02:40] <icinga-wm>	 RECOVERY - SSH on mw1330.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:15:48] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[04:17:44] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[04:34:32] <icinga-wm>	 RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:39:18] <icinga-wm>	 PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:43:24] <wikibugs>	 (Abandoned) Andrea Denisse: netmon: Add the netmon role to netmon2002 [puppet] - https://gerrit.wikimedia.org/r/854624 (owner: Andrea Denisse)
[05:40:08] <icinga-wm>	 RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:12:06] <wikibugs>	 (PS4) Andrea Denisse: netmon: Put the netmon2002 as passive server [puppet] - https://gerrit.wikimedia.org/r/854625
[06:17:30] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38034/console" [puppet] - https://gerrit.wikimedia.org/r/854625 (owner: Andrea Denisse)
[06:22:44] <wikibugs>	 (CR) Andrea Denisse: "Hello team, here are the PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/38034/" [puppet] - https://gerrit.wikimedia.org/r/854625 (owner: Andrea Denisse)
[06:33:24] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1013 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (14505) = 25.6% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[06:41:47] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 11404
[06:42:20] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 11404
[06:44:43] <wikibugs>	 (PS1) David Caro: labs: Add header and footer comment to avoid git conflicts [labs/private] - https://gerrit.wikimedia.org/r/854870
[06:53:52] <wikibugs>	 (CR) Ayounsi: P:netbox::host: create a motd for the status (2 comments) [puppet] - https://gerrit.wikimedia.org/r/849508 (https://phabricator.wikimedia.org/T320696) (owner: Jbond)
[06:55:36] <wikibugs>	 (CR) Ayounsi: [C: +1] dns: silence log for decommissioned devices [software/netbox-extras] - https://gerrit.wikimedia.org/r/852806 (https://phabricator.wikimedia.org/T320721) (owner: Volans)
[06:56:49] <wikibugs>	 (CR) Ayounsi: [C: +2] Rename Telia to Arelion [homer/public] - https://gerrit.wikimedia.org/r/829558 (owner: Ayounsi)
[06:56:53] <wikibugs>	 (PS4) Abijeet Patro: Enable logging for UpdateMessageBundleJob [mediawiki-config] - https://gerrit.wikimedia.org/r/853357 (https://phabricator.wikimedia.org/T322430)
[06:58:20] <wikibugs>	 (PS5) Abijeet Patro: Add channel for MessageBundle feature of Translate extension [mediawiki-config] - https://gerrit.wikimedia.org/r/853357 (https://phabricator.wikimedia.org/T322430)
[06:58:31] <wikibugs>	 (CR) Abijeet Patro: Add channel for MessageBundle feature of Translate extension (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/853357 (https://phabricator.wikimedia.org/T322430) (owner: Abijeet Patro)
[07:02:55] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 6774
[07:03:30] <wikibugs>	 (PS1) Andrea Denisse: netmon: Remove netmon1002 from alertmanager API rw [puppet] - https://gerrit.wikimedia.org/r/854874 (https://phabricator.wikimedia.org/T322321)
[07:04:13] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6774
[07:04:21] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8220
[07:05:06] <wikibugs>	 (PS1) Majavah: P:openstack: explicit rules for haproxy backend traffic POC [puppet] - https://gerrit.wikimedia.org/r/854875
[07:05:34] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8220
[07:06:02] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 61955
[07:06:53] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61955
[07:07:25] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 30990
[07:07:45] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38035/console" [puppet] - https://gerrit.wikimedia.org/r/854875 (owner: Majavah)
[07:08:01] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 30990
[07:08:14] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 37613
[07:08:24] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37613
[07:08:25] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38036/console" [puppet] - https://gerrit.wikimedia.org/r/854874 (https://phabricator.wikimedia.org/T322321) (owner: Andrea Denisse)
[07:08:46] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 29169
[07:09:14] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 29169
[07:09:21] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15412
[07:10:14] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15412
[07:10:25] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 3225
[07:10:42] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3225
[07:11:15] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 23889
[07:11:21] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 23889
[07:11:33] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 23889
[07:11:39] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 23889
[07:14:15] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 23889
[07:14:44] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 23889
[07:15:40] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 37693
[07:16:18] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37693
[07:16:25] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 54994
[07:17:39] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 54994
[07:18:51] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8309
[07:19:34] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8309
[07:19:42] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 29608
[07:20:18] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 29608
[07:20:35] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 37662
[07:20:50] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37662
[07:21:13] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 37271
[07:21:50] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37271
[07:22:05] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8218
[07:22:33] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8218
[07:22:50] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 6461
[07:24:08] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6461
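The run of logmsgbot entries above pairs each cookbook START with an END that carries a PASS/FAIL status and an exit code (AS 23889 failed twice with exit_code=99 before passing). A minimal sketch of how such END lines could be tallied follows. The regular expression is an assumption inferred only from the lines in this log, not from the cookbook tooling itself, and `summarize_cookbook_ends` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed END line shape, inferred from the log entries above:
#   ... !log user@host END (PASS|FAIL) - Cookbook NAME (exit_code=N) details
END_RE = re.compile(
    r"!log (?P<who>\S+) END \((?P<status>PASS|FAIL)\) - "
    r"Cookbook (?P<cookbook>\S+) \(exit_code=(?P<code>\d+)\)(?P<rest>.*)$"
)


def summarize_cookbook_ends(lines):
    """Count PASS/FAIL END entries and collect details of the failures."""
    counts = Counter()
    failures = []
    for line in lines:
        match = END_RE.search(line)
        if match is None:
            continue  # START lines, chat, and other bots are ignored
        counts[match["status"]] += 1
        if match["status"] == "FAIL":
            failures.append(
                (match["cookbook"], int(match["code"]), match["rest"].strip())
            )
    return counts, failures
```

Feeding the function the lines of this log would give a quick PASS/FAIL count plus the cookbook name, exit code, and trailing details for each failed run.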
[07:26:06] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:27:34] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:33:33] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:36:19] <_joe_>	 uhm
[07:36:24] <_joe_>	 lists down again?
[07:41:11] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:41:57] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:53:40] <wikibugs>	 (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/38036/" [puppet] - https://gerrit.wikimedia.org/r/854874 (https://phabricator.wikimedia.org/T322321) (owner: Andrea Denisse)
[07:56:07] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48975 bytes in 0.760 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:58:48] <wikibugs>	 (Abandoned) Hashar: python-build: reuse previously built wheels [docker-images/production-images] - https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) (owner: Hashar)
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221109T0800).
[08:00:05] <jouncebot>	 phuedx and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:18] <phuedx>	 Hello o/
[08:00:18] <wikibugs>	 (PS1) Elukey: toil::ganeti_ifupdown: fix systemctl path [puppet] - https://gerrit.wikimedia.org/r/854936 (https://phabricator.wikimedia.org/T273026)
[08:01:25] <wikibugs>	 (CR) Elukey: "ah no wait it is only on buster that systemctl is on /bin, sigh.. fixing" [puppet] - https://gerrit.wikimedia.org/r/854936 (https://phabricator.wikimedia.org/T273026) (owner: Elukey)
[08:02:13] * kart_ is here
[08:03:18] <wikibugs>	 (CR) Muehlenhoff: "On Buster and later /bin is a symlink to /usr/bin, so the net effect should be the same. Or did you run into an actual issue here?" [puppet] - https://gerrit.wikimedia.org/r/854936 (https://phabricator.wikimedia.org/T273026) (owner: Elukey)
[08:03:46] <wikibugs>	 (PS1) Giuseppe Lavagetto: CI: only patch helmfile charts path if no version is pinned [deployment-charts] - https://gerrit.wikimedia.org/r/854943
[08:04:32] <wikibugs>	 (PS5) Andrea Denisse: netmon: Put the netmon2002 as passive server [puppet] - https://gerrit.wikimedia.org/r/854625 (https://phabricator.wikimedia.org/T315523)
[08:04:37] <kart_>	 Anyone else around to deploy or should I go ahead with phuedx's patches?
[08:05:18] <kart_>	 phuedx: Can you rebase the first patch meanwhile? I'm not familiar with that config, so need to make sure rebase is correct :)
[08:05:41] <wikibugs>	 (CR) Elukey: toil::ganeti_ifupdown: fix systemctl path (1 comment) [puppet] - https://gerrit.wikimedia.org/r/854936 (https://phabricator.wikimedia.org/T273026) (owner: Elukey)
[08:06:07] <phuedx>	 kart_: It would help if I linked to the correct patch :/ One sec!
[08:06:13] <wikibugs>	 (PS3) Phuedx: EditAttemptStep sampling rate to 1 for group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/854475 (https://phabricator.wikimedia.org/T312016)
[08:06:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1018.eqiad.wmnet with OS bullseye
[08:06:36] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1018.eqiad.wmnet with OS bullseye
[08:06:43] <wikibugs>	 (CR) CI reject: [V: -1] CI: only patch helmfile charts path if no version is pinned [deployment-charts] - https://gerrit.wikimedia.org/r/854943 (owner: Giuseppe Lavagetto)
[08:07:09] <phuedx>	 kart_: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/854475 is rebased
[08:07:15] <phuedx>	 I've updated the Deployments page on wt
[08:07:21] <phuedx>	 jouncebot refresh
[08:07:21] <jouncebot>	 I refreshed my knowledge about deployments.
[08:07:55] <wikibugs>	 (PS6) Elukey: Upgrade to 1.15.3 [debs/istio] - https://gerrit.wikimedia.org/r/853936 (https://phabricator.wikimedia.org/T322193)
[08:08:22] <wikibugs>	 (CR) Elukey: Upgrade to 1.15.3 (1 comment) [debs/istio] - https://gerrit.wikimedia.org/r/853936 (https://phabricator.wikimedia.org/T322193) (owner: Elukey)
[08:09:08] <kart_>	 phuedx: OK. Checking..
[08:09:14] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] Upgrade to 1.15.3 [debs/istio] - https://gerrit.wikimedia.org/r/853936 (https://phabricator.wikimedia.org/T322193) (owner: Elukey)
[08:10:27] <kart_>	 phuedx: Deploying. Will ping once it is available to test on mwdebug.
[08:10:54] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "if that's not enough, we can try later with more lines of #" [labs/private] - https://gerrit.wikimedia.org/r/854870 (owner: David Caro)
[08:11:35] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/854475 (https://phabricator.wikimedia.org/T312016) (owner: Phuedx)
[08:12:35] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "I'm ok to move this to /bin if it fixes serpens and seaborgium, which I'm assuming we dist-upgraded in place ? Hence the reason why /bin i" [puppet] - https://gerrit.wikimedia.org/r/854936 (https://phabricator.wikimedia.org/T273026) (owner: Elukey)
[08:12:39] <wikibugs>	 (Merged) jenkins-bot: EditAttemptStep sampling rate to 1 for group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/854475 (https://phabricator.wikimedia.org/T312016) (owner: Phuedx)
[08:12:56] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:854475|EditAttemptStep sampling rate to 1 for group1 wikis (T312016)]]
[08:13:00] <stashbot>	 T312016: Increase EditAttemptStep sampling rate(s) to 100% - https://phabricator.wikimedia.org/T312016
[08:13:18] <logmsgbot>	 !log kartik@deploy1002 kartik and phuedx: Backport for [[gerrit:854475|EditAttemptStep sampling rate to 1 for group1 wikis (T312016)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[08:14:00] <kart_>	 phuedx: Available to test on mwdebug1002/2002/1001/2001 Please test.
[08:14:10] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good. That is in fact a corner case we need to consider: So all buster systems which were installed with buster use the merged usr s" [puppet] - https://gerrit.wikimedia.org/r/854936 (https://phabricator.wikimedia.org/T273026) (owner: Elukey)
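The review thread on change 854936 turns on where systemctl lives: on freshly installed Buster and later hosts /bin is a symlink to /usr/bin, while hosts upgraded in place may not have that layout. A minimal Python sketch of one way to avoid hardcoding either path follows. It is not the fix made in the Puppet change itself, and `resolve_binary` and its `fallback_dirs` parameter are hypothetical names used only for illustration.

```python
import os
import shutil


def resolve_binary(name, fallback_dirs=("/usr/bin", "/bin")):
    """Return an absolute path to `name`, or None if it cannot be found.

    The PATH lookup is tried first; the explicit directories cover callers
    that run with a minimal or empty PATH.
    """
    found = shutil.which(name)
    if found:
        return os.path.abspath(found)
    for directory in fallback_dirs:
        candidate = os.path.join(directory, name)
        # Accept only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

Called as `resolve_binary("systemctl")`, this would return whichever location exists on the host, so the same code works with or without the merged /usr layout.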
[08:15:09] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] grafana: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/854573 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:16:37] <wikibugs>	 (CR) Muehlenhoff: [C: +2] grafana: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/854573 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:16:48] <phuedx>	 kart_: LGTM. Verified on testwiki, hewiki, and enwiki that the sampling rates are 1, 1, and 0 respectively. Thanks!
[08:17:01] <kart_>	 phuedx: cool. deploying..
[08:17:11] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] netmon: Remove netmon1002 from alertmanager API rw [puppet] - https://gerrit.wikimedia.org/r/854874 (https://phabricator.wikimedia.org/T322321) (owner: Andrea Denisse)
[08:18:02] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s2 #page on db1182 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table geo_tags is corrupt: try to repair it on query. Default database: zhwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:18:18] <XioNoX>	 hello hello
[08:18:55] <akosiaris>	 o/
[08:18:58] <wikibugs>	 (PS6) Phuedx: Update Metrics Platform streams [mediawiki-config] - https://gerrit.wikimedia.org/r/852838 (https://phabricator.wikimedia.org/T322277)
[08:19:16] <XioNoX>	 I'm tempted to run the depool replica runbook here
[08:19:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[08:19:25] <akosiaris>	 XioNoX: I was about to suggest
[08:19:33] <wikibugs>	 (CR) Filippo Giunchedi: "My understanding is that we should apply the netmon role to netmon2002 first, or e.g. profile::rancing will fail to rsync to the new passi" [puppet] - https://gerrit.wikimedia.org/r/854625 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[08:19:35] <akosiaris>	 go for it
[08:19:40] <XioNoX>	 is it related to the current work kart_ ?
[08:20:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[08:20:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[08:20:18] <akosiaris>	 index for table geo_tags is corrupt: try to repair it on query.
[08:20:22] <akosiaris>	 doubtful
[08:20:45] <logmsgbot>	 !log ayounsi@cumin1001 dbctl commit (dc=all): 'Depool db1182', diff saved to https://phabricator.wikimedia.org/P38772 and previous config saved to /var/cache/conftool/dbconfig/20221109-082045-ayounsi.json
[08:20:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1018.eqiad.wmnet with reason: host reimage
[08:21:07] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:854475|EditAttemptStep sampling rate to 1 for group1 wikis (T312016)]] (duration: 08m 10s)
[08:21:11] <stashbot>	 T312016: Increase EditAttemptStep sampling rate(s) to 100% - https://phabricator.wikimedia.org/T312016
[08:21:12] <XioNoX>	 pinging the dba in case they're around Amir1, marostegui
[08:21:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[08:21:36] <XioNoX>	 alright it's depooled, now what?
[08:22:33] * Emperor got emailed, is here
[08:22:40] <kart_>	 phuedx: Over to next patch..
[08:23:10] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for prometheus-ganeti-exporter [puppet] - https://gerrit.wikimedia.org/r/854578 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:23:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1018.eqiad.wmnet with reason: host reimage
[08:23:36] <_joe_>	 uh I just got paged
[08:23:40] <_joe_>	 !incidents
[08:23:40] <sirenbot>	 3147 (ACKED)  db1182 (paged)/MariaDB Replica SQL: s2 (paged)
[08:23:41] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/852838 (https://phabricator.wikimedia.org/T322277) (owner: Phuedx)
[08:23:42] <wikibugs>	 (CR) Elukey: [C: +2] toil::ganeti_ifupdown: fix systemctl path [puppet] - https://gerrit.wikimedia.org/r/854936 (https://phabricator.wikimedia.org/T273026) (owner: Elukey)
[08:23:43] <volans>	 .me too
[08:23:44] <marostegui>	 XioNoX: I'm off today and not next to my computer 
[08:23:45] <XioNoX>	 yeah I forgot to ack
[08:23:50] <marostegui>	 you can just depooo it
[08:23:55] <marostegui>	 and create a task
[08:23:59] <akosiaris>	 forgot to ack
[08:23:59] <XioNoX>	 marostegui: thx, go back to vacations
[08:24:03] <_joe_>	 depoo is the best neologism
[08:24:10] <akosiaris>	 already depooled
[08:24:12] <Emperor>	 XioNoX: I think once depooled, oh, what maros.tegui said quicker than me
[08:24:15] <_joe_>	 depoo that server please
[08:24:35] <marostegui>	 just downtime it for like 2 days and create a task
[08:24:39] <XioNoX>	 Emperor: thx! will the page resolve or we keep it as ACKed?
[08:24:39] <wikibugs>	 (Merged) jenkins-bot: Update Metrics Platform streams [mediawiki-config] - https://gerrit.wikimedia.org/r/852838 (https://phabricator.wikimedia.org/T322277) (owner: Phuedx)
[08:24:39] <marostegui>	 so we can get to it
[08:24:44] <XioNoX>	 sounds good!
[08:24:52] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:852838|Update Metrics Platform streams (T322277)]]
[08:24:56] <stashbot>	 T322277: Generate Edit Attempt test data - https://phabricator.wikimedia.org/T322277
[08:24:56] <Emperor>	 XioNoX: good question :-/
[08:24:58] <XioNoX>	 thx!
[08:25:07] <marostegui>	 thanks XioNoX 
[08:25:12] <logmsgbot>	 !log kartik@deploy1002 kartik and phuedx: Backport for [[gerrit:852838|Update Metrics Platform streams (T322277)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[08:25:24] <Emperor>	 XioNoX: I think add downtime for a couple of days to make sure it doesn't p.age again
[08:25:36] <kart_>	 phuedx: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/852838 available to test on mwdebug
[08:25:43] <XioNoX>	 marostegui: enough communicate, you can go back to aviate and navigate
[08:25:53] <XioNoX>	 Emperor: yep, on it! thx
[08:26:06] <phuedx>	 kart_: Looking
[08:26:15] <Emperor>	 Amir.1 is I think in today, so I suspect he'll have a look when it reaches the right timezone :)
[08:27:17] <wikibugs>	 (Restored) Andrea Denisse: netmon: Add the netmon role to netmon2002 [puppet] - https://gerrit.wikimedia.org/r/854624 (owner: Andrea Denisse)
[08:27:32] <sobanski>	 XioNoX: just looking at this and lacking scrollback, but if the question was about Splunk I’d resolve it there, otherwise I think it'll refire in 24h
[08:28:16] <phuedx>	 kart_: LGTM. I double-checked configurations on testwiki and hewiki
[08:29:14] <kart_>	 phuedx: cool. Deploying..
[08:30:35] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1182.eqiad.wmnet with reason: paged then depooled
[08:30:37] <wikibugs>	 (PS1) Volans: netbox: restore 1D TTL on the dyna CNAME [dns] - https://gerrit.wikimedia.org/r/854945 (https://phabricator.wikimedia.org/T322700)
[08:30:49] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1182.eqiad.wmnet with reason: paged then depooled
[08:31:05] <wikibugs>	 (PS1) Andrea Denisse: netmon: Add regex to match all the netmon instances. Bug: T315523 Change-Id: I4d1e56b42486bbaafc96248fd0d4871555e64d2d [puppet] - https://gerrit.wikimedia.org/r/854946 (https://phabricator.wikimedia.org/T315523)
[08:31:15] <wikibugs>	 (PS1) Elukey: aptrepo: add new component for istio 1.15.3 in bullseye-wikimedia [puppet] - https://gerrit.wikimedia.org/r/854947 (https://phabricator.wikimedia.org/T322193)
[08:31:16] <XioNoX>	 sobanski: thx, resolved!
[08:31:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[08:31:31] <XioNoX>	 https://phabricator.wikimedia.org/T322720 is ready for DBAs
[08:31:40] <wikibugs>	 (CR) CI reject: [V: -1] netmon: Add regex to match all the netmon instances. Bug: T315523 Change-Id: I4d1e56b42486bbaafc96248fd0d4871555e64d2d [puppet] - https://gerrit.wikimedia.org/r/854946 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[08:32:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[08:32:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[08:32:44] <Amir1>	 XioNoX: I just woke up with the page
[08:32:49] <wikibugs>	 (CR) Elukey: [C: +1] calico: Align formatting with k8s module and profiles [puppet] - https://gerrit.wikimedia.org/r/854562 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[08:33:09] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:852838|Update Metrics Platform streams (T322277)]] (duration: 08m 17s)
[08:33:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[08:33:11] <XioNoX>	 Amir1: good morning, apologies for letting it escalate to batphone
[08:33:16] <stashbot>	 T322277: Generate Edit Attempt test data - https://phabricator.wikimedia.org/T322277
[08:33:16] <Amir1>	 I'll get to it soon. If it's depooled then we are good for now. Thanks 
[08:33:23] <Amir1>	 Oh no worries 
[08:33:24] <kart_>	 phuedx: Done.
[08:33:30] <phuedx>	 kart_: TYVM
[08:33:32] * Emperor passes Amir.1 some coffee
[08:34:00] <wikibugs>	 (PS3) Andrea Denisse: netmon: Add regex to match all the netmon instances. Bug: T315523 [puppet] - https://gerrit.wikimedia.org/r/854624 (https://phabricator.wikimedia.org/T315523)
[08:34:15] <phuedx>	 I have a full Chemex if you need any more coffee
[08:34:30] <kart_>	 abijeet: Our patch is next..
[08:34:35] <wikibugs>	 (CR) CI reject: [V: -1] netmon: Add regex to match all the netmon instances. Bug: T315523 [puppet] - https://gerrit.wikimedia.org/r/854624 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[08:35:06] <wikibugs>	 (Abandoned) Andrea Denisse: netmon: Add regex to match all the netmon instances. Bug: T315523 Change-Id: I4d1e56b42486bbaafc96248fd0d4871555e64d2d [puppet] - https://gerrit.wikimedia.org/r/854946 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[08:35:59] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hibashaath - https://phabricator.wikimedia.org/T322146 (fgiunchedi) I have sent the temporary credentials via email following https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a_principa...
[08:36:04] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (fgiunchedi) I have sent the temporary credentials via email following https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a_principal_fo...
[08:36:13] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (fgiunchedi) Open→Resolved
[08:36:53] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (fgiunchedi) Open→Resolved I have sent the temporary credentials via email following https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerb...
[08:37:08] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/853357 (https://phabricator.wikimedia.org/T322430) (owner: Abijeet Patro)
[08:37:38] <wikibugs>	 (PS6) Abijeet Patro: Add channel for MessageBundle feature of Translate extension [mediawiki-config] - https://gerrit.wikimedia.org/r/853357 (https://phabricator.wikimedia.org/T322430)
[08:38:24] <Amir1>	 Emperor: <3
[08:38:52] <wikibugs>	 (CR) TrainBranchBot: "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/853357 (https://phabricator.wikimedia.org/T322430) (owner: Abijeet Patro)
[08:39:24] <wikibugs>	 (PS4) Ayounsi: Add Peering News to Puppet [puppet] - https://gerrit.wikimedia.org/r/849114
[08:39:36] <wikibugs>	 (Merged) jenkins-bot: Add channel for MessageBundle feature of Translate extension [mediawiki-config] - https://gerrit.wikimedia.org/r/853357 (https://phabricator.wikimedia.org/T322430) (owner: Abijeet Patro)
[08:39:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1018.eqiad.wmnet with OS bullseye
[08:39:48] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:853357|Add channel for MessageBundle feature of Translate extension (T322430)]]
[08:39:48] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1018.eqiad.wmnet with OS bullseye completed: - ganeti1018 (**PASS**)   - Downtimed on...
[08:39:51] <stashbot>	 T322430: Message bundle groups not created for pages with translate-messagebundle page content model - https://phabricator.wikimedia.org/T322430
[08:40:08] <logmsgbot>	 !log kartik@deploy1002 kartik and abi: Backport for [[gerrit:853357|Add channel for MessageBundle feature of Translate extension (T322430)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[08:40:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1024.eqiad.wmnet with OS bullseye
[08:40:34] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bullseye
[08:40:41] <wikibugs>	 (PS4) Andrea Denisse: netmon: Add regex to match all the netmon instances. [puppet] - https://gerrit.wikimedia.org/r/854624 (https://phabricator.wikimedia.org/T315523)
[08:41:21] <wikibugs>	 (CR) Ayounsi: Add Peering News to Puppet (7 comments) [puppet] - https://gerrit.wikimedia.org/r/849114 (owner: Ayounsi)
[08:42:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[08:42:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[08:42:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[08:42:43] <wikibugs>	 (CR) Andrea Denisse: "PCC results: https://gerrit.wikimedia.org/r/#/c/854624/" [puppet] - https://gerrit.wikimedia.org/r/854624 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[08:42:48] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[08:42:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T322618)', diff saved to https://phabricator.wikimedia.org/P38773 and previous config saved to /var/cache/conftool/dbconfig/20221109-084254-ladsgroup.json
[08:42:58] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[08:43:08] <icinga-wm>	 RECOVERY - Check systemd state on serpens is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:43:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[08:44:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[08:44:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[08:44:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[08:45:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[08:45:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:45:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[08:45:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:45:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T322618)', diff saved to https://phabricator.wikimedia.org/P38774 and previous config saved to /var/cache/conftool/dbconfig/20221109-084525-ladsgroup.json
[08:48:22] <wikibugs>	 (CR) David Caro: [C: +2] labs: Add header and footer comment to avoid git conflicts (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/854870 (owner: David Caro)
[08:48:40] <wikibugs>	 (CR) David Caro: [V: +2 C: +2] labs: Add header and footer comment to avoid git conflicts [labs/private] - https://gerrit.wikimedia.org/r/854870 (owner: David Caro)
[08:48:49] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-usersfor David.pujol - https://phabricator.wikimedia.org/T322670 (fgiunchedi) Hello @David.pujol ! I'll be processing this request. Overall looks good to me, though note that as a contractor we'll be adding you to `nda` group not `wmf`....
[08:49:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T322618)', diff saved to https://phabricator.wikimedia.org/P38775 and previous config saved to /var/cache/conftool/dbconfig/20221109-084934-ladsgroup.json
[08:49:39] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[08:49:44] <wikibugs>	 SRE, SRE-Access-Requests: Add new SSH key for Sam Smith - https://phabricator.wikimedia.org/T322723 (phuedx)
[08:50:02] <wikibugs>	 SRE, SRE-Access-Requests: Add new SSH key for Sam Smith - https://phabricator.wikimedia.org/T322723 (phuedx)
[08:51:08] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:853357|Add channel for MessageBundle feature of Translate extension (T322430)]] (duration: 11m 19s)
[08:51:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T322618)', diff saved to https://phabricator.wikimedia.org/P38776 and previous config saved to /var/cache/conftool/dbconfig/20221109-085109-ladsgroup.json
[08:51:11] <stashbot>	 T322430: Message bundle groups not created for pages with translate-messagebundle page content model - https://phabricator.wikimedia.org/T322430
[08:51:59] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-usersfor David.pujol - https://phabricator.wikimedia.org/T322670 (fgiunchedi)
[08:52:44] <icinga-wm>	 RECOVERY - Check systemd state on seaborgium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:54:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance
[08:54:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1024.eqiad.wmnet with reason: host reimage
[08:55:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[08:55:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance
[08:55:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[08:55:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:55:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:55:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T318605)', diff saved to https://phabricator.wikimedia.org/P38777 and previous config saved to /var/cache/conftool/dbconfig/20221109-085542-ladsgroup.json
[08:55:47] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[08:56:08] <wikibugs>	 (CR) Ayounsi: [C: -1] "1st pass, not tested but seems fine overall, mostly naming changes required." [homer/public] - https://gerrit.wikimedia.org/r/854110 (https://phabricator.wikimedia.org/T321120) (owner: JHathaway)
[08:57:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1024.eqiad.wmnet with reason: host reimage
[08:59:09] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 54994
[09:00:25] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] calico: Align formatting with k8s module and profiles [puppet] - https://gerrit.wikimedia.org/r/854562 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[09:01:12] <wikibugs>	 (PS1) Andrea Denisse: netmon: Open LibreNMS port for netmon2002. [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523)
[09:01:37] <wikibugs>	 (PS1) Filippo Giunchedi: admin: add dpujol to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/854952 (https://phabricator.wikimedia.org/T322670)
[09:03:05] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 54994
[09:03:09] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38038/console" [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[09:03:09] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1013 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (14505) = 25.6% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[09:03:24] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-usersfor David.pujol - https://phabricator.wikimedia.org/T322670 (fgiunchedi) I have copy/pasted the expiry dates from other @tmlt.io folks, please @Htriedman confirm I got that right on https://gerrit.wikimedia.or...
[09:04:07] <wikibugs>	 (CR) Volans: "couple of questions inline" [puppet] - https://gerrit.wikimedia.org/r/849114 (owner: Ayounsi)
[09:04:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P38778 and previous config saved to /var/cache/conftool/dbconfig/20221109-090441-ladsgroup.json
[09:04:44] <wikibugs>	 Puppet, Infrastructure-Foundations: Puppet failure on deploy-1004.devtools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T319681 (hashar) Open→Resolved a:hashar https://gerrit.wikimedia.org/r/c/operations/puppet/+/844515 added `profile::mediawiki::scap_client::is_master: true` to...
[09:06:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P38779 and previous config saved to /var/cache/conftool/dbconfig/20221109-090616-ladsgroup.json
[09:07:18] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s2 #page on db1182 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:07:19] <moritzm>	 !log installing nodejs security updates 
[09:07:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:31] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): The python-build images regenerate wheels even when matching ones are already available - https://phabricator.wikimedia.org/T259611 (hashar) Open→Declined Most dependencies on Pypi...
[09:12:33] <wikibugs>	 (CR) Vgutierrez: [C: +2] "Thanks John!" [puppet] - https://gerrit.wikimedia.org/r/854571 (owner: Jbond)
[09:13:25] <wikibugs>	 (PS1) Filippo Giunchedi: admin: update ssh key for Sam Smith [puppet] - https://gerrit.wikimedia.org/r/854954 (https://phabricator.wikimedia.org/T322723)
[09:13:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1024.eqiad.wmnet with OS bullseye
[09:14:01] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bullseye completed: - ganeti1024 (**PASS**)   - Downtimed on...
[09:14:56] <wikibugs>	 (CR) Filippo Giunchedi: "Request confirmed on Meet" [puppet] - https://gerrit.wikimedia.org/r/854954 (https://phabricator.wikimedia.org/T322723) (owner: Filippo Giunchedi)
[09:15:10] <wikibugs>	 (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1001/38038/" [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[09:18:13] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[09:19:08] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, I think applying netmon role to just-provisioned hosts is safe" [puppet] - https://gerrit.wikimedia.org/r/854624 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[09:19:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P38780 and previous config saved to /var/cache/conftool/dbconfig/20221109-091947-ladsgroup.json
[09:21:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P38781 and previous config saved to /var/cache/conftool/dbconfig/20221109-092122-ladsgroup.json
[09:21:40] <wikibugs>	 (CR) Vgutierrez: [C: +1] prometheus: Rename ats_ metrics to trafficserver_ [puppet] - https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[09:22:27] <wikibugs>	 (PS2) Vgutierrez: trafficserver: 9.x upgrade: separate metric current_client_connections [puppet] - https://gerrit.wikimedia.org/r/803285 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[09:22:38] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Add new SSH key for Sam Smith - https://phabricator.wikimedia.org/T322723 (fgiunchedi) p:Triage→Medium
[09:22:49] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-usersfor David.pujol - https://phabricator.wikimedia.org/T322670 (fgiunchedi) p:Triage→Medium
[09:23:42] <wikibugs>	 (CR) Vgutierrez: [C: +2] trafficserver: 9.x upgrade: separate metric current_client_connections [puppet] - https://gerrit.wikimedia.org/r/803285 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[09:27:17] <wikibugs>	 (CR) Ayounsi: Add Peering News to Puppet (2 comments) [puppet] - https://gerrit.wikimedia.org/r/849114 (owner: Ayounsi)
[09:28:23] <wikibugs>	 (CR) Phuedx: [C: +1] "Thanks for getting to my request so quickly, Filippo!" [puppet] - https://gerrit.wikimedia.org/r/854954 (https://phabricator.wikimedia.org/T322723) (owner: Filippo Giunchedi)
[09:30:48] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] "Sure np!" [puppet] - https://gerrit.wikimedia.org/r/854954 (https://phabricator.wikimedia.org/T322723) (owner: Filippo Giunchedi)
[09:32:03] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Add new SSH key for Sam Smith - https://phabricator.wikimedia.org/T322723 (fgiunchedi) Open→Resolved a:fgiunchedi I've confirmed the request on Meet and patch is merged, new access will be live in the next 30 min. I'm resolving the task though f...
[09:33:12] <wikibugs>	 (CR) Clément Goubert: [V: +1 C: +2] mwdebug: Final cleanup [puppet] - https://gerrit.wikimedia.org/r/854559 (https://phabricator.wikimedia.org/T321201) (owner: Clément Goubert)
[09:34:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T322618)', diff saved to https://phabricator.wikimedia.org/P38782 and previous config saved to /var/cache/conftool/dbconfig/20221109-093454-ladsgroup.json
[09:34:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[09:34:59] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[09:35:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[09:35:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1018.eqiad.wmnet
[09:36:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T322618)', diff saved to https://phabricator.wikimedia.org/P38783 and previous config saved to /var/cache/conftool/dbconfig/20221109-093629-ladsgroup.json
[09:36:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance
[09:36:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance
[09:36:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T322618)', diff saved to https://phabricator.wikimedia.org/P38784 and previous config saved to /var/cache/conftool/dbconfig/20221109-093650-ladsgroup.json
[09:37:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[09:37:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[09:37:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T322618)', diff saved to https://phabricator.wikimedia.org/P38785 and previous config saved to /var/cache/conftool/dbconfig/20221109-093751-ladsgroup.json
[09:38:09] <wikibugs>	 (PS3) Muehlenhoff: hadoop: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840144 (https://phabricator.wikimedia.org/T308013)
[09:40:26] <wikibugs>	 SRE, User-Elukey: Investigate janitor, maintenance emails parser - https://phabricator.wikimedia.org/T230835 (ayounsi) Another relevant presentation: [[ https://www.youtube.com/watch?v=2EekU76VMG4 | NANOG: Lifecycle Of Vendor Maintenances At Meta's Backbone ]]
[09:43:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1018.eqiad.wmnet
[09:45:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T322618)', diff saved to https://phabricator.wikimedia.org/P38786 and previous config saved to /var/cache/conftool/dbconfig/20221109-094506-ladsgroup.json
[09:45:10] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[09:46:15] <wikibugs>	 (CR) Volans: "replies inline" [puppet] - https://gerrit.wikimedia.org/r/849114 (owner: Ayounsi)
[09:46:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet
[09:49:14] <wikibugs>	 (CR) Andrea Denisse: [C: +2] netmon: Add regex to match all the netmon instances. [puppet] - https://gerrit.wikimedia.org/r/854624 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[09:49:56] <wikibugs>	 (PS3) Hashar: gerrit: remove gerrit-theme.js [puppet] - https://gerrit.wikimedia.org/r/853061 (https://phabricator.wikimedia.org/T319378)
[09:53:35] <wikibugs>	 (PS1) Volans: sre.hosts.reimage: set Netbox to active [cookbooks] - https://gerrit.wikimedia.org/r/854961 (https://phabricator.wikimedia.org/T320696)
[09:54:11] <wikibugs>	 (CR) Volans: "This can be merged anytime, as it doesn't depend on the migration of the status in Netbox" [cookbooks] - https://gerrit.wikimedia.org/r/854961 (https://phabricator.wikimedia.org/T320696) (owner: Volans)
[09:55:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet
[09:58:33] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[10:00:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P38787 and previous config saved to /var/cache/conftool/dbconfig/20221109-100013-ladsgroup.json
[10:01:33] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] eventgate: Fix canary release routing [deployment-charts] - https://gerrit.wikimedia.org/r/854507 (owner: Clément Goubert)
[10:02:09] <volans>	 !log set Netbox status to Active for 299 devices with role=server, tenant=none, status=staged - T320696
[10:02:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:13] <stashbot>	 T320696: Reduce the count of Netbox devices with incorrect status - https://phabricator.wikimedia.org/T320696
[10:02:59] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1013 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (14505) = 25.6% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[10:03:51] <moritzm>	 ^ these warnings are expected and will go away soonish, nodes are being reimaged and when re-added the instance count gets reshuffled
[10:04:41] <icinga-wm>	 PROBLEM - librenms.wikimedia.org requires authentication on netmon2002 is CRITICAL: connect to address 208.80.153.9 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[10:06:18] <wikibugs>	 (PS2) JMeybohm: CI: only patch helmfile charts path if no version is pinned [deployment-charts] - https://gerrit.wikimedia.org/r/854943 (owner: Giuseppe Lavagetto)
[10:06:20] <wikibugs>	 (PS1) JMeybohm: cfssl-issuer-crds: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/854966
[10:06:22] <wikibugs>	 (PS1) JMeybohm: admin_ng: Bump wmf-stable/cfssl-issuer-crds pinning [deployment-charts] - https://gerrit.wikimedia.org/r/854967
[10:06:24] <wikibugs>	 (PS1) JMeybohm: CI: Ensure the wmf-stable helm repo exists [deployment-charts] - https://gerrit.wikimedia.org/r/854968
[10:07:09] <icinga-wm>	 PROBLEM - librenms.wikimedia.org tls expiry on netmon2002 is CRITICAL: connect to address 208.80.153.9 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[10:07:16] <wikibugs>	 (PS1) Volans: Remove obsolete tool script [software/netbox-extras] - https://gerrit.wikimedia.org/r/854969
[10:07:18] <wikibugs>	 (PS1) Volans: Netbox statuses: no more servers in staged [software/netbox-extras] - https://gerrit.wikimedia.org/r/854970 (https://phabricator.wikimedia.org/T320696)
[10:07:52] <wikibugs>	 (PS11) Elukey: Add a basic puppetization for Benthos [puppet] - https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981)
[10:07:54] <wikibugs>	 (PS13) Elukey: centrallog: add first prototype of webrequest-live with Benthos [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981)
[10:11:37] <wikibugs>	 (CR) CI reject: [V: -1] CI: only patch helmfile charts path if no version is pinned [deployment-charts] - https://gerrit.wikimedia.org/r/854943 (owner: Giuseppe Lavagetto)
[10:11:42] <wikibugs>	 (PS2) JMeybohm: cfssl-issuer-crds: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/854966 (https://phabricator.wikimedia.org/T306165)
[10:11:44] <wikibugs>	 (PS2) JMeybohm: admin_ng: Bump wmf-stable/cfssl-issuer-crds pinning [deployment-charts] - https://gerrit.wikimedia.org/r/854967
[10:11:46] <wikibugs>	 (PS2) JMeybohm: CI: Ensure the wmf-stable helm repo exists [deployment-charts] - https://gerrit.wikimedia.org/r/854968
[10:11:48] <wikibugs>	 (PS3) JMeybohm: CI: only patch helmfile charts path if no version is pinned [deployment-charts] - https://gerrit.wikimedia.org/r/854943 (owner: Giuseppe Lavagetto)
[10:12:15] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38040/console" [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[10:14:29] <icinga-wm>	 RECOVERY - librenms.wikimedia.org requires authentication on netmon2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 661 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[10:14:53] <icinga-wm>	 RECOVERY - librenms.wikimedia.org tls expiry on netmon2002 is OK: OK - Certificate librenms.wikimedia.org will expire on Sun 29 Jan 2023 01:19:54 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[10:15:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P38788 and previous config saved to /var/cache/conftool/dbconfig/20221109-101519-ladsgroup.json
[10:15:57] <wikibugs>	 (PS12) Elukey: Add a basic puppetization for Benthos [puppet] - https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981)
[10:15:59] <wikibugs>	 (PS14) Elukey: centrallog: add first prototype of webrequest-live with Benthos [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981)
[10:16:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1018.eqiad.wmnet to cluster eqiad and group B
[10:16:29] <wikibugs>	 (CR) jenkins-bot: CI: only patch helmfile charts path if no version is pinned [deployment-charts] - https://gerrit.wikimedia.org/r/854943 (owner: Giuseppe Lavagetto)
[10:17:12] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38042/console" [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[10:17:33] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1001.eqiad.wmnet
[10:17:42] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1018.eqiad.wmnet to cluster eqiad and group B
[10:22:00] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1001.eqiad.wmnet
[10:23:51] <wikibugs>	 (CR) Clément Goubert: eventgate: Fix canary release routing (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/854507 (owner: Clément Goubert)
[10:24:01] <wikibugs>	 (CR) Andrea Denisse: [C: +2] netmon: Remove netmon1002 from alertmanager API rw [puppet] - https://gerrit.wikimedia.org/r/854874 (https://phabricator.wikimedia.org/T322321) (owner: Andrea Denisse)
[10:26:27] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1013 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (14505) = 25.6% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[10:30:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T322618)', diff saved to https://phabricator.wikimedia.org/P38791 and previous config saved to /var/cache/conftool/dbconfig/20221109-103026-ladsgroup.json
[10:30:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[10:30:31] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[10:30:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[10:30:43] <wikibugs>	 (PS1) Andrea Denisse: netmon: Add netmon2002 to the alertmanager rw api [puppet] - https://gerrit.wikimedia.org/r/854974 (https://phabricator.wikimedia.org/T315523)
[10:32:54] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38043/console" [puppet] - https://gerrit.wikimedia.org/r/854974 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[10:34:01] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/38043/" [puppet] - https://gerrit.wikimedia.org/r/854974 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[10:35:01] <wikibugs>	 (CR) JMeybohm: [C: +2] cfssl-issuer-crds: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/854966 (https://phabricator.wikimedia.org/T306165) (owner: JMeybohm)
[10:35:25] <wikibugs>	 (CR) JMeybohm: [C: +2] admin_ng: Bump wmf-stable/cfssl-issuer-crds pinning [deployment-charts] - https://gerrit.wikimedia.org/r/854967 (owner: JMeybohm)
[10:37:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[10:37:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[10:37:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T322618)', diff saved to https://phabricator.wikimedia.org/P38792 and previous config saved to /var/cache/conftool/dbconfig/20221109-103722-ladsgroup.json
[10:37:26] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[10:37:54] <wikibugs>	 (PS1) Giuseppe Lavagetto: deploy-mwdebug: also update releases files for all other deployments [puppet] - https://gerrit.wikimedia.org/r/854975
[10:38:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T322618)', diff saved to https://phabricator.wikimedia.org/P38793 and previous config saved to /var/cache/conftool/dbconfig/20221109-103806-ladsgroup.json
[10:38:28] <wikibugs>	 (Merged) jenkins-bot: cfssl-issuer-crds: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/854966 (https://phabricator.wikimedia.org/T306165) (owner: JMeybohm)
[10:39:02] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[10:39:08] <wikibugs>	 (Merged) jenkins-bot: admin_ng: Bump wmf-stable/cfssl-issuer-crds pinning [deployment-charts] - https://gerrit.wikimedia.org/r/854967 (owner: JMeybohm)
[10:40:02] <wikibugs>	 (CR) CI reject: [V: -1] deploy-mwdebug: also update releases files for all other deployments [puppet] - https://gerrit.wikimedia.org/r/854975 (owner: Giuseppe Lavagetto)
[10:40:12] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[10:45:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T322618)', diff saved to https://phabricator.wikimedia.org/P38794 and previous config saved to /var/cache/conftool/dbconfig/20221109-104548-ladsgroup.json
[10:45:53] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[10:52:39] <wikibugs>	 (CR) Filippo Giunchedi: "This patch removes 2001 too, I _think_ that's fine as long as 2002 is put in service while 2001 is the inactive server, IMHO best to add n" [puppet] - https://gerrit.wikimedia.org/r/854974 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[10:53:09] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] gerrit: remove gerrit-theme.js [puppet] - https://gerrit.wikimedia.org/r/853061 (https://phabricator.wikimedia.org/T319378) (owner: Hashar)
[10:53:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P38796 and previous config saved to /var/cache/conftool/dbconfig/20221109-105313-ladsgroup.json
[10:54:14] <wikibugs>	 (PS2) Hnowlan: Make swift connector aware of cacert file [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/854563 (https://phabricator.wikimedia.org/T312104)
[10:56:44] <icinga-wm>	 RECOVERY - Ganeti memory on ganeti1013 is OK: OK Memory 89% used https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[10:58:03] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff)
[10:58:12] <wikibugs>	 (CR) JMeybohm: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/854943 (owner: Giuseppe Lavagetto)
[10:58:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1024.eqiad.wmnet to cluster eqiad and group C
[10:59:52] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1024.eqiad.wmnet to cluster eqiad and group C
[11:00:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P38797 and previous config saved to /var/cache/conftool/dbconfig/20221109-110055-ladsgroup.json
[11:04:16] <wikibugs>	 (CR) JMeybohm: [C: +2] CI: only patch helmfile charts path if no version is pinned [deployment-charts] - https://gerrit.wikimedia.org/r/854943 (owner: Giuseppe Lavagetto)
[11:04:19] <wikibugs>	 (CR) JMeybohm: [C: +2] CI: Ensure the wmf-stable helm repo exists [deployment-charts] - https://gerrit.wikimedia.org/r/854968 (owner: JMeybohm)
[11:04:40] <wikibugs>	 (CR) Hashar: [C: +2] "The gerrit-theme.js file is gone from Puppet ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/853061 ) and would have to be manually" [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/853052 (https://phabricator.wikimedia.org/T319378) (owner: Hashar)
[11:04:46] <wikibugs>	 (CR) Hashar: [C: +2] Move test result table to a standalone plugin [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/853056 (https://phabricator.wikimedia.org/T319378) (owner: Hashar)
[11:04:53] <wikibugs>	 (CR) Hashar: [C: +2] Move custom CSS style to a standalone plugin [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/853057 (https://phabricator.wikimedia.org/T319378) (owner: Hashar)
[11:05:01] <wikibugs>	 (CR) Hashar: [C: +2] Move custom links to a standalone plugin [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/853058 (https://phabricator.wikimedia.org/T319378) (owner: Hashar)
[11:05:32] <wikibugs>	 (Merged) jenkins-bot: Import gerrit-theme.js history from Puppet [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/853052 (https://phabricator.wikimedia.org/T319378) (owner: Hashar)
[11:05:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-codfw
[11:05:52] <wikibugs>	 (Merged) jenkins-bot: Move test result table to a standalone plugin [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/853056 (https://phabricator.wikimedia.org/T319378) (owner: Hashar)
[11:05:58] <wikibugs>	 (Merged) jenkins-bot: Move custom CSS style to a standalone plugin [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/853057 (https://phabricator.wikimedia.org/T319378) (owner: Hashar)
[11:06:04] <wikibugs>	 (Merged) jenkins-bot: Move custom links to a standalone plugin [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/853058 (https://phabricator.wikimedia.org/T319378) (owner: Hashar)
[11:06:08] <wikibugs>	 (CR) Filippo Giunchedi: "See inline, LGTM overall" [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[11:07:23] <wikibugs>	 (CR) Btullis: [C: +1] eventgate: Fix canary release routing (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/854507 (owner: Clément Goubert)
[11:08:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P38798 and previous config saved to /var/cache/conftool/dbconfig/20221109-110819-ladsgroup.json
[11:08:36] <wikibugs>	 (Merged) jenkins-bot: CI: Ensure the wmf-stable helm repo exists [deployment-charts] - https://gerrit.wikimedia.org/r/854968 (owner: JMeybohm)
[11:09:03] <wikibugs>	 (CR) Gmodena: eventgate: Fix canary release routing (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/854507 (owner: Clément Goubert)
[11:09:11] <wikibugs>	 (CR) Hashar: "recheck after https://gerrit.wikimedia.org/r/c/integration/config/+/853305" [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/853306 (https://phabricator.wikimedia.org/T319378) (owner: Hashar)
[11:09:20] <wikibugs>	 (Merged) jenkins-bot: CI: only patch helmfile charts path if no version is pinned [deployment-charts] - https://gerrit.wikimedia.org/r/854943 (owner: Giuseppe Lavagetto)
[11:09:56] <wikibugs>	 (CR) Clément Goubert: eventgate: Fix canary release routing (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/854507 (owner: Clément Goubert)
[11:15:36] <wikibugs>	 (CR) Hnowlan: [C: +2] Make swift connector aware of cacert file [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/854563 (https://phabricator.wikimedia.org/T312104) (owner: Hnowlan)
[11:16:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P38799 and previous config saved to /var/cache/conftool/dbconfig/20221109-111601-ladsgroup.json
[11:16:31] <wikibugs>	 (CR) Hashar: "Some previous change has moved our Gerrit UI customizations to several ./plugins/*.js file and I felt like I could use eslint for them." [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/853306 (https://phabricator.wikimedia.org/T319378) (owner: Hashar)
[11:17:08] <wikibugs>	 (PS2) JMeybohm: cfssl-issuer: Move from single to multiple files for CRDs [deployment-charts] - https://gerrit.wikimedia.org/r/838135 (https://phabricator.wikimedia.org/T310486)
[11:17:10] <wikibugs>	 (PS2) JMeybohm: cfssl-issuer: Bump CRD chart version for cfssl-issuer update [deployment-charts] - https://gerrit.wikimedia.org/r/838136
[11:17:12] <wikibugs>	 (PS2) JMeybohm: cfssl-issuer: Bump version and fix dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/838137 (https://phabricator.wikimedia.org/T310486)
[11:18:01] <wikibugs>	 (PS4) JMeybohm: calico: More calico 3.23.3 additions. [deployment-charts] - https://gerrit.wikimedia.org/r/854520 (https://phabricator.wikimedia.org/T307943)
[11:19:39] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff)
[11:20:27] <wikibugs>	 (Merged) jenkins-bot: Make swift connector aware of cacert file [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/854563 (https://phabricator.wikimedia.org/T312104) (owner: Hnowlan)
[11:21:32] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1002.eqiad.wmnet
[11:23:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T322618)', diff saved to https://phabricator.wikimedia.org/P38800 and previous config saved to /var/cache/conftool/dbconfig/20221109-112326-ladsgroup.json
[11:23:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[11:23:30] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[11:23:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[11:23:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T322618)', diff saved to https://phabricator.wikimedia.org/P38801 and previous config saved to /var/cache/conftool/dbconfig/20221109-112347-ladsgroup.json
[11:25:58] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1002.eqiad.wmnet
[11:28:08] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-etcd1001.eqiad.wmnet
[11:28:22] <logmsgbot>	 !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host dse-k8s-etcd1001.eqiad.wmnet
[11:28:59] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-etcd1001.eqiad.wmnet
[11:29:51] <wikibugs>	 (Abandoned) Hnowlan: postgres: add option to enable replication slots [puppet] - https://gerrit.wikimedia.org/r/755959 (https://phabricator.wikimedia.org/T290149) (owner: Hnowlan)
[11:31:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T322618)', diff saved to https://phabricator.wikimedia.org/P38802 and previous config saved to /var/cache/conftool/dbconfig/20221109-113108-ladsgroup.json
[11:31:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[11:31:13] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[11:31:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[11:31:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[11:31:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-codfw
[11:31:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[11:31:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T322618)', diff saved to https://phabricator.wikimedia.org/P38803 and previous config saved to /var/cache/conftool/dbconfig/20221109-113144-ladsgroup.json
[11:32:40] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-etcd1001.eqiad.wmnet
[11:33:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-eqiad
[11:34:16] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1003.eqiad.wmnet
[11:34:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:36:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad
[11:38:41] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1003.eqiad.wmnet
[11:39:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T322618)', diff saved to https://phabricator.wikimedia.org/P38804 and previous config saved to /var/cache/conftool/dbconfig/20221109-113948-ladsgroup.json
[11:39:52] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[11:40:13] <wikibugs>	 (PS1) Alexandros Kosiaris: arclamp: Add role contact information [puppet] - https://gerrit.wikimedia.org/r/854985
[11:43:46] <wikibugs>	 (CR) Muehlenhoff: arclamp: Add role contact information (1 comment) [puppet] - https://gerrit.wikimedia.org/r/854985 (owner: Alexandros Kosiaris)
[11:44:31] <wikibugs>	 (PS2) Clément Goubert: deploy-mwdebug: also update releases files for all other deployments [puppet] - https://gerrit.wikimedia.org/r/854975 (owner: Giuseppe Lavagetto)
[11:44:44] <wikibugs>	 (PS1) Muehlenhoff: sre.maps.roll-restart: Also restart nginx [cookbooks] - https://gerrit.wikimedia.org/r/854987
[11:46:58] <wikibugs>	 (CR) Filippo Giunchedi: centrallog: add first prototype of webrequest-live with Benthos (3 comments) [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[11:48:16] <wikibugs>	 (CR) Filippo Giunchedi: "See inline. Also no file here has a newline at the end, which it should" [puppet] - https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[11:48:52] <wikibugs>	 (CR) Clément Goubert: [C: +2] eventgate: Fix canary release routing [deployment-charts] - https://gerrit.wikimedia.org/r/854507 (owner: Clément Goubert)
[11:49:12] <wikibugs>	 (CR) Filippo Giunchedi: "Also I forgot to add: there's no newline at the end of the files, though there should be I think" [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[11:50:59] <wikibugs>	 (CR) Filippo Giunchedi: centrallog: add first prototype of webrequest-live with Benthos (1 comment) [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[11:51:18] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:53:44] <wikibugs>	 (Merged) jenkins-bot: eventgate: Fix canary release routing [deployment-charts] - https://gerrit.wikimedia.org/r/854507 (owner: Clément Goubert)
[11:54:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P38805 and previous config saved to /var/cache/conftool/dbconfig/20221109-115454-ladsgroup.json
[11:56:03] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[11:56:45] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[11:57:54] <wikibugs>	 (CR) Hnowlan: [C: +1] sre.maps.roll-restart: Also restart nginx [cookbooks] - https://gerrit.wikimedia.org/r/854987 (owner: Muehlenhoff)
[12:00:07] <wikibugs>	 (CR) Muehlenhoff: [C: +2] sre.maps.roll-restart: Also restart nginx [cookbooks] - https://gerrit.wikimedia.org/r/854987 (owner: Muehlenhoff)
[12:01:12] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[12:03:04] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[12:03:50] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[12:05:02] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[12:07:00] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[12:10:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P38806 and previous config saved to /var/cache/conftool/dbconfig/20221109-121001-ladsgroup.json
[12:10:36] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[12:11:11] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply
[12:14:04] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh) The install of the updated trafficserver package for bullseye fails on a bullseye host with the message:  ` The following packages have unmet dependencies:  trafficserver : Depends:...
[12:14:18] <wikibugs>	 (PS1) Hnowlan: thumbor: version bump [deployment-charts] - https://gerrit.wikimedia.org/r/854990
[12:16:34] <wikibugs>	 (PS1) Muehlenhoff: Enable profile::auto_restarts::service for aphlict [puppet] - https://gerrit.wikimedia.org/r/854991 (https://phabricator.wikimedia.org/T135991)
[12:16:35] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@b83625a]: gerrit2002: Gerrit JavaScript plugins as standalone files # T319378
[12:16:38] <wikibugs>	 (PS1) Alexandros Kosiaris: DNM: utils: Add a role_team_stats.py script [puppet] - https://gerrit.wikimedia.org/r/854992
[12:16:39] <stashbot>	 T319378: Move Gerrit Javascript plugins from gerrit-theme.js to standalone files in the deploy repository - https://phabricator.wikimedia.org/T319378
[12:16:45] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@b83625a]: gerrit2002: Gerrit JavaScript plugins as standalone files # T319378 (duration: 00m 10s)
[12:17:27] <wikibugs>	 (CR) CI reject: [V: -1] DNM: utils: Add a role_team_stats.py script [puppet] - https://gerrit.wikimedia.org/r/854992 (owner: Alexandros Kosiaris)
[12:17:45] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[12:18:17] <wikibugs>	 (PS2) Alexandros Kosiaris: arclamp: Add role contact information [puppet] - https://gerrit.wikimedia.org/r/854985
[12:18:18] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[12:20:19] <wikibugs>	 (CR) Alexandros Kosiaris: arclamp: Add role contact information (1 comment) [puppet] - https://gerrit.wikimedia.org/r/854985 (owner: Alexandros Kosiaris)
[12:22:21] <wikibugs>	 (CR) Hnowlan: [C: +2] thumbor: version bump [deployment-charts] - https://gerrit.wikimedia.org/r/854990 (owner: Hnowlan)
[12:23:31] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/854555 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff)
[12:24:04] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:24:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T322618)', diff saved to https://phabricator.wikimedia.org/P38807 and previous config saved to /var/cache/conftool/dbconfig/20221109-122403-ladsgroup.json
[12:24:08] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[12:24:19] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/854575 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[12:24:37] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good! I think the same case can be made for Xenon (role::webperf::processors_and_site), but could also be a separate patch." [puppet] - https://gerrit.wikimedia.org/r/854985 (owner: Alexandros Kosiaris)
[12:24:53] <hashar>	 jouncebot: now
[12:24:53] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 35 minute(s)
[12:25:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T322618)', diff saved to https://phabricator.wikimedia.org/P38808 and previous config saved to /var/cache/conftool/dbconfig/20221109-122507-ladsgroup.json
[12:25:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance
[12:25:10] <hashar>	 I am going to deploy a change to Gerrit to move our JS plugins to standalone files
[12:25:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance
[12:25:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T322618)', diff saved to https://phabricator.wikimedia.org/P38809 and previous config saved to /var/cache/conftool/dbconfig/20221109-122528-ladsgroup.json
[12:25:52] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@b83625a]: gerrit1001: Gerrit JavaScript plugins as standalone files # T319378
[12:25:56] <stashbot>	 T319378: Move Gerrit Javascript plugins from gerrit-theme.js to standalone files in the deploy repository - https://phabricator.wikimedia.org/T319378
[12:26:00] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[12:26:01] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@b83625a]: gerrit1001: Gerrit JavaScript plugins as standalone files # T319378 (duration: 00m 09s)
[12:26:24] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[12:26:47] <wikibugs>	 (Merged) jenkins-bot: thumbor: version bump [deployment-charts] - https://gerrit.wikimedia.org/r/854990 (owner: Hnowlan)
[12:26:50] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/854961 (https://phabricator.wikimedia.org/T320696) (owner: Volans)
[12:28:02] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync
[12:28:22] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [software/netbox-extras] - https://gerrit.wikimedia.org/r/854969 (owner: Volans)
[12:28:58] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[12:29:28] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [software/netbox-extras] - https://gerrit.wikimedia.org/r/854970 (https://phabricator.wikimedia.org/T320696) (owner: Volans)
[12:30:49] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync
[12:31:04] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[12:31:27] <wikibugs>	 (PS2) Muehlenhoff: Add SPDX headers to various IF profiles [puppet] - https://gerrit.wikimedia.org/r/854575 (https://phabricator.wikimedia.org/T308013)
[12:33:17] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
[12:33:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T322618)', diff saved to https://phabricator.wikimedia.org/P38810 and previous config saved to /var/cache/conftool/dbconfig/20221109-123344-ladsgroup.json
[12:33:48] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[12:33:52] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
[12:39:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P38811 and previous config saved to /var/cache/conftool/dbconfig/20221109-123910-ladsgroup.json
[12:39:26] <wikibugs>	 (CR) Volans: "Some inline comments to simplify the code and make it a bit more modern ;)" [puppet] - https://gerrit.wikimedia.org/r/854992 (owner: Alexandros Kosiaris)
[12:42:05] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (MoritzMuehlenhoff) >>! In T321309#8382745, @ssingh wrote: > I am not aware of the reasons why we build with `BACKPORT=yes` but just to confirm that there are no other differences:  Probably...
[12:42:39] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[12:43:20] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[12:45:06] <wikibugs>	 (CR) Ayounsi: [C: +1] sre.hosts.reimage: set Netbox to active [cookbooks] - https://gerrit.wikimedia.org/r/854961 (https://phabricator.wikimedia.org/T320696) (owner: Volans)
[12:45:22] <wikibugs>	 (CR) Ayounsi: [C: +1] Remove obsolete tool script [software/netbox-extras] - https://gerrit.wikimedia.org/r/854969 (owner: Volans)
[12:48:31] <wikibugs>	 (CR) Ayounsi: [C: +1] Netbox statuses: no more servers in staged [software/netbox-extras] - https://gerrit.wikimedia.org/r/854970 (https://phabricator.wikimedia.org/T320696) (owner: Volans)
[12:48:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P38812 and previous config saved to /var/cache/conftool/dbconfig/20221109-124850-ladsgroup.json
[12:48:55] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 29169
[12:49:40] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh) >>! In T321309#8382869, @MoritzMuehlenhoff wrote: >>>! In T321309#8382745, @ssingh wrote: >> I am not aware of the reasons why we build with `BACKPORT=yes` but just to confirm that t...
[12:49:49] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 29169
[12:50:32] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply
[12:50:51] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 63199
[12:51:13] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
[12:54:14] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 63199
[12:54:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P38813 and previous config saved to /var/cache/conftool/dbconfig/20221109-125416-ladsgroup.json
[12:55:23] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add SPDX headers to various IF profiles [puppet] - https://gerrit.wikimedia.org/r/854575 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[12:55:27] <wikibugs>	 SRE, serviceops, Patch-For-Review, Performance-Team (Radar), Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (jijiki)
[12:57:28] <wikibugs>	 (CR) Clément Goubert: [C: +2] "All deployed, your canary releases now get traffic." [deployment-charts] - https://gerrit.wikimedia.org/r/854507 (owner: Clément Goubert)
[13:03:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P38814 and previous config saved to /var/cache/conftool/dbconfig/20221109-130357-ladsgroup.json
[13:05:54] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 169 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:07:52] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:09:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T322618)', diff saved to https://phabricator.wikimedia.org/P38815 and previous config saved to /var/cache/conftool/dbconfig/20221109-130923-ladsgroup.json
[13:09:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[13:09:28] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[13:09:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[13:09:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T322618)', diff saved to https://phabricator.wikimedia.org/P38816 and previous config saved to /var/cache/conftool/dbconfig/20221109-130944-ladsgroup.json
[13:13:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1004.eqiad.wmnet
[13:13:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T322618)', diff saved to https://phabricator.wikimedia.org/P38817 and previous config saved to /var/cache/conftool/dbconfig/20221109-131351-ladsgroup.json
[13:17:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1004.eqiad.wmnet
[13:18:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:19:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T322618)', diff saved to https://phabricator.wikimedia.org/P38818 and previous config saved to /var/cache/conftool/dbconfig/20221109-131903-ladsgroup.json
[13:19:08] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[13:22:06] <icinga-wm>	 PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:24:01] <moritzm>	 !log drain ganeti1013 for eventual reimage to bullseye T311687
[13:24:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:05] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[13:28:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P38820 and previous config saved to /var/cache/conftool/dbconfig/20221109-132858-ladsgroup.json
[13:44:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P38821 and previous config saved to /var/cache/conftool/dbconfig/20221109-134404-ladsgroup.json
[13:50:09] <wikibugs>	 (PS1) Filippo Giunchedi: clinic-duty: update Lumen regex and tests [software] - https://gerrit.wikimedia.org/r/854996
[13:51:40] <godog>	 if anyone is up for an easy one and/or has used ops-maint-gcal.js
[13:56:59] <wikibugs>	 (PS1) Muehlenhoff: Sync to 6.6.2 of the CAS overlay [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/854998
[13:58:33] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[13:59:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T322618)', diff saved to https://phabricator.wikimedia.org/P38822 and previous config saved to /var/cache/conftool/dbconfig/20221109-135911-ladsgroup.json
[13:59:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[13:59:15] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[13:59:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[13:59:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T322618)', diff saved to https://phabricator.wikimedia.org/P38823 and previous config saved to /var/cache/conftool/dbconfig/20221109-135943-ladsgroup.json
[14:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221109T1400).
[14:00:04] <jouncebot>	 WMDE-Fisch: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:23] <WMDE-Fisch>	 o/
[14:01:32] <WMDE-Fisch>	 I'm in a meeting. So if anyone can do the deployment for me, that would be nice :-)
[14:01:47] <Lucas_WMDE>	 o/
[14:01:52] <Lucas_WMDE>	 I can deploy, I think
[14:02:23] <Lucas_WMDE>	 WMDE-Fisch: would you be able to test your change on mwdebug?
[14:02:30] <WMDE-Fisch>	 Yes
[14:02:34] <Lucas_WMDE>	 ok
[14:02:42] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Enable show nearby feature for ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/854513 (https://phabricator.wikimedia.org/T321548) (owner: Svantje Lilienthal)
[14:02:45] <WMDE-Fisch>	 Prepared a tab for it ;-)
[14:03:04] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/854513 (https://phabricator.wikimedia.org/T321548) (owner: Svantje Lilienthal)
[14:03:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T322618)', diff saved to https://phabricator.wikimedia.org/P38824 and previous config saved to /var/cache/conftool/dbconfig/20221109-140351-ladsgroup.json
[14:03:52] <wikibugs>	 (Merged) jenkins-bot: Enable show nearby feature for ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/854513 (https://phabricator.wikimedia.org/T321548) (owner: Svantje Lilienthal)
[14:04:00] <icinga-wm>	 PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:04:06] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:854513|Enable show nearby feature for ruwiki (T321548)]]
[14:04:09] <stashbot>	 T321548: Deploy Show Nearby feature to ruwiki - https://phabricator.wikimedia.org/T321548
[14:04:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and lilients: Backport for [[gerrit:854513|Enable show nearby feature for ruwiki (T321548)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[14:04:36] <Lucas_WMDE>	 WMDE-Fisch: ^
[14:05:39] <WMDE-Fisch>	 Lucas_WMDE: Works like a charm. Thanks. Go on!
[14:05:45] <Lucas_WMDE>	 ok!
[14:06:20] <wikibugs>	 (PS2) Muehlenhoff: Sync to 6.6.2 of the CAS overlay [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/854998
[14:08:18] <wikibugs>	 (PS13) Elukey: Add a basic puppetization for Benthos [puppet] - https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981)
[14:08:20] <wikibugs>	 (PS15) Elukey: centrallog: add first prototype of webrequest-live with Benthos [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981)
[14:08:24] <wikibugs>	 (CR) Elukey: Add a basic puppetization for Benthos (6 comments) [puppet] - https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[14:08:28] <wikibugs>	 (CR) Elukey: centrallog: add first prototype of webrequest-live with Benthos (5 comments) [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[14:09:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:854513|Enable show nearby feature for ruwiki (T321548)]] (duration: 05m 42s)
[14:09:53] <stashbot>	 T321548: Deploy Show Nearby feature to ruwiki - https://phabricator.wikimedia.org/T321548
[14:09:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[14:10:07] <Lucas_WMDE>	 anything else to deploy?
[14:11:04] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:11:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[14:12:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[14:13:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[14:16:03] <wikibugs>	 (CR) Elukey: [C: +1] calico: More calico 3.23.3 additions. [deployment-charts] - https://gerrit.wikimedia.org/r/854520 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[14:18:32] <wikibugs>	 (CR) JMeybohm: [C: +2] calico: More calico 3.23.3 additions. [deployment-charts] - https://gerrit.wikimedia.org/r/854520 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[14:18:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P38825 and previous config saved to /var/cache/conftool/dbconfig/20221109-141858-ladsgroup.json
[14:21:19] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] vopsbot: always restart the service via systemd [puppet] - https://gerrit.wikimedia.org/r/853939 (owner: Giuseppe Lavagetto)
[14:22:26] <icinga-wm>	 PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_smokeping.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:38] <icinga-wm>	 RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:23:21] <wikibugs>	 (Merged) jenkins-bot: calico: More calico 3.23.3 additions. [deployment-charts] - https://gerrit.wikimedia.org/r/854520 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[14:23:30] <wikibugs>	 (PS2) Elukey: aptrepo: add new component for istio 1.15.3 in bullseye-wikimedia [puppet] - https://gerrit.wikimedia.org/r/854947 (https://phabricator.wikimedia.org/T322193)
[14:24:24] <wikibugs>	 (CR) JMeybohm: [C: +1] aptrepo: add new component for istio 1.15.3 in bullseye-wikimedia [puppet] - https://gerrit.wikimedia.org/r/854947 (https://phabricator.wikimedia.org/T322193) (owner: Elukey)
[14:25:52] <wikibugs>	 (PS1) Bking: elastic: finish decom of elastic2049 [puppet] - https://gerrit.wikimedia.org/r/855002 (https://phabricator.wikimedia.org/T313842)
[14:30:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance
[14:30:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance
[14:30:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T318605)', diff saved to https://phabricator.wikimedia.org/P38826 and previous config saved to /var/cache/conftool/dbconfig/20221109-143050-ladsgroup.json
[14:30:55] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[14:31:57] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM!" [software] - https://gerrit.wikimedia.org/r/854996 (owner: Filippo Giunchedi)
[14:32:39] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:32:44] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] clinic-duty: update Lumen regex and tests [software] - https://gerrit.wikimedia.org/r/854996 (owner: Filippo Giunchedi)
[14:33:23] <wikibugs>	 (PS3) Giuseppe Lavagetto: deploy-mwdebug: also update releases files for all other deployments [puppet] - https://gerrit.wikimedia.org/r/854975
[14:33:25] <wikibugs>	 (PS1) Giuseppe Lavagetto: kubernetes::deployment_server: add new mw releases [puppet] - https://gerrit.wikimedia.org/r/855003
[14:34:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P38827 and previous config saved to /var/cache/conftool/dbconfig/20221109-143404-ladsgroup.json
[14:35:40] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] deploy-mwdebug: also update releases files for all other deployments [puppet] - https://gerrit.wikimedia.org/r/854975 (owner: Giuseppe Lavagetto)
[14:36:21] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] kubernetes::deployment_server: add new mw releases [puppet] - https://gerrit.wikimedia.org/r/855003 (owner: Giuseppe Lavagetto)
[14:40:35] <moritzm>	 !log installing libxml2 security updates
[14:40:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:25] <sukhe>	 !log reprepro remove bullseye-wikimedia trafficserver: T321309
[14:43:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:29] <stashbot>	 T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[14:44:01] <icinga-wm>	 RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:44:26] <wikibugs>	 (CR) Elukey: [C: +2] aptrepo: add new component for istio 1.15.3 in bullseye-wikimedia [puppet] - https://gerrit.wikimedia.org/r/854947 (https://phabricator.wikimedia.org/T322193) (owner: Elukey)
[14:46:08] <urbanecm>	 !log Run `time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php --wiki=frwiki` in a tmux at mwmaint1002 (locally applied shorter MentorStore::INACTIVITY_THRESHOLD; T318457)
[14:46:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:12] <stashbot>	 T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457
[14:46:38] <wikibugs>	 (PS6) Eevans: Bootstrap new AQS Cassandra nodes (eqiad) [puppet] - https://gerrit.wikimedia.org/r/812426 (https://phabricator.wikimedia.org/T307802)
[14:49:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T322618)', diff saved to https://phabricator.wikimedia.org/P38828 and previous config saved to /var/cache/conftool/dbconfig/20221109-144912-ladsgroup.json
[14:49:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[14:49:17] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[14:49:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[14:49:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T322618)', diff saved to https://phabricator.wikimedia.org/P38829 and previous config saved to /var/cache/conftool/dbconfig/20221109-144933-ladsgroup.json
[14:50:33] <moritzm>	 !log rolling restart of mw canaries to pick up libxml security update
[14:50:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:05] <wikibugs>	 (CR) Btullis: [C: +1] "Looks good to me. I wonder why the pcc run didn't happen with check_experimental" [puppet] - https://gerrit.wikimedia.org/r/812426 (https://phabricator.wikimedia.org/T307802) (owner: Eevans)
[14:53:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T322618)', diff saved to https://phabricator.wikimedia.org/P38830 and previous config saved to /var/cache/conftool/dbconfig/20221109-145341-ladsgroup.json
[14:56:15] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:56:52] <wikibugs>	 (CR) Eevans: [C: +2] Bootstrap new AQS Cassandra nodes (eqiad) [puppet] - https://gerrit.wikimedia.org/r/812426 (https://phabricator.wikimedia.org/T307802) (owner: Eevans)
[14:57:21] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:57:22] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1002.eqiad.wmnet
[14:58:03] <wikibugs>	 (Abandoned) Bking: elastic: finish decom of elastic2049 [puppet] - https://gerrit.wikimedia.org/r/855002 (https://phabricator.wikimedia.org/T313842) (owner: Bking)
[14:59:24] <wikibugs>	 (CR) Ssingh: [V: +1] sslcert: refactor update-ocsp.py to Python 3 (3 comments) [puppet] - https://gerrit.wikimedia.org/r/854608 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[15:00:26] <wikibugs>	 (PS1) Bking: elastic: finish decom of elastic2049 [puppet] - https://gerrit.wikimedia.org/r/855004 (https://phabricator.wikimedia.org/T313842)
[15:02:04] <moritzm>	 !log installing pixman security updates on buster
[15:02:28] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1101.eqiad.wmnet
[15:02:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:42] <logmsgbot>	 !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1101.eqiad.wmnet
[15:03:31] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1101.eqiad.wmnet
[15:03:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T318605)', diff saved to https://phabricator.wikimedia.org/P38831 and previous config saved to /var/cache/conftool/dbconfig/20221109-150351-ladsgroup.json
[15:03:57] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[15:04:19] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1002.eqiad.wmnet
[15:04:41] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:04:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:07:15] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 80 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:08:34] <sukhe>	 uhm
[15:08:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P38832 and previous config saved to /var/cache/conftool/dbconfig/20221109-150848-ladsgroup.json
[15:10:28] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch2001.codfw.wmnet
[15:10:42] <logmsgbot>	 !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kafka-stretch2001.codfw.wmnet
[15:11:01] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch2001.codfw.wmnet
[15:12:43] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1101.eqiad.wmnet
[15:15:03] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 80 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:15:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:15:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:15:39] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:16:12] <sukhe>	 hello
[15:16:26] <vgutierrez>	 hmm port 80 is varnish
[15:16:27] <XioNoX>	 looking
[15:16:41] <sukhe>	 so the earlier failure on cp1075 must be related
[15:16:44] <sukhe>	 I am ACKing it at least
[15:16:49] <XioNoX>	 and v6 only
[15:18:35] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch2001.codfw.wmnet
[15:18:52] <sukhe>	 seems to be recovering?
[15:18:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P38833 and previous config saved to /var/cache/conftool/dbconfig/20221109-151858-ladsgroup.json
[15:18:59] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 80 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[15:19:08] <XioNoX>	 different host
[15:19:15] <sukhe>	 yep
[15:19:17] <sukhe>	 earlier it was 1075
[15:20:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service text:80 has failed probes (http_text_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:20:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:20:22] <vgutierrez>	 hmm let's check if pybal is having issues with those hosts too
[15:20:49] <vgutierrez>	 Nov  9 15:20:39 lvs1017 pybal[15357]: [testlb_80 IdleConnection] WARN: cp1079.eqiad.wmnet (enabled/down/not pooled): Connection to 10.64.16.22:80 failed
[15:20:50] <effie>	 here too
[15:20:50] <vgutierrez>	 yeah
[15:20:54] <vgutierrez>	 varnish is failing for some reason
[15:21:09] <jynus>	 but it is text on port 80
[15:21:19] <vgutierrez>	 jynus: and?
[15:21:45] <jynus>	 I would expect port 443 to fail earlier?
[15:22:15] <vgutierrez>	 different request path
[15:22:43] <vgutierrez>	 443 is failing too BTW
[15:22:49] <vgutierrez>	 and that's expected as soon as varnish crashes :)
[15:23:03] <XioNoX>	 not sure if relevant at this point but there is a small spike of traffic to port 80 in eqiad https://w.wiki/5wG5
[15:23:25] <arnoldokoth>	 o/ here as well
[15:23:36] <akosiaris>	 here as well
[15:23:49] <akosiaris>	 so, this is just port 80 ? 
[15:23:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P38834 and previous config saved to /var/cache/conftool/dbconfig/20221109-152354-ladsgroup.json
[15:23:57] <akosiaris>	 so only the redirect to 443 ? 
[15:24:04] <jynus>	 https://grafana.wikimedia.org/goto/9RGWNJvVk?orgId=1 https://grafana.wikimedia.org/goto/a7aZHJvVz?orgId=1
[15:24:07] <akosiaris>	 ah, no just read backlog
[15:24:14] <vgutierrez>	 nope, not just port 443
[15:24:19] <vgutierrez>	 *80
[15:24:20] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1100.eqiad.wmnet
[15:24:59] <vgutierrez>	 pybal is also flagging port 443 in IPv4 && IPv6 down
[15:25:03] <vgutierrez>	 Nov 09 15:24:27 lvs1017 pybal[15357]: [textlb_443 ProxyFetch] WARN: cp1079.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed (https://healthcheck.wikimedia.org/varnish-fe), 5.001 s
[15:25:04] <vgutierrez>	 Nov 09 15:24:32 lvs1017 pybal[15357]: [textlb6_443 ProxyFetch] WARN: cp1079.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed (https://healthcheck.wikimedia.org/varnish-fe), 5.001 s
[15:25:06] <jynus>	 cp1079 doesn't seem to be getting traffic back
[15:25:28] <jynus>	 cp1075 however got back to normal bandwidth
[15:29:33] <icinga-wm>	 PROBLEM - Check systemd state on aqs1016 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:31:21] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1100.eqiad.wmnet
[15:32:32] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1099.eqiad.wmnet
[15:33:16] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 80 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[15:33:25] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.131.15:9042 on aqs1020 is CRITICAL: connect to address 10.64.131.15 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:33:25] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.32.22:9042 on aqs1018 is CRITICAL: connect to address 10.64.32.22 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:33:26] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.119:7001 on aqs1019 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:33:53] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch2002.codfw.wmnet
[15:34:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P38835 and previous config saved to /var/cache/conftool/dbconfig/20221109-153405-ladsgroup.json
[15:34:30] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[15:34:35] <icinga-wm>	 RECOVERY - Check systemd state on aqs1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:35:47] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.32.22:7001 on aqs1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:35:47] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.131.15:7001 on aqs1020 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:38:23] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.48.122:9042 on aqs1019 is CRITICAL: connect to address 10.64.48.122 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:39:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T322618)', diff saved to https://phabricator.wikimedia.org/P38836 and previous config saved to /var/cache/conftool/dbconfig/20221109-153901-ladsgroup.json
[15:39:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[15:39:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[15:39:21] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[15:39:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T322618)', diff saved to https://phabricator.wikimedia.org/P38837 and previous config saved to /var/cache/conftool/dbconfig/20221109-153922-ladsgroup.json
[15:40:50] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch2002.codfw.wmnet
[15:40:59] <icinga-wm>	 PROBLEM - AQS root url on aqs1021 is CRITICAL: connect to address 10.64.135.7 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:40:59] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.16.74:9042 on aqs1017 is CRITICAL: connect to address 10.64.16.74 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:40:59] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.32.31:9042 on aqs1018 is CRITICAL: connect to address 10.64.32.31 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:40:59] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.48.122:7001 on aqs1019 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:41:29] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1099.eqiad.wmnet
[15:42:03] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1001.eqiad.wmnet
[15:42:09] <icinga-wm>	 PROBLEM - cassandra-b service on aqs1020 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:42:15] <effie>	 vgutierrez: is everything alright from your end ?
[15:43:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T322618)', diff saved to https://phabricator.wikimedia.org/P38838 and previous config saved to /var/cache/conftool/dbconfig/20221109-154330-ladsgroup.json
[15:43:35] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.32.31:7001 on aqs1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:43:35] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.16.74:7001 on aqs1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:43:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "I figured out the prometheus metric prefix, the rest LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey)
[15:43:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey)
[15:44:40] <_joe_>	 btullis: any idea what's going on with cassandra on aqs?
[15:44:44] <_joe_>	 or urandom 
[15:44:54] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[15:45:06] <wikibugs>	 (03PS1) 10JMeybohm: calico: Allow calico-cni access to ipreservations [deployment-charts] - 10https://gerrit.wikimedia.org/r/855011 (https://phabricator.wikimedia.org/T307943)
[15:45:11] <elukey>	 _joe_ I think that Eric is bootstrapping new nodes
[15:45:16] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] cfssl-issuer: Move from single to multiple files for CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/838135 (https://phabricator.wikimedia.org/T310486) (owner: 10JMeybohm)
[15:45:18] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] cfssl-issuer: Bump CRD chart version for cfssl-issuer update [deployment-charts] - 10https://gerrit.wikimedia.org/r/838136 (owner: 10JMeybohm)
[15:45:22] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] cfssl-issuer: Bump version and fix dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/838137 (https://phabricator.wikimedia.org/T310486) (owner: 10JMeybohm)
[15:45:26] <_joe_>	 ah ok
[15:46:05] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.0.199:9042 on aqs1016 is CRITICAL: connect to address 10.64.0.199 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:46:05] <icinga-wm>	 PROBLEM - AQS root url on aqs1020 is CRITICAL: connect to address 10.64.131.7 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:46:06] <icinga-wm>	 PROBLEM - cassandra-a service on aqs1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:47:07] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[15:47:15] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:47:22] <effie>	 nvm
[15:47:25] <urandom>	 _joe_: note-to-self, downtime hosts when bootstrapping new nodes
[15:47:29] <urandom>	 (sorry...)
[15:47:46] <wikibugs>	 (03PS1) 10JMeybohm: calico: Allow different versions, drop pre bullseye support [puppet] - 10https://gerrit.wikimedia.org/r/855012 (https://phabricator.wikimedia.org/T307943)
[15:47:51] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1001.eqiad.wmnet
[15:48:43] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.16.78:9042 on aqs1017 is CRITICAL: connect to address 10.64.16.78 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:49:06] <icinga-wm>	 PROBLEM - cassandra-a service on aqs1019 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:49:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T318605)', diff saved to https://phabricator.wikimedia.org/P38839 and previous config saved to /var/cache/conftool/dbconfig/20221109-154911-ladsgroup.json
[15:49:13] <icinga-wm>	 PROBLEM - cassandra-b service on aqs1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:49:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[15:49:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[15:49:29] <icinga-wm>	 PROBLEM - cassandra-a service on aqs1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:49:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T318605)', diff saved to https://phabricator.wikimedia.org/P38840 and previous config saved to /var/cache/conftool/dbconfig/20221109-154933-ladsgroup.json
[15:49:35] <icinga-wm>	 PROBLEM - cassandra-b service on aqs1019 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:50:15] <wikibugs>	 (03Merged) 10jenkins-bot: cfssl-issuer: Move from single to multiple files for CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/838135 (https://phabricator.wikimedia.org/T310486) (owner: 10JMeybohm)
[15:50:17] <wikibugs>	 (03Merged) 10jenkins-bot: cfssl-issuer: Bump CRD chart version for cfssl-issuer update [deployment-charts] - 10https://gerrit.wikimedia.org/r/838136 (owner: 10JMeybohm)
[15:50:36] <wikibugs>	 (03Merged) 10jenkins-bot: cfssl-issuer: Bump version and fix dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/838137 (https://phabricator.wikimedia.org/T310486) (owner: 10JMeybohm)
[15:51:17] <icinga-wm>	 PROBLEM - AQS root url on aqs1019 is CRITICAL: connect to address 10.64.48.147 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:51:17] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.135.14:9042 on aqs1021 is CRITICAL: connect to address 10.64.135.14 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:51:17] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.16.78:7001 on aqs1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:51:34] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[15:52:19] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:53:49] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.0.213:9042 on aqs1016 is CRITICAL: connect to address 10.64.0.213 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:53:51] <icinga-wm>	 PROBLEM - cassandra-b service on aqs1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:53:51] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.135.14:7001 on aqs1021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:55:55] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] calico: Allow calico-cni access to ipreservations [deployment-charts] - 10https://gerrit.wikimedia.org/r/855011 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[15:56:21] <icinga-wm>	 PROBLEM - AQS root url on aqs1018 is CRITICAL: connect to address 10.64.32.185 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:56:21] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.0.213:7001 on aqs1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:56:21] <icinga-wm>	 PROBLEM - cassandra-a service on aqs1021 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:57:32] <wikibugs>	 (03CR) 10Ahmon Dancy: "Amir, I'm looking for a +1 from you." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: 10Ahmon Dancy)
[15:58:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P38841 and previous config saved to /var/cache/conftool/dbconfig/20221109-155836-ladsgroup.json
[15:58:55] <icinga-wm>	 PROBLEM - cassandra-b service on aqs1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:58:56] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.131.14:9042 on aqs1020 is CRITICAL: connect to address 10.64.131.14 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:58:56] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.135.15:9042 on aqs1021 is CRITICAL: connect to address 10.64.135.15 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[16:00:56] <wikibugs>	 (03Merged) 10jenkins-bot: calico: Allow calico-cni access to ipreservations [deployment-charts] - 10https://gerrit.wikimedia.org/r/855011 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[16:01:26] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:01:27] <icinga-wm>	 PROBLEM - AQS root url on aqs1017 is CRITICAL: connect to address 10.64.16.75 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[16:01:27] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.135.15:7001 on aqs1021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[16:01:27] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.131.14:7001 on aqs1020 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[16:03:55] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.119:9042 on aqs1019 is CRITICAL: connect to address 10.64.48.119 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[16:03:56] <icinga-wm>	 PROBLEM - cassandra-a service on aqs1020 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:03:56] <icinga-wm>	 PROBLEM - cassandra-b service on aqs1021 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:05:25] <icinga-wm>	 RECOVERY - AQS root url on aqs1017 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[16:05:55] <icinga-wm>	 RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:08:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (10RobH)
[16:08:04] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Remove obsolete tool script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854969 (owner: 10Volans)
[16:08:07] <icinga-wm>	 RECOVERY - AQS root url on aqs1018 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[16:08:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (10RobH)
[16:08:15] <wikibugs>	 (03PS2) 10Phuedx: EditAttemptStep sampling rate to 1 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854570 (https://phabricator.wikimedia.org/T312016)
[16:09:11] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:09:14] <wikibugs>	 (03Merged) 10jenkins-bot: Remove obsolete tool script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854969 (owner: 10Volans)
[16:10:17] <icinga-wm>	 RECOVERY - AQS root url on aqs1019 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[16:11:43] <icinga-wm>	 RECOVERY - AQS root url on aqs1020 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[16:12:17] <icinga-wm>	 RECOVERY - AQS root url on aqs1021 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[16:13:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P38842 and previous config saved to /var/cache/conftool/dbconfig/20221109-161343-ladsgroup.json
[16:13:59] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.199:9042 on aqs1016 is CRITICAL: connect to address 10.64.0.199 and port 9042: Connection refused eevans Bootstrapping new Cassandra nodes (T307802). https://phabricator.wikimedia.org/T93886
[16:13:59] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-b CQL 10.64.0.213:9042 on aqs1016 is CRITICAL: connect to address 10.64.0.213 and port 9042: Connection refused eevans Bootstrapping new Cassandra nodes (T307802). https://phabricator.wikimedia.org/T93886
[16:13:59] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-b SSL 10.64.0.213:7001 on aqs1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Bootstrapping new Cassandra nodes (T307802). https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[16:13:59] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-b service on aqs1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive eevans Bootstrapping new Cassandra nodes (T307802). https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:13:59] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-a CQL 10.64.16.74:9042 on aqs1017 is CRITICAL: connect to address 10.64.16.74 and port 9042: Connection refused eevans Bootstrapping new Cassandra nodes (T307802). https://phabricator.wikimedia.org/T93886
[16:14:00] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-a SSL 10.64.16.74:7001 on aqs1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Bootstrapping new Cassandra nodes (T307802). https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[16:14:00] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-a service on aqs1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive eevans Bootstrapping new Cassandra nodes (T307802). https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:15:52] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] Only Enable LBFactory config callback in CLI in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: 10Ahmon Dancy)
[16:16:20] <wikibugs>	 (03CR) 10Ahmon Dancy: "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: 10Ahmon Dancy)
[16:19:02] <wikibugs>	 (03PS1) 10Jbond: puppet_compiler: bump to 4.2.2 [puppet] - 10https://gerrit.wikimedia.org/r/855020
[16:19:34] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump to 4.2.2 [puppet] - 10https://gerrit.wikimedia.org/r/855020 (owner: 10Jbond)
[16:24:34] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38048/console" [puppet] - 10https://gerrit.wikimedia.org/r/855012 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[16:28:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T322618)', diff saved to https://phabricator.wikimedia.org/P38843 and previous config saved to /var/cache/conftool/dbconfig/20221109-162849-ladsgroup.json
[16:28:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[16:29:00] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[16:29:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[16:34:11] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: log according to the configured level [deployment-charts] - 10https://gerrit.wikimedia.org/r/855024 (https://phabricator.wikimedia.org/T233196)
[16:35:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] thumbor: log according to the configured level [deployment-charts] - 10https://gerrit.wikimedia.org/r/855024 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[16:37:32] <wikibugs>	 (03PS1) 10Jbond: puppet_compiler: allow for storing pson files with gzip [puppet] - 10https://gerrit.wikimedia.org/r/855025
[16:37:34] <wikibugs>	 (03PS2) 10Hnowlan: thumbor: log according to the configured level [deployment-charts] - 10https://gerrit.wikimedia.org/r/855024 (https://phabricator.wikimedia.org/T233196)
[16:38:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet_compiler: allow for storing pson files with gzip [puppet] - 10https://gerrit.wikimedia.org/r/855025 (owner: 10Jbond)
[16:40:11] <wikibugs>	 (03PS1) 10Volans: doc: removed STAGED status from Netbox diagram [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/855026 (https://phabricator.wikimedia.org/T320696)
[16:41:27] <wikibugs>	 (03PS2) 10Volans: sre.hosts.reimage: set Netbox to active [cookbooks] - 10https://gerrit.wikimedia.org/r/854961 (https://phabricator.wikimedia.org/T320696)
[16:42:19] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Netbox statuses: no more servers in staged [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854970 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans)
[16:43:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET serverlessservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:44:06] <wikibugs>	 (03CR) 10Volans: [V: 03+2 C: 03+2] "Image uploaded to wikitech, self-merging." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/855026 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans)
[16:44:09] <wikibugs>	 (03Merged) 10jenkins-bot: Netbox statuses: no more servers in staged [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/854970 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans)
[16:44:48] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (10Htriedman) @fgiunchedi the expiry dates from other @tmlt.io folks are correct!  With regards to NDA and final approval: I don't have access to the N...
[16:48:07] <wikibugs>	 (03CR) 10Vlad.shapik: [C: 03+1] thumbor: log according to the configured level [deployment-charts] - 10https://gerrit.wikimedia.org/r/855024 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[16:48:50] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (10Htriedman) @Dzahn @KFrancis Yes, that is correct, @dasm is a Tumult Labs contractor working with us on differential privacy.  With regards to final approval, @Jcross is the app...
[16:48:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:50:09] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: set Netbox to active [cookbooks] - 10https://gerrit.wikimedia.org/r/854961 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans)
[16:52:36] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 8218
[16:52:57] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] netbox: restore 1D TTL on the dyna CNAME [dns] - 10https://gerrit.wikimedia.org/r/854945 (https://phabricator.wikimedia.org/T322700) (owner: 10Volans)
[16:53:17] <wikibugs>	 (03CR) 10Volans: [C: 03+2] netbox: restore 1D TTL on the dyna CNAME [dns] - 10https://gerrit.wikimedia.org/r/854945 (https://phabricator.wikimedia.org/T322700) (owner: 10Volans)
[16:54:35] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] thumbor: log according to the configured level [deployment-charts] - 10https://gerrit.wikimedia.org/r/855024 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[16:54:45] <wikibugs>	 (03CR) 10Volans: [C: 03+2] dns: silence log for decommissioned devices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/852806 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans)
[16:55:34] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8218
[16:59:09] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.reimage: set Netbox to active [cookbooks] - 10https://gerrit.wikimedia.org/r/854961 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans)
[16:59:11] <wikibugs>	 (03Merged) 10jenkins-bot: dns: silence log for decommissioned devices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/852806 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans)
[16:59:45] <wikibugs>	 (03PS1) 10Daniel Kinzler: mediawiki.org: set VE to new direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855029
[17:00:51] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: log according to the configured level [deployment-charts] - 10https://gerrit.wikimedia.org/r/855024 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[17:03:10] <wikibugs>	 (03PS2) 10Vgutierrez: ncredir: Add wikimediaenterprise.com rewrite rule [puppet] - 10https://gerrit.wikimedia.org/r/852202 (https://phabricator.wikimedia.org/T321804)
[17:04:01] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:04:26] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] ncredir: Add wikimediaenterprise.com rewrite rule [puppet] - 10https://gerrit.wikimedia.org/r/852202 (https://phabricator.wikimedia.org/T321804) (owner: 10Vgutierrez)
[17:08:12] <wikibugs>	 (03PS2) 10JHathaway: aux-k8s: add BGP config for calico [homer/public] - 10https://gerrit.wikimedia.org/r/854110 (https://phabricator.wikimedia.org/T321120)
[17:09:00] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync
[17:09:15] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[17:09:16] <wikibugs>	 (03CR) 10Herron: dispatch: sync user role and info from LDAP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/852992 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi)
[17:10:39] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Enterprise redirect for wikimediaenterprise.com to enterprise.wikimedia.com - https://phabricator.wikimedia.org/T321804 (10Vgutierrez) 05Stalled→03Resolved a:03Vgutierrez `vgutierrez@ncredir6001:~$ curl -L -I http://wikimediaenterprise.com  HTTP/1.1 301 Moved Perma...
[17:10:47] <wikibugs>	 (03CR) 10JHathaway: "pushed a new patch, thanks" [homer/public] - 10https://gerrit.wikimedia.org/r/854110 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway)
[17:12:25] <wikibugs>	 (03CR) 10JHathaway: aux-k8s: add BGP config for calico (035 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/854110 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway)
[17:14:51] <wikibugs>	 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10Vgutierrez) >>! In T188561#8381427, @EWilfong_WMF wrote: > Thanks for the feedback and requirements documentation, @Vgutierrez.  Acoustic,...
[17:16:43] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync
[17:17:02] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[17:19:00] <wikibugs>	 (03PS1) 10Hnowlan: Decode poolcounter messages [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104)
[17:28:53] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync
[17:29:10] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[17:31:23] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw2002-dev: move to a single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/855034 (https://phabricator.wikimedia.org/T319184)
[17:34:59] <wikibugs>	 (03PS2) 10Hnowlan: Decode poolcounter messages, fix 429 error [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104)
[17:36:55] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:37:32] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Converted existing STAGED hosts to ACTIVE - volans@cumin1001 - T320696"
[17:38:07] <stashbot>	 T320696: Reduce the count of Netbox devices with incorrect status - https://phabricator.wikimedia.org/T320696
[17:38:16] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "hey @Cathal could you please update the switch port setup for cloudgw2002-dev (https://netbox.wikimedia.org/dcim/devices/3026/interfaces/)" [puppet] - 10https://gerrit.wikimedia.org/r/855034 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[17:39:44] <wikibugs>	 10SRE, 10API Platform, 10serviceops: Block non-browser requests that use generic user agent (UA) headers - https://phabricator.wikimedia.org/T319423 (10daniel)
[17:40:13] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Converted existing STAGED hosts to ACTIVE - volans@cumin1001 - T320696"
[17:41:27] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:43:29] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:43:35] <inflatador>	 hmmm
[17:44:16] <wikibugs>	 10SRE, 10ops-esams, 10DC-Ops, 10Infrastructure-Foundations, 10decommission-hardware: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10Volans) I think this is still pending, and triggers a warning in the sre.dns.netbox cookbook because has IPs with DNS Names but the host is in decom...
[17:45:15] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye
[17:46:44] <wikibugs>	 (03CR) 10Vlad.shapik: [C: 03+1] "You are completely right." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan)
[17:47:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero)
[17:48:39] <wikibugs>	 (03PS1) 10JHathaway: aux-k8s: fix typo in comment [puppet] - 10https://gerrit.wikimedia.org/r/855038
[17:48:41] <wikibugs>	 (03PS1) 10JHathaway: aux-k8s: use default cni-config [puppet] - 10https://gerrit.wikimedia.org/r/855039 (https://phabricator.wikimedia.org/T321120)
[17:49:27] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:50:42] <wikibugs>	 (03PS1) 10CDanis: Add 'block80' terms [homer/public] - 10https://gerrit.wikimedia.org/r/855040 (https://phabricator.wikimedia.org/T322774)
[17:51:26] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:51:57] <icinga-wm>	 PROBLEM - Host labstore1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:52:19] <wikibugs>	 (03PS2) 10CDanis: Add 'block80' terms [homer/public] - 10https://gerrit.wikimedia.org/r/855040 (https://phabricator.wikimedia.org/T322774)
[17:52:21] <wikibugs>	 (03PS1) 10Nray: Enable VectorVisualEnhancementsNext flag on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855041 (https://phabricator.wikimedia.org/T322673)
[17:53:15] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] aux-k8s: fix typo in comment [puppet] - 10https://gerrit.wikimedia.org/r/855038 (owner: 10JHathaway)
[17:53:24] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] aux-k8s: use default cni-config [puppet] - 10https://gerrit.wikimedia.org/r/855039 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway)
[17:53:26] <wikibugs>	 (03PS3) 10CDanis: Add 'block80' terms [homer/public] - 10https://gerrit.wikimedia.org/r/855040 (https://phabricator.wikimedia.org/T322774)
[17:54:14] <wikibugs>	 (03PS2) 10Nray: Enable VectorVisualEnhancementsNext flag on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855041 (https://phabricator.wikimedia.org/T322673)
[17:55:39] <wikibugs>	 10SRE, 10API Platform, 10serviceops: Block non-browser requests that use generic user agent (UA) headers - https://phabricator.wikimedia.org/T319423 (10daniel) We have rate limits in place for some generic UA strings:  https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/m...
[17:55:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wcqs1003.eqiad.wmnet with reason: data reload
[17:56:03] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudvirt2002-dev: move to a single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/855042 (https://phabricator.wikimedia.org/T319184)
[17:56:05] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudvirt2003-dev: move to a single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/855043 (https://phabricator.wikimedia.org/T319184)
[17:56:07] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: hiera: cleanup per-host network overrides [puppet] - 10https://gerrit.wikimedia.org/r/855044
[17:56:26] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wcqs1003.eqiad.wmnet with reason: data reload
[17:57:05] <wikibugs>	 (03CR) 10Nray: [C: 04-1] "blah typo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855041 (https://phabricator.wikimedia.org/T322673) (owner: 10Nray)
[17:57:23] <wikibugs>	 10SRE, 10serviceops: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10Dzahn) per T316223#8381863 serviceops-core is taking this over
[17:57:34] <wikibugs>	 SRE, serviceops: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (Dzahn) Stalled→Open
[17:57:38] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Dzahn)
[17:57:48] <wikibugs>	 (03PS3) 10Nray: Enable VectorVisualEnhancementsNext flag on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855041 (https://phabricator.wikimedia.org/T322673)
[17:57:52] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Dzahn)
[17:57:58] <wikibugs>	 SRE, serviceops: service implementation tracking: arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319429 (Dzahn) Stalled→Open per T316223#8381863 serviceops-core is taking this over
[17:58:33] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[17:59:03] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "hey @cathal, please review the switch config for this host at https://netbox.wikimedia.org/dcim/devices/2069/interfaces/ and +1 when done " [puppet] - 10https://gerrit.wikimedia.org/r/855042 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[17:59:55] <wikibugs>	 (03PS1) 10CDanis: Add block80 [homer/mock-private] - 10https://gerrit.wikimedia.org/r/855047 (https://phabricator.wikimedia.org/T322774)
[17:59:59] <wikibugs>	 (03PS1) 10Volans: reports: Network ignore empty DNS names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/855048 (https://phabricator.wikimedia.org/T320721)
[18:00:05] <wikibugs>	 (03PS4) 10CDanis: Add 'block80' terms [homer/public] - 10https://gerrit.wikimedia.org/r/855040 (https://phabricator.wikimedia.org/T322774)
[18:00:52] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "hey @cathal, please configure this host for single NIC https://netbox.wikimedia.org/dcim/devices/2070/interfaces/ and then +1 this patch. " [puppet] - 10https://gerrit.wikimedia.org/r/855043 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[18:00:54] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [homer/mock-private] - 10https://gerrit.wikimedia.org/r/855047 (https://phabricator.wikimedia.org/T322774) (owner: 10CDanis)
[18:02:09] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] Add block80 [homer/mock-private] - 10https://gerrit.wikimedia.org/r/855047 (https://phabricator.wikimedia.org/T322774) (owner: 10CDanis)
[18:02:41] <wikibugs>	 (03Merged) 10jenkins-bot: Add block80 [homer/mock-private] - 10https://gerrit.wikimedia.org/r/855047 (https://phabricator.wikimedia.org/T322774) (owner: 10CDanis)
[18:04:45] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (Dzahn) a:Htriedman→Jcross @Htriedman Thank you:) confirmed and sounds good. Let me reassign the ticket accordingly. Best, Daniel
[18:07:49] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-usersfor David.pujol - https://phabricator.wikimedia.org/T322670 (10Dzahn) @fgiunchedi I asked the same about NDA coverage for Tumult Labs on T322591 and Katie replied at T322591#8377758. Looks like this is covered....
[18:08:01] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:09:49] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.236 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:10:16] <wikibugs>	 (03PS5) 10CDanis: Add 'block80' terms [homer/public] - 10https://gerrit.wikimedia.org/r/855040 (https://phabricator.wikimedia.org/T322774)
[18:10:27] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[18:10:34] <wikibugs>	 (03PS9) 10BCornwall: prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815)
[18:11:20] <wikibugs>	 (03CR) 10Vlad.shapik: Decode poolcounter messages, fix 429 error [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan)
[18:14:44] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Add 'block80' terms [homer/public] - 10https://gerrit.wikimedia.org/r/855040 (https://phabricator.wikimedia.org/T322774) (owner: 10CDanis)
[18:15:19] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] Add 'block80' terms [homer/public] - 10https://gerrit.wikimedia.org/r/855040 (https://phabricator.wikimedia.org/T322774) (owner: 10CDanis)
[18:15:57] <wikibugs>	 (03Merged) 10jenkins-bot: Add 'block80' terms [homer/public] - 10https://gerrit.wikimedia.org/r/855040 (https://phabricator.wikimedia.org/T322774) (owner: 10CDanis)
[18:17:01] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] prometheus: Add ats header/body size total metrics [puppet] - 10https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) (owner: 10BCornwall)
[18:17:05] <wikibugs>	 (03PS7) 10BCornwall: prometheus: Add ats header/body size total metrics [puppet] - 10https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304)
[18:18:48] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "Tested on netbox-next, merging to clear some errors in the report, I'll address any later comment in a subsequent patch." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/855048 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans)
[18:19:42] <wikibugs>	 (03Merged) 10jenkins-bot: reports: Network ignore empty DNS names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/855048 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans)
[18:19:44] <logmsgbot>	 !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1023.eqiad.wmnet with OS bullseye
[18:20:20] <wikibugs>	 (03PS1) 10CDanis: fix missing colon [homer/public] - 10https://gerrit.wikimedia.org/r/855051
[18:20:30] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye
[18:20:48] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] fix missing colon [homer/public] - 10https://gerrit.wikimedia.org/r/855051 (owner: 10CDanis)
[18:21:24] <wikibugs>	 (03Merged) 10jenkins-bot: fix missing colon [homer/public] - 10https://gerrit.wikimedia.org/r/855051 (owner: 10CDanis)
[18:23:26] <wikibugs>	 (03CR) 10Herron: [C: 03+1] netmon: Open LibreNMS port for netmon2002. [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse)
[18:24:30] <wikibugs>	 (CR) Vlad.shapik: Decode poolcounter messages, fix 429 error (3 comments) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) (owner: Hnowlan)
[18:24:38] <wikibugs>	 (03CR) 10Vlad.shapik: [C: 04-1] Decode poolcounter messages, fix 429 error [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan)
[18:25:37] <wikibugs>	 (03PS1) 10CDanis: need to specify tcp protocol? [homer/public] - 10https://gerrit.wikimedia.org/r/855052
[18:25:51] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01186 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[18:26:10] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] need to specify tcp protocol? [homer/public] - 10https://gerrit.wikimedia.org/r/855052 (owner: 10CDanis)
[18:26:46] <wikibugs>	 (03Merged) 10jenkins-bot: need to specify tcp protocol? [homer/public] - 10https://gerrit.wikimedia.org/r/855052 (owner: 10CDanis)
[18:28:26] <wikibugs>	 (CR) Vlad.shapik: [C: -1] Decode poolcounter messages, fix 429 error (1 comment) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) (owner: Hnowlan)
[18:32:11] <wikibugs>	 (03PS1) 10BCornwall: Revert "prometheus: Rename ats_ metrics to trafficserver_" [puppet] - 10https://gerrit.wikimedia.org/r/854605
[18:33:13] <logmsgbot>	 !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1023.eqiad.wmnet with OS bullseye
[18:33:15] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[18:34:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "prometheus: Rename ats_ metrics to trafficserver_" [puppet] - 10https://gerrit.wikimedia.org/r/854605 (owner: 10BCornwall)
[18:34:17] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[18:35:13] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye
[18:38:41] <wikibugs>	 (03PS2) 10BCornwall: Revert "prometheus: Rename ats_ metrics to trafficserver_" [puppet] - 10https://gerrit.wikimedia.org/r/854605
[18:40:27] <logmsgbot>	 !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1023.eqiad.wmnet with OS bullseye
[18:41:03] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye
[18:43:03] <jinxer-wm>	 (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[18:45:37] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] Revert "prometheus: Rename ats_ metrics to trafficserver_" [puppet] - 10https://gerrit.wikimedia.org/r/854605 (owner: 10BCornwall)
[18:49:23] <logmsgbot>	 !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1023.eqiad.wmnet with OS bullseye
[18:49:38] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye
[19:00:05] <jouncebot>	 Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221109T1900)
[19:00:53] <logmsgbot>	 !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1023.eqiad.wmnet with OS bullseye
[19:01:05] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye
[19:01:55] <wikibugs>	 (03PS1) 10RLazarus: homer: Don't accept a commit with an empty message [puppet] - 10https://gerrit.wikimedia.org/r/855055
[19:05:53] <logmsgbot>	 !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1023.eqiad.wmnet with OS bullseye
[19:07:02] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye
[19:11:49] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005437 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[19:13:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1013.eqiad.wmnet with reason: Remove from cluster for eventual reimage
[19:14:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1013.eqiad.wmnet with reason: Remove from cluster for eventual reimage
[19:16:10] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the porting." [puppet] - 10https://gerrit.wikimedia.org/r/855055 (owner: 10RLazarus)
[19:19:34] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (10ayounsi) The DHCP requests were making it to cloudsw1-c8 but not further. cloudsw1-c8 was not creating binding neither (so it was not processing them).  I enabled traceoptions...
[19:24:23] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:26:12] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10Jclark-ctr) @Cmjohnson ganeti1033 had a bad cable replaced.  ganeti1034 is connected properly and has link for management
[19:29:06] <wikibugs>	 (CR) Andrea Denisse: netmon: Put the netmon2002 as passive server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/854625 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[19:35:12] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye
[19:35:17] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye
[19:35:19] <logmsgbot>	 !log root@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1023.eqiad.wmnet with OS bullseye
[19:35:25] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by root@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye executed with errors: - cloudvirt1023 (**FAI...
[19:37:56] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye
[19:38:01] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye
[19:46:21] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:47:05] <wikibugs>	 (03PS1) 10Nray: Fix TOC misaligned when max width option is disable [skins/Vector] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855068 (https://phabricator.wikimedia.org/T322162)
[19:47:27] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Enable profile::auto_restarts::service for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/854991 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[19:47:50] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1033
[19:47:52] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1033
[19:47:56] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1034
[19:47:58] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1034
[19:50:27] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1033.mgmt.eqiad.wmnet with reboot policy FORCED
[19:50:55] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[19:52:11] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1033.mgmt.eqiad.wmnet with reboot policy FORCED
[19:53:58] <wikibugs>	 (03PS1) 10Andrew Bogott: netboot cloudvirts: only preserve /srv on cloudvirt1028 [puppet] - 10https://gerrit.wikimedia.org/r/855057 (https://phabricator.wikimedia.org/T319042)
[19:54:59] <wikibugs>	 (03PS1) 10Ottomata: Create platform-eng-deployers group for deploying airflow platform_eng [puppet] - 10https://gerrit.wikimedia.org/r/855059 (https://phabricator.wikimedia.org/T321925)
[19:56:14] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] netboot cloudvirts: only preserve /srv on cloudvirt1028 [puppet] - 10https://gerrit.wikimedia.org/r/855057 (https://phabricator.wikimedia.org/T319042) (owner: 10Andrew Bogott)
[19:57:13] <wikibugs>	 (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38054/console" [puppet] - 10https://gerrit.wikimedia.org/r/855059 (https://phabricator.wikimedia.org/T321925) (owner: 10Ottomata)
[19:59:07] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1034.mgmt.eqiad.wmnet with reboot policy FORCED
[19:59:28] <wikibugs>	 (03CR) 10Ottomata: [V: 03+1] "This should do it, but I can't recall if we need a restart of the keyholder service." [puppet] - 10https://gerrit.wikimedia.org/r/855059 (https://phabricator.wikimedia.org/T321925) (owner: 10Ottomata)
[20:00:25] <wikibugs>	 (CR) Ottomata: [V: +1] Create platform-eng-deployers group for deploying airflow platform_eng (1 comment) [puppet] - https://gerrit.wikimedia.org/r/855059 (https://phabricator.wikimedia.org/T321925) (owner: Ottomata)
[20:01:03] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1034.mgmt.eqiad.wmnet with reboot policy FORCED
[20:01:15] <logmsgbot>	 !log root@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1023.eqiad.wmnet with OS bullseye
[20:01:21] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by root@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye executed with errors: - cloudvirt1023 (**FAI...
[20:01:31] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye
[20:01:37] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye
[20:04:08] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1033.mgmt.eqiad.wmnet with reboot policy FORCED
[20:05:54] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1033.mgmt.eqiad.wmnet with reboot policy FORCED
[20:07:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH)
[20:10:02] <wikibugs>	 (03Abandoned) 10Nray: Enable VectorVisualEnhancementsNext flag on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855041 (https://phabricator.wikimedia.org/T322673) (owner: 10Nray)
[20:11:07] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "lgtm, timer and service has been created on aphlict1001" [puppet] - 10https://gerrit.wikimedia.org/r/854991 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[20:12:15] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] roles: add/update role contacts for aphlict,miscweb,planet,rt [puppet] - 10https://gerrit.wikimedia.org/r/853454 (owner: 10Dzahn)
[20:12:50] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1034.mgmt.eqiad.wmnet with reboot policy FORCED
[20:14:48] <wikibugs>	 (03PS1) 10Dzahn: phabricator: set ServiceOps-Collab as role contacts [puppet] - 10https://gerrit.wikimedia.org/r/855062
[20:16:05] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: set ServiceOps-Collab as role contacts [puppet] - 10https://gerrit.wikimedia.org/r/855062 (owner: 10Dzahn)
[20:18:33] <wikibugs>	 (03CR) 10Dzahn: "yea, there was a reason though why this was enabled. it was a follow-up to performance issues in the past. it has been a long time though." [puppet] - 10https://gerrit.wikimedia.org/r/854514 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar)
[20:19:29] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "Info: /Stage[main]/Profile::Contacts/Concat[/etc/wikimedia/contacts.yaml]" [puppet] - 10https://gerrit.wikimedia.org/r/855062 (owner: 10Dzahn)
[20:21:57] <wikibugs>	 (03CR) 10Dzahn: "@Chad remember this? https://phabricator.wikimedia.org/rOPUP4bf4122b85dcbfc2587c3a30f72eccd8d556ad8b" [puppet] - 10https://gerrit.wikimedia.org/r/854514 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar)
[20:23:11] <wikibugs>	 (03PS2) 10Volans: sre.hosts.provision: use default if in UEFI mode [cookbooks] - 10https://gerrit.wikimedia.org/r/854545 (https://phabricator.wikimedia.org/T321128)
[20:23:35] <wikibugs>	 (03PS4) 10Jon Harald Søby: Add no=>nb to wgInterlanguageLinkCodeMap for some multilingual wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854618 (https://phabricator.wikimedia.org/T322696)
[20:23:58] <logmsgbot>	 !log root@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1023.eqiad.wmnet with OS bullseye
[20:24:10] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by root@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye executed with errors: - cloudvirt1023 (**FAI...
[20:24:17] <wikibugs>	 (CR) Hashar: gerrit: remove git gc aggressive (1 comment) [puppet] - https://gerrit.wikimedia.org/r/854514 (https://phabricator.wikimedia.org/T237807) (owner: Hashar)
[20:26:39] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1034.mgmt.eqiad.wmnet with reboot policy FORCED
[20:26:43] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1033.mgmt.eqiad.wmnet with reboot policy FORCED
[20:27:09] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: remove git gc aggressive [puppet] - 10https://gerrit.wikimedia.org/r/854514 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar)
[20:30:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T318605)', diff saved to https://phabricator.wikimedia.org/P38848 and previous config saved to /var/cache/conftool/dbconfig/20221109-203031-ladsgroup.json
[20:30:33] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye
[20:30:36] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[20:30:39] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye
[20:30:44] <mutante>	 !log gerrit2002 (gerrit-replica) - restarting gerrit service
[20:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:38] <wikibugs>	 (03CR) 10Xcollazo: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/855059 (https://phabricator.wikimedia.org/T321925) (owner: 10Ottomata)
[20:32:01] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "self-merging to unblock dcops" [cookbooks] - 10https://gerrit.wikimedia.org/r/854545 (https://phabricator.wikimedia.org/T321128) (owner: 10Volans)
[20:33:51] <wikibugs>	 (03CR) 10Dzahn: [C: 04-2] "can be abandoned after https://gerrit.wikimedia.org/r/c/operations/puppet/+/853061 ?" [puppet] - 10https://gerrit.wikimedia.org/r/824222 (https://phabricator.wikimedia.org/T315445) (owner: 10Hashar)
[20:33:52] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (10Krinkle)
[20:35:02] <mutante>	 !log gerrit1001 (gerrit) - restarting gerrit service to disable aggressive garbage collection. gerrit:854514 - T237807
[20:35:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:08] <stashbot>	 T237807: gerrit: scoring/ores/editquality takes a long time to git gc - https://phabricator.wikimedia.org/T237807
[20:36:36] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.provision: use default if in UEFI mode [cookbooks] - 10https://gerrit.wikimedia.org/r/854545 (https://phabricator.wikimedia.org/T321128) (owner: 10Volans)
[20:38:09] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "ack, thanks Antoine. deployed and I also did the gerrit service restart on both servers, just now on prod" [puppet] - 10https://gerrit.wikimedia.org/r/854514 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar)
[20:41:02] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1033.mgmt.eqiad.wmnet with reboot policy FORCED
[20:41:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH)
[20:42:26] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1023.eqiad.wmnet with reason: host reimage
[20:45:02] <wikibugs>	 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 4 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10greg)
[20:45:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P38849 and previous config saved to /var/cache/conftool/dbconfig/20221109-204538-ladsgroup.json
[20:45:49] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1023.eqiad.wmnet with reason: host reimage
[20:46:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH)
[20:51:14] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:51:33] <wikibugs>	 (03PS11) 10Dzahn: dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260
[20:51:58] <wikibugs>	 (03CR) 10Dzahn: dumps/distribution: add more data types to parameters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn)
[20:52:04] <wikibugs>	 (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089
[20:52:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn)
[20:52:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 (owner: 10CDanis)
[20:53:25] <wikibugs>	 (03CR) 10Dzahn: "hrmm. still syntax error modules/wmflib/types/dumps/mirror.pp, line: 13" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn)
[20:53:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH)
[20:54:46] <wikibugs>	 (03CR) 10Dzahn: "brackets! [{  not  {[  :)" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn)
[20:55:13] <wikibugs>	 (03PS12) 10Dzahn: dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260
[20:55:50] <wikibugs>	 (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089
[20:58:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 (owner: 10CDanis)
[20:59:10] <wikibugs>	 (03CR) 10Dzahn: "parameter 'rsync_mirrors' index 5 entry 'addeddate' expects a String[1] value, got String" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn)
[20:59:30] <wikibugs>	 (03PS3) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089
[21:00:03] <wikibugs>	 (03CR) 10Dzahn: "entry 'active' expects a match for Stdlib::Yes_no = Pattern[/\A(?i:(yes|no))\z/], got 'notrightnow'   hahaha" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn)
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221109T2100).
[21:00:05] <jouncebot>	 dancy, nray, and Jhs: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:18] <nray>	 o/ im here
[21:00:33] * TheresNoTime can deploy!
[21:00:35] <Jhs>	 present
[21:00:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P38850 and previous config saved to /var/cache/conftool/dbconfig/20221109-210044-ladsgroup.json
[21:01:06] <TheresNoTime>	 Jhs: I'll start with yours :)
[21:01:11] <Jhs>	 cool :)
[21:01:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854618 (https://phabricator.wikimedia.org/T322696) (owner: 10Jon Harald Søby)
[21:01:32] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:01:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 (owner: 10CDanis)
[21:01:59] <wikibugs>	 (03PS4) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089
[21:02:07] <dancy>	 o/
[21:02:11] <wikibugs>	 (03Merged) 10jenkins-bot: Add no=>nb to wgInterlanguageLinkCodeMap for some multilingual wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854618 (https://phabricator.wikimedia.org/T322696) (owner: 10Jon Harald Søby)
[21:02:28] <TheresNoTime>	 hey dancy, just doing the other config change (854618), can do yours next?
[21:02:29] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:854618|Add no=>nb to wgInterlanguageLinkCodeMap for some multilingual wikis (T322696)]]
[21:02:33] <stashbot>	 T322696: Interlanguage links to no.wikipedia.org on wikis that use Wikipedia as the interlanguage link target should use the language name "Norsk bokmål" instead of just "Norsk" - https://phabricator.wikimedia.org/T322696
[21:02:42] <dancy>	 TheresNoTime: ok!
[21:02:49] <logmsgbot>	 !log samtar@deploy1002 samtar and jhsoby: Backport for [[gerrit:854618|Add no=>nb to wgInterlanguageLinkCodeMap for some multilingual wikis (T322696)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[21:03:28] <TheresNoTime>	 Jhs: that's live on mwdebug, can you test? :)
[21:03:59] <Jhs>	 TheresNoTime, confirmed, looks correct on all affected wikis 👍
[21:04:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 (owner: 10CDanis)
[21:04:19] <TheresNoTime>	 great, syncing
[21:05:10] <wikibugs>	 (03PS3) 10Samtar: Only Enable LBFactory config callback in CLI in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: 10Ahmon Dancy)
[21:05:24] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:06:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:07:11] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ryasmeen (superset access with no server access) - https://phabricator.wikimedia.org/T322795 (10Ryasmeen)
[21:08:36] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:854618|Add no=>nb to wgInterlanguageLinkCodeMap for some multilingual wikis (T322696)]] (duration: 06m 06s)
[21:08:40] <TheresNoTime>	 Jhs: that should be live now :)
[21:08:41] <stashbot>	 T322696: Interlanguage links to no.wikipedia.org on wikis that use Wikipedia as the interlanguage link target should use the language name "Norsk bokmål" instead of just "Norsk" - https://phabricator.wikimedia.org/T322696
[21:08:47] <Jhs>	 TheresNoTime, great, thank you!
[21:08:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: 10Ahmon Dancy)
[21:09:02] <wikibugs>	 (03PS5) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089
[21:09:14] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1023.eqiad.wmnet with OS bullseye
[21:09:21] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by root@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye completed: - cloudvirt1023 (**WARN**)   - Re...
[21:09:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[21:09:37] <wikibugs>	 (03Merged) 10jenkins-bot: Only Enable LBFactory config callback in CLI in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: 10Ahmon Dancy)
[21:09:50] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:854090|Only Enable LBFactory config callback in CLI in production (T298485)]]
[21:09:54] <stashbot>	 T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485
[21:10:10] <logmsgbot>	 !log samtar@deploy1002 samtar and dancy: Backport for [[gerrit:854090|Only Enable LBFactory config callback in CLI in production (T298485)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[21:10:17] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "deploying — starting merge" [skins/Vector] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855068 (https://phabricator.wikimedia.org/T322162) (owner: 10Nray)
[21:10:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[21:10:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[21:10:23] <wikibugs>	 (03PS1) 10JHathaway: aux-k8s: remove istio mesh values [deployment-charts] - 10https://gerrit.wikimedia.org/r/855092 (https://phabricator.wikimedia.org/T321120)
[21:10:27] <TheresNoTime>	 dancy: live on mwdebug :)
[21:10:30] <dancy>	 testing...
[21:11:05] <wikibugs>	 (03PS6) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089
[21:11:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 (owner: 10CDanis)
[21:11:14] <wikibugs>	 (03PS2) 10Daniel Kinzler: mediawiki.org: set VE to new direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855029
[21:11:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[21:11:30] <dancy>	 TheresNoTime: confirmed.
[21:11:37] <TheresNoTime>	 syncin'
[21:12:20] <TheresNoTime>	 nray: your patch is merging now FYI
[21:12:32] <nray>	 TheresNoTime: thank you!
[21:13:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/855089 (owner: 10CDanis)
[21:14:55] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] homer: Don't accept a commit with an empty message [puppet] - 10https://gerrit.wikimedia.org/r/855055 (owner: 10RLazarus)
[21:15:00] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (10Andrew) 05Open→03Resolved
[21:15:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10Andrew)
[21:15:32] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:854090|Only Enable LBFactory config callback in CLI in production (T298485)]] (duration: 05m 41s)
[21:15:37] <stashbot>	 T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485
[21:15:44] <wikibugs>	 (03PS7) 10CDanis: No-op change. Replace the idea of stickycounters with actions [puppet] - 10https://gerrit.wikimedia.org/r/855089 (https://phabricator.wikimedia.org/T306580)
[21:15:46] <dancy>	 Gracias!
[21:15:46] <TheresNoTime>	 dancy: that's live now :)
[21:15:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T318605)', diff saved to https://phabricator.wikimedia.org/P38851 and previous config saved to /var/cache/conftool/dbconfig/20221109-211551-ladsgroup.json
[21:15:53] <TheresNoTime>	 yw!
[21:15:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[21:15:56] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[21:16:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[21:16:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T318605)', diff saved to https://phabricator.wikimedia.org/P38852 and previous config saved to /var/cache/conftool/dbconfig/20221109-211613-ladsgroup.json
[21:16:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[21:17:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[21:17:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[21:17:58] <wikibugs>	 (03PS1) 10Dzahn: dumps/distribution: fix values that don't fit into data types [puppet] - 10https://gerrit.wikimedia.org/r/855096
[21:18:06] <wikibugs>	 (03CR) 10CDanis: "PCC confirms no-op on both experimental and normal hosts https://puppet-compiler.wmflabs.org/pcc-worker1003/38062/" [puppet] - 10https://gerrit.wikimedia.org/r/855089 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis)
[21:18:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[21:19:26] <wikibugs>	 (03CR) 10Dzahn: "ArielGlenn, Hokwelum: now there would first be this more simple change to look at: https://gerrit.wikimedia.org/r/c/operations/puppet/+/85" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn)
[21:19:38] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Remove obsolete files for OpenStack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/851123 (owner: 10Andrew Bogott)
[21:19:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dumps/distribution: fix values that don't fit into data types [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn)
[21:20:25] <wikibugs>	 (03PS8) 10CDanis: No-op change. Replace the idea of stickycounters with actions [puppet] - 10https://gerrit.wikimedia.org/r/855089 (https://phabricator.wikimedia.org/T306580)
[21:21:16] <TheresNoTime>	 855068 *almost* merged :)
[21:21:44] <nray>	 👍
[21:21:58] <wikibugs>	 (03CR) 10CDanis: "updated pcc still lgtm https://puppet-compiler.wmflabs.org/pcc-worker1001/38063/" [puppet] - 10https://gerrit.wikimedia.org/r/855089 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis)
[21:22:42] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] aux-k8s: remove istio mesh values [deployment-charts] - 10https://gerrit.wikimedia.org/r/855092 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway)
[21:23:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] No-op change. Replace the idea of stickycounters with actions [puppet] - 10https://gerrit.wikimedia.org/r/855089 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis)
[21:23:44] <wikibugs>	 (03PS1) 10Ahmon Dancy: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/855097
[21:25:56] <wikibugs>	 (03Merged) 10jenkins-bot: Fix TOC misaligned when max width option is disable [skins/Vector] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855068 (https://phabricator.wikimedia.org/T322162) (owner: 10Nray)
[21:26:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855068 (https://phabricator.wikimedia.org/T322162) (owner: 10Nray)
[21:26:35] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:855068|Fix TOC misaligned when max width option is disable (T322162)]]
[21:26:40] <stashbot>	 T322162: [M] Table of contents misaligned with max width disabled - https://phabricator.wikimedia.org/T322162
[21:26:55] <logmsgbot>	 !log samtar@deploy1002 samtar and nray: Backport for [[gerrit:855068|Fix TOC misaligned when max width option is disable (T322162)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[21:26:57] <TheresNoTime>	 nray: that's live on mwdebug now, can you test?
[21:27:28] <wikibugs>	 (03PS2) 10Dzahn: dumps/distribution: fix values that don't fit into data types [puppet] - 10https://gerrit.wikimedia.org/r/855096
[21:27:36] <nray>	 TheresNoTime: thank you, which server is it on?
[21:27:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH) When I attempt to pull up https://ganeti1033.mgmt.eqiad.wmnet I get 'Bad Request'   ` Bad Request  Your browser sent a request that this server could not unde...
[21:27:40] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/855097 (owner: 10Ahmon Dancy)
[21:28:02] <TheresNoTime>	 nray: use `mwdebug1001` :)
[21:28:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dumps/distribution: fix values that don't fit into data types [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn)
[21:28:08] <nray>	 k, checking
[21:28:15] <wikibugs>	 (03PS3) 10Dzahn: dumps/distribution: fix values that don't fit into data types [puppet] - 10https://gerrit.wikimedia.org/r/855096
[21:28:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[21:28:29] <wikibugs>	 (03Merged) 10jenkins-bot: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/855097 (owner: 10Ahmon Dancy)
[21:28:50] <wikibugs>	 (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/855098
[21:29:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[21:29:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[21:29:54] <wikibugs>	 (03PS2) 10Andrew Bogott: Open magnum and heat apis to the greater internet [puppet] - 10https://gerrit.wikimedia.org/r/854092 (https://phabricator.wikimedia.org/T319312)
[21:29:56] <wikibugs>	 (03PS1) 10Andrew Bogott: Upgrade openstack libs on Bullseye VMs to version 'xena' [puppet] - 10https://gerrit.wikimedia.org/r/855099
[21:30:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[21:30:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T318605)', diff saved to https://phabricator.wikimedia.org/P38853 and previous config saved to /var/cache/conftool/dbconfig/20221109-213010-ladsgroup.json
[21:30:14] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[21:31:08] <nray>	 TheresNoTime: things look good! You may proceed!
[21:31:15] <TheresNoTime>	 syncin'!
[21:32:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Open magnum and heat apis to the greater internet [puppet] - 10https://gerrit.wikimedia.org/r/854092 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott)
[21:32:24] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Upgrade openstack libs on Bullseye VMs to version 'xena' [puppet] - 10https://gerrit.wikimedia.org/r/855099 (owner: 10Andrew Bogott)
[21:32:31] <wikibugs>	 (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/855098
[21:32:56] <wikibugs>	 (03PS2) 10Andrew Bogott: Upgrade openstack libs on Bullseye VMs to version 'xena' [puppet] - 10https://gerrit.wikimedia.org/r/855099
[21:33:42] <nray>	 TheresNoTime: Thank you for your help!
[21:33:48] <icinga-wm>	 PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 38 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:34:12] <TheresNoTime>	 nray: no worries! it'll take another few minutes to be live everywhere, and it's worth checking again on production proper :)
[21:35:16] <wikibugs>	 (03PS3) 10CDanis: haproxy: concurrency tracking as discussed [puppet] - 10https://gerrit.wikimedia.org/r/855098 (https://phabricator.wikimedia.org/T306580)
[21:35:24] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:855068|Fix TOC misaligned when max width option is disable (T322162)]] (duration: 08m 48s)
[21:35:29] <stashbot>	 T322162: [M] Table of contents misaligned with max width disabled - https://phabricator.wikimedia.org/T322162
[21:35:49] <TheresNoTime>	 nray: that should be live now :)
[21:35:57] <nray>	 looks good, thank you!
[21:36:55] <wikibugs>	 (03CR) 10CDanis: "PCC LGTM (matches my hand-crafted file from manual testing) https://puppet-compiler.wmflabs.org/pcc-worker1001/38065/" [puppet] - 10https://gerrit.wikimedia.org/r/855098 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis)
[21:37:02] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/38066/clouddumps1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[21:37:03] <TheresNoTime>	 :)
[21:37:08] <wikibugs>	 (03PS6) 10Dzahn: dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597)
[21:37:12] <TheresNoTime>	 !log closing UTC late backport window
[21:37:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:50] <icinga-wm>	 RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 12 probes of 773 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:39:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[21:42:49] <wikibugs>	 (03PS4) 10CDanis: haproxy: concurrency tracking as discussed [puppet] - 10https://gerrit.wikimedia.org/r/855098 (https://phabricator.wikimedia.org/T306580)
[21:43:43] <wikibugs>	 (03PS1) 10RobH: site.pp update for ganeti103[34] [puppet] - 10https://gerrit.wikimedia.org/r/855100 (https://phabricator.wikimedia.org/T314303)
[21:43:48] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "well..now it fails because it's not in cloud.yaml, whether we need it or not." [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[21:44:33] <wikibugs>	 (03CR) 10RobH: [C: 03+2] site.pp update for ganeti103[34] [puppet] - 10https://gerrit.wikimedia.org/r/855100 (https://phabricator.wikimedia.org/T314303) (owner: 10RobH)
[21:45:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P38854 and previous config saved to /var/cache/conftool/dbconfig/20221109-214516-ladsgroup.json
[21:45:25] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "rabbit hole alert over here :)  maybe it's "Could not find declared class openstack::nova::common::victoria::buster"" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[21:46:41] <wikibugs>	 (03CR) 10Dzahn: "not sure if this is transient but I just got this on an unrelated change: Could not find declared class openstack::nova::common::victoria:" [puppet] - 10https://gerrit.wikimedia.org/r/855099 (owner: 10Andrew Bogott)
[21:47:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH)
[21:47:28] <wikibugs>	 (03PS7) 10Dzahn: dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597)
[21:47:46] <wikibugs>	 (03CR) 10Dzahn: "rebasing after seeing https://gerrit.wikimedia.org/r/c/operations/puppet/+/855099/" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[21:48:12] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1034.eqiad.wmnet with OS bullseye
[21:48:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye
[21:49:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[21:50:58] <wikibugs>	 (03PS1) 10Eevans: Add component/gocql to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/855102 (https://phabricator.wikimedia.org/T283838)
[21:55:21] <logmsgbot>	 !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti1034.eqiad.wmnet with OS bullseye
[21:55:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye executed w...
[21:57:23] <wikibugs>	 (03PS1) 10RobH: adding ganeti103[34] netboot [puppet] - 10https://gerrit.wikimedia.org/r/855105 (https://phabricator.wikimedia.org/T314303)
[21:57:42] <wikibugs>	 (03CR) 10RobH: [C: 03+2] adding ganeti103[34] netboot [puppet] - 10https://gerrit.wikimedia.org/r/855105 (https://phabricator.wikimedia.org/T314303) (owner: 10RobH)
[21:58:53] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "compiling on C:rsync::server  (because that works, unlike C:rsync or C:rsync::quickdatacopy which is a defined type)" [puppet] - 10https://gerrit.wikimedia.org/r/850171 (owner: 10Jbond)
[22:00:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P38855 and previous config saved to /var/cache/conftool/dbconfig/20221109-220023-ladsgroup.json
[22:01:12] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1034.eqiad.wmnet with OS bullseye
[22:01:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye
[22:06:20] <logmsgbot>	 !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti1034.eqiad.wmnet with OS bullseye
[22:06:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye executed w...
[22:08:07] <wikibugs>	 (03CR) 10Dzahn: "I noticed contint2002 in puppet board as failed and it's because "Could not find class ::role::insetup::unowned"" [puppet] - 10https://gerrit.wikimedia.org/r/852216 (owner: 10Muehlenhoff)
[22:11:55] <wikibugs>	 (03PS1) 10Dzahn: site: move contint2002 from insetup::unowned to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/855147
[22:13:39] <wikibugs>	 (03PS2) 10Dzahn: site: move contint2002 from insetup::unowned to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/855147
[22:14:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] site: move contint2002 from insetup::unowned to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/855147 (owner: 10Dzahn)
[22:14:51] <wikibugs>	 (03CR) 10Dzahn: "CI: Expected one space after 'Bug:'  Phabricator: I don't care, already added it :p" [puppet] - 10https://gerrit.wikimedia.org/r/855147 (owner: 10Dzahn)
[22:15:15] <wikibugs>	 (03PS3) 10Dzahn: site: move contint2002 from insetup::unowned to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/855147 (https://phabricator.wikimedia.org/T294276)
[22:15:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] site: move contint2002 from insetup::unowned to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/855147 (https://phabricator.wikimedia.org/T294276) (owner: 10Dzahn)
[22:15:26] <wikibugs>	 (03PS2) 10Andrea Denisse: netmon: Open LibreNMS port for netmon2002. [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523)
[22:15:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T318605)', diff saved to https://phabricator.wikimedia.org/P38856 and previous config saved to /var/cache/conftool/dbconfig/20221109-221529-ladsgroup.json
[22:15:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[22:15:35] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[22:15:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[22:15:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T318605)', diff saved to https://phabricator.wikimedia.org/P38857 and previous config saved to /var/cache/conftool/dbconfig/20221109-221551-ladsgroup.json
[22:16:51] <wikibugs>	 (CR) Dzahn: "follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/855147" [puppet] - https://gerrit.wikimedia.org/r/852216 (owner: Muehlenhoff)
[22:17:26] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38070/console" [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[22:17:42] <wikibugs>	 (CR) CI reject: [V: -1] netmon: Open LibreNMS port for netmon2002. [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[22:19:29] <wikibugs>	 (CR) Dzahn: [C: +2] "puppet runs on contint2002 again. fwiw a user Admin::Hashuser[stevemunene] was created which seems a new root user from https://phabricato" [puppet] - https://gerrit.wikimedia.org/r/855147 (https://phabricator.wikimedia.org/T294276) (owner: Dzahn)
[22:23:30] <wikibugs>	 (CR) Dzahn: [C: +1] "it's interesting how the compiler lists a host under "Hosts that compile with differences" but when you click it it claims "no change" on " [puppet] - https://gerrit.wikimedia.org/r/850171 (owner: Jbond)
[22:25:50] <wikibugs>	 (CR) Dzahn: [C: +2] R:rsync::manifests::server::module: add type validation [puppet] - https://gerrit.wikimedia.org/r/850171 (owner: Jbond)
[22:28:40] <wikibugs>	 (CR) Dzahn: [C: +2] "noop confirmed on various random hosts (doc2001, contint2002, gitlab1003, mw1318, mirror1001).. will also watch puppetboard in a couple mi" [puppet] - https://gerrit.wikimedia.org/r/850171 (owner: Jbond)
[22:32:42] <wikibugs>	 (CR) Dzahn: [C: +2] "also checked in cloud on gitlab-prod-1001.devtools.eqiad1.wikimedia.cloud" [puppet] - https://gerrit.wikimedia.org/r/850171 (owner: Jbond)
[22:36:28] <wikibugs>	 (CR) Dzahn: rsync::quickdatacopy: Allow having multiple destination hosts (3 comments) [puppet] - https://gerrit.wikimedia.org/r/715636 (owner: Legoktm)
[22:36:45] <wikibugs>	 (PS3) Dzahn: rsync::quickdatacopy: Allow having multiple destination hosts [puppet] - https://gerrit.wikimedia.org/r/715636 (owner: Legoktm)
[22:37:08] <wikibugs>	 (PS4) Dzahn: rsync::quickdatacopy: Allow having multiple destination hosts [puppet] - https://gerrit.wikimedia.org/r/715636 (owner: Legoktm)
[22:39:30] <wikibugs>	 (CR) CI reject: [V: -1] rsync::quickdatacopy: Allow having multiple destination hosts [puppet] - https://gerrit.wikimedia.org/r/715636 (owner: Legoktm)
[22:41:19] <wikibugs>	 (CR) Dzahn: "more of Could not find declared class openstack::nova::common::victoria::buster from unrelated cloud change afaict" [puppet] - https://gerrit.wikimedia.org/r/715636 (owner: Legoktm)
[22:47:34] <icinga-wm>	 PROBLEM - Check systemd state on phab1004 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_aphlict.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:51:59] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1034.eqiad.wmnet with OS bullseye
[22:52:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye
[23:00:50] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[23:03:57] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[23:04:11] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1034.eqiad.wmnet with reason: host reimage
[23:07:34] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1034.eqiad.wmnet with reason: host reimage
[23:17:03] <tzatziki>	 !log removing 1 file for legal compliance
[23:17:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:13] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1034.eqiad.wmnet with OS bullseye
[23:22:18] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye executed with errors: - ganeti10...
[23:43:19] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (RobH)
[23:43:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (RobH) @MoritzMuehlenhoff i recall you stating the puppet run fails in the installer but then just re-run and it's fine?  If so, ganeti1034 is ready for ya.
[23:44:11] <tzatziki>	 !log removing 2 files for legal compliance
[23:44:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:10] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[23:57:00] <tzatziki>	 !log removing 1 file for legal compliance
[23:57:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log