[00:05:06] <sukhe>	 !log disable puppet on dns4003 till we resolve the puppet failures
[00:05:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:53] <icinga-wm>	 PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS5511/IPv4: Connect - Orange, AS5511/IPv6: Connect - Orange https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:35:53] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:40:35] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:55:37] <icinga-wm>	 RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 57, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:11:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[01:17:05] <icinga-wm>	 RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:29:05] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (Papaul) @andew if the server is not in production can i take a quick look at it
[01:38:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:53:43] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[01:58:21] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[02:04:05] <icinga-wm>	 PROBLEM - SSH on db1113.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:05:21] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:08:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:49] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cloudvirt1023.eqiad.wmnet
[02:19:50] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudvirt1023.eqiad.wmnet
[02:20:36] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cloudvirt1023.eqiad.wmnet
[02:20:51] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudvirt1023.eqiad.wmnet
[02:21:29] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cloudvirt1023.eqiad.wmnet
[02:23:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[02:24:21] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:59:26] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (Papaul) @ayounsi @cmooney looks like we are having a situation similar to https://phabricator.wikimedia.org/T303296. The server racked in B7 is sending request to the DHCP serv...
[02:59:30] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudvirt1023.eqiad.wmnet
[03:05:25] <icinga-wm>	 RECOVERY - SSH on db1113.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:06:37] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:09:01] <icinga-wm>	 PROBLEM - Check systemd state on mw1316 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:14:31] <icinga-wm>	 RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:16:07] <wikibugs>	 (PS2) DDesouza: Remove Research Incentive survey from arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/837695 (https://phabricator.wikimedia.org/T318328)
[03:25:35] <icinga-wm>	 RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:00:23] <icinga-wm>	 RECOVERY - Check systemd state on mw1316 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:00:06] <wikibugs>	 (PS1) Marostegui: es2030: Upgrade mariadb [puppet] - https://gerrit.wikimedia.org/r/838288
[05:00:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2030', diff saved to https://phabricator.wikimedia.org/P35352 and previous config saved to /var/cache/conftool/dbconfig/20221005-050018-root.json
[05:01:28] <wikibugs>	 (CR) Marostegui: [C: +2] es2030: Upgrade mariadb [puppet] - https://gerrit.wikimedia.org/r/838288 (owner: Marostegui)
[05:09:37] <wikibugs>	 (PS1) Marostegui: Revert "es2030: Upgrade mariadb" [puppet] - https://gerrit.wikimedia.org/r/838213
[05:09:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35353 and previous config saved to /var/cache/conftool/dbconfig/20221005-050944-root.json
[05:11:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:12:20] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es2030: Upgrade mariadb" [puppet] - https://gerrit.wikimedia.org/r/838213 (owner: Marostegui)
[05:13:55] <wikibugs>	 (PS1) Marostegui: control-mariadb-5.5,control-mysql-5.6: Remove them [software] - https://gerrit.wikimedia.org/r/838534
[05:16:44] <wikibugs>	 (CR) Marostegui: [C: +2] control-mariadb-5.5,control-mysql-5.6: Remove them [software] - https://gerrit.wikimedia.org/r/838534 (owner: Marostegui)
[05:17:20] <wikibugs>	 (Merged) jenkins-bot: control-mariadb-5.5,control-mysql-5.6: Remove them [software] - https://gerrit.wikimedia.org/r/838534 (owner: Marostegui)
[05:24:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35354 and previous config saved to /var/cache/conftool/dbconfig/20221005-052449-root.json
[05:32:59] <wikibugs>	 (CR) Ayounsi: [C: +1] sites.yaml: add dns4003 to anycast_neighbors [homer/public] - https://gerrit.wikimedia.org/r/838239 (https://phabricator.wikimedia.org/T317247) (owner: Ssingh)
[05:33:48] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 62044
[05:39:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35355 and previous config saved to /var/cache/conftool/dbconfig/20221005-053954-root.json
[05:41:36] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:41:46] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:42:54] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.221 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:43:02] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:46:04] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (phab1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:50:42] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 62044
[05:51:14] <icinga-wm>	 RECOVERY - SSH on analytics1076.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:55:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35356 and previous config saved to /var/cache/conftool/dbconfig/20221005-055459-root.json
[05:58:12] <icinga-wm>	 RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:09:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:10:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35357 and previous config saved to /var/cache/conftool/dbconfig/20221005-061004-root.json
[06:23:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[06:25:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35358 and previous config saved to /var/cache/conftool/dbconfig/20221005-062509-root.json
[06:27:14] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] Move kafka-logging1002's Kafka TLS config to PKI [puppet] - https://gerrit.wikimedia.org/r/838123 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[06:27:34] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on kafka-logging1002.eqiad.wmnet with reason: Kafka PKI upgrade
[06:27:47] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on kafka-logging1002.eqiad.wmnet with reason: Kafka PKI upgrade
[06:30:02] <elukey>	 !log restart kafka on kafka-logging1002 to pick up the new cert+settings for PKI
[06:30:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:12] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (ayounsi) @Papaul ping me when you're around and I can walk you through it. TLDR is: `cloudsw1-c8-eqiad# deactivate vlans cloud-hosts1-eqiad forwarding-options dhcp-security opt...
[06:31:18] <wikibugs>	 (CR) Ayounsi: [V: +1 C: +2] Use license keys stored in Netbox instead of homer-private [homer/public] - https://gerrit.wikimedia.org/r/838188 (https://phabricator.wikimedia.org/T311008) (owner: Ayounsi)
[06:31:27] <wikibugs>	 (CR) Ayounsi: [V: +1 C: +2] LibreNMS report: ignore licenses [software/netbox-extras] - https://gerrit.wikimedia.org/r/838187 (owner: Ayounsi)
[06:32:45] <wikibugs>	 (Merged) jenkins-bot: Use license keys stored in Netbox instead of homer-private [homer/public] - https://gerrit.wikimedia.org/r/838188 (https://phabricator.wikimedia.org/T311008) (owner: Ayounsi)
[06:33:00] <wikibugs>	 (Merged) jenkins-bot: LibreNMS report: ignore licenses [software/netbox-extras] - https://gerrit.wikimedia.org/r/838187 (owner: Ayounsi)
[06:34:36] <wikibugs>	 (PS1) Elukey: Move kafka-logging1003 to the kafka PKI intermediate CA [puppet] - https://gerrit.wikimedia.org/r/838643 (https://phabricator.wikimedia.org/T300130)
[06:36:15] <jinxer-wm>	 (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[06:36:48] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 5 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37445/console" [puppet] - https://gerrit.wikimedia.org/r/838643 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[06:40:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35359 and previous config saved to /var/cache/conftool/dbconfig/20221005-064014-root.json
[06:41:15] <wikibugs>	 (PS1) Elukey: role::kafka::logging: final clean up after migrating to PKI [puppet] - https://gerrit.wikimedia.org/r/838650 (https://phabricator.wikimedia.org/T300130)
[06:41:15] <jinxer-wm>	 (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[06:42:55] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37446/console" [puppet] - https://gerrit.wikimedia.org/r/838650 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[06:43:51] <wikibugs>	 (CR) Muehlenhoff: netops::ripeatlas::cli: Switch to systemd::sysuser (1 comment) [puppet] - https://gerrit.wikimedia.org/r/837072 (owner: Muehlenhoff)
[06:43:57] <wikibugs>	 (PS3) Muehlenhoff: netops::ripeatlas::cli: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837072
[06:44:16] <icinga-wm>	 PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:44:43] <wikibugs>	 (CR) Elukey: [V: +1] "The pcc bit to consider is:" [puppet] - https://gerrit.wikimedia.org/r/838650 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[06:55:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35360 and previous config saved to /var/cache/conftool/dbconfig/20221005-065519-root.json
[06:58:32] <wikibugs>	 (CR) Muehlenhoff: [C: +2] netops::ripeatlas::cli: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/837072 (owner: Muehlenhoff)
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221005T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:39] <wikibugs>	 (CR) Giuseppe Lavagetto: confd: export template status as Prometheus metrics (3 comments) [puppet] - https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) (owner: Filippo Giunchedi)
[07:01:23] <wikibugs>	 (PS1) Muehlenhoff: Fix parameter [puppet] - https://gerrit.wikimedia.org/r/838668
[07:02:36] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Fix parameter [puppet] - https://gerrit.wikimedia.org/r/838668 (owner: Muehlenhoff)
[07:09:34] <icinga-wm>	 PROBLEM - SSH on db1113.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:09:35] <wikibugs>	 SRE, serviceops: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 (elukey)
[07:11:23] <wikibugs>	 (CR) Hashar: gerrit: decouple scap and daemon users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/832345 (https://phabricator.wikimedia.org/T317412) (owner: Hashar)
[07:13:22] <wikibugs>	 (CR) Hashar: [C: +1] "Well done thanks :)" [puppet] - https://gerrit.wikimedia.org/r/833385 (owner: Hashar)
[07:18:44] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Move kafka-logging1003 to the kafka PKI intermediate CA [puppet] - https://gerrit.wikimedia.org/r/838643 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[07:19:07] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] role::kafka::logging: final clean up after migrating to PKI [puppet] - https://gerrit.wikimedia.org/r/838650 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[07:21:57] <wikibugs>	 (CR) DCausse: [C: +1] beta: Set shard count for commonswiki_file to 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/838272 (https://phabricator.wikimedia.org/T316711) (owner: Ebernhardson)
[07:34:39] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) Open→Resolved I am going to tentatively consider this fixed. It's been a month since we repooled the hosts with...
[07:45:36] <icinga-wm>	 RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:48:11] <wikibugs>	 (CR) DCausse: [C: +1] cirrus: remove cross-dc poolcounter increases (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/838269 (owner: Ebernhardson)
[07:48:36] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:49:06] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] Move kafka-logging1003 to the kafka PKI intermediate CA [puppet] - https://gerrit.wikimedia.org/r/838643 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[07:49:58] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on kafka-logging1003.eqiad.wmnet with reason: Kafka PKI upgrade
[07:50:11] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on kafka-logging1003.eqiad.wmnet with reason: Kafka PKI upgrade
[07:52:55] <wikibugs>	 (CR) DCausse: [C: +1] cirrus: Drop client side connect timeout config [mediawiki-config] - https://gerrit.wikimedia.org/r/838276 (https://phabricator.wikimedia.org/T143553) (owner: Ebernhardson)
[07:54:19] <elukey>	 !log restart kafka on kafka-logging1003 to pick up new PKI TLS settings
[07:54:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:15] <wikibugs>	 (PS2) Elukey: role::kafka::logging: final clean up after migrating to PKI [puppet] - https://gerrit.wikimedia.org/r/838650 (https://phabricator.wikimedia.org/T300130)
[07:55:42] <wikibugs>	 (CR) Filippo Giunchedi: confd: export template status as Prometheus metrics (3 comments) [puppet] - https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) (owner: Filippo Giunchedi)
[07:57:06] <wikibugs>	 (PS3) Filippo Giunchedi: confd: export template status as Prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272)
[07:57:08] <wikibugs>	 (PS3) Filippo Giunchedi: confd: install and run confd_prometheus_metrics [puppet] - https://gerrit.wikimedia.org/r/838079 (https://phabricator.wikimedia.org/T319272)
[08:00:54] <wikibugs>	 (CR) Filippo Giunchedi: confd: export template status as Prometheus metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) (owner: Filippo Giunchedi)
[08:03:37] <wikibugs>	 (CR) Muehlenhoff: [C: +2] New cookbook to roll-restart (or roll-reboot) the eventschemas cluster(s) (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/836181 (owner: Muehlenhoff)
[08:05:44] <wikibugs>	 (PS1) Ayounsi: Only apply the license stanza when needed [homer/public] - https://gerrit.wikimedia.org/r/838715 (https://phabricator.wikimedia.org/T311008)
[08:08:52] <wikibugs>	 (CR) Ayounsi: [V: +1 C: +2] "Tested locally with an empty inventory and inventory with no licenses." [homer/public] - https://gerrit.wikimedia.org/r/838715 (https://phabricator.wikimedia.org/T311008) (owner: Ayounsi)
[08:09:35] <wikibugs>	 (Merged) jenkins-bot: Only apply the license stanza when needed [homer/public] - https://gerrit.wikimedia.org/r/838715 (https://phabricator.wikimedia.org/T311008) (owner: Ayounsi)
[08:28:17] <wikibugs>	 (Abandoned) Hashar: gerrit: disable automatic plugin handling [puppet] - https://gerrit.wikimedia.org/r/831913 (https://phabricator.wikimedia.org/T317412) (owner: Hashar)
[08:30:04] <jouncebot>	 hoo: gettimeofday() says it's time for Wikibase client unexpectedUnconnectedPage page prop format conversion. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221005T0830)
[08:34:15] <jinxer-wm>	 (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[08:39:15] <jinxer-wm>	 (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[08:39:46] <wikibugs>	 SRE, Traffic: Don't set cookies for api.wikimedia.org at the caching layer - https://phabricator.wikimedia.org/T260943 (hnowlan) Late responding on this one but thanks a lot for adding this feature!
[08:57:00] <wikibugs>	 (PS1) Hoo man: Disable UnconnectedPagePagePropMigrationLegacyFormat for arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/838732
[08:57:57] <wikibugs>	 (CR) Vgutierrez: [C: +2] vcl: stop overriding cache-control header for bad title errors [puppet] - https://gerrit.wikimedia.org/r/837742 (https://phabricator.wikimedia.org/T316932) (owner: Zabe)
[08:59:57] <wikibugs>	 (Abandoned) Vgutierrez: Revert "Cache Badtitle 400s for 60s in varnish-fe" [puppet] - https://gerrit.wikimedia.org/r/769827 (owner: Legoktm)
[09:02:59] <wikibugs>	 (CR) Hoo man: [C: +2] Disable UnconnectedPagePagePropMigrationLegacyFormat for arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/838732 (owner: Hoo man)
[09:04:01] <wikibugs>	 (Merged) jenkins-bot: Disable UnconnectedPagePagePropMigrationLegacyFormat for arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/838732 (owner: Hoo man)
[09:05:54] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp1086 is OK: reload-vcl successfully ran 0h, 2 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:06:55] <moritzm>	 !log reimport ganeti 3.0.1-1~bpo10+1 to component/ganeti3 (got removed alongside via a reprepro bug/misfeature when the bullseye component was removed)
[09:06:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:54] <logmsgbot>	 !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable UnconnectedPagePagePropMigrationLegacyFormat for arwiki (duration: 03m 49s)
[09:10:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:11:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[09:11:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:11:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:12:10] <icinga-wm>	 RECOVERY - SSH on db1113.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:15:02] <wikibugs>	 (PS12) Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595)
[09:15:24] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] k8s: Align formatting along k8s module and profiles [puppet] - https://gerrit.wikimedia.org/r/838168 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[09:15:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:15:27] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] Remove unused mwautopull class [puppet] - https://gerrit.wikimedia.org/r/838169 (https://phabricator.wikimedia.org/T284628) (owner: JMeybohm)
[09:17:30] <wikibugs>	 (PS13) Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595)
[09:20:05] <dcausse>	 !log restarting blazegraph on wdqs1014 (BlazegraphFreeAllocatorsDecreasingRapidly)
[09:20:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:54] <moritzm>	 !log upgrading ganeti/eqiad nodes to Ganeti 3 T311687
[09:21:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:58] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[09:23:38] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[09:24:21] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero) The problem described on {T319300} may block work on some servers, but we have plenty of others to migrate, so we should have enough work to do.
[09:25:41] <wikibugs>	 (PS1) Hoo man: Disable UnconnectedPagePagePropMigrationLegacyFormat for ruwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/838746
[09:26:05] <wikibugs>	 (CR) Hoo man: [C: +2] Disable UnconnectedPagePagePropMigrationLegacyFormat for ruwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/838746 (owner: Hoo man)
[09:26:51] <wikibugs>	 (Merged) jenkins-bot: Disable UnconnectedPagePagePropMigrationLegacyFormat for ruwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/838746 (owner: Hoo man)
[09:27:42] <icinga-wm>	 PROBLEM - Host ps1-oe14-esams is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:42] <icinga-wm>	 PROBLEM - Host ps1-oe16-esams is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:30] <icinga-wm>	 PROBLEM - Host cp3052.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:30] <icinga-wm>	 PROBLEM - Host cp3051.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:30] <icinga-wm>	 PROBLEM - Host cp3053.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:32] <icinga-wm>	 PROBLEM - Host cp3054.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:32] <icinga-wm>	 PROBLEM - Host cp3050.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:32] <icinga-wm>	 PROBLEM - Host cp3065.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:36] <hashar>	 oups
[09:28:44] <icinga-wm>	 RECOVERY - Host ps1-oe14-esams is UP: PING OK - Packet loss = 0%, RTA = 81.99 ms
[09:29:18] <icinga-wm>	 RECOVERY - Host ps1-oe16-esams is UP: PING OK - Packet loss = 0%, RTA = 81.98 ms
[09:30:20] <icinga-wm>	 PROBLEM - Host scs-oe16-esams is DOWN: PING CRITICAL - Packet loss = 100%
[09:30:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:31:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[09:31:11] <logmsgbot>	 !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable UnconnectedPagePagePropMigrationLegacyFormat for ruwikinews (duration: 03m 39s)
[09:31:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:31:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:32:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:32:58] <icinga-wm>	 RECOVERY - Host cp3065.mgmt is UP: PING OK - Packet loss = 0%, RTA = 81.69 ms
[09:33:06] <wikibugs>	 SRE, Cloud-Services, Infrastructure-Foundations, netops: Routing loop for unused WMCS IPs in 185.15.56.0/24 - https://phabricator.wikimedia.org/T315956 (cmooney) Open→Resolved
[09:34:21] <XioNoX>	 looking
[09:34:25] <XioNoX>	 looks like mgmt issue
[09:34:30] <XioNoX>	 cc topranks 
[09:34:31] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[09:34:40] <volans>	 seems so at first sight
[09:34:54] <icinga-wm>	 RECOVERY - Host cp3052.mgmt is UP: PING OK - Packet loss = 0%, RTA = 81.66 ms
[09:34:54] <icinga-wm>	 RECOVERY - Host cp3051.mgmt is UP: PING OK - Packet loss = 0%, RTA = 81.62 ms
[09:34:54] <icinga-wm>	 RECOVERY - Host cp3053.mgmt is UP: PING OK - Packet loss = 0%, RTA = 81.63 ms
[09:34:56] <icinga-wm>	 RECOVERY - Host cp3054.mgmt is UP: PING OK - Packet loss = 0%, RTA = 81.66 ms
[09:34:56] <icinga-wm>	 RECOVERY - Host cp3050.mgmt is UP: PING OK - Packet loss = 0%, RTA = 81.49 ms
[09:35:15] <XioNoX>	 can someone check the calendar and maint-announce?
[09:35:20] <volans>	 looking
[09:35:50] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (cmooney) a:cmooney Thanks @ayounsi.  Yeah 9216 was default max I had used for the VXLAN stuff originally, but 9192 is more than enough to support a 9,000 byte IP packet and allow for the VXLA...
[09:35:51] <volans>	 XioNoX: not right now
[09:36:13] <volans>	 did something just reboot?
[09:36:44] <icinga-wm>	 RECOVERY - Host scs-oe16-esams is UP: PING OK - Packet loss = 0%, RTA = 81.43 ms
[09:36:49] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Validate new (anycast) IPv6 /48 announcement being accepted by transits - https://phabricator.wikimedia.org/T301900 (cmooney) Open→Resolved Thanks @ayounsi.  I didn't finish checking every single one but it was accepted by all our major transits and is...
[09:37:39] <XioNoX>	 interfaces on mr1-esams to msw-oe14-esams:47 and msw-oe16-esams:47 flapped 8min ago, so most likely the msw rebooted
[09:39:56] <topranks>	 yeah bit odd... but seems to be ok now pings solid
[09:40:31] <XioNoX>	 https://www.irccloud.com/pastebin/ALQ4ejqU/
[09:40:44] <XioNoX>	 that's quite the flaps on multiple circuits/racks
[09:41:24] <XioNoX>	 so Feed X from rack oe14/16, and feed Y from oe15
[09:41:41] <XioNoX>	 that's why the mgmt switch on oe15 didn't go down
[09:42:11] <XioNoX>	 everything critical has dual power supplies, that's what saved us
[09:42:18] <topranks>	 hmm... I wonder are those feeds mixed up in the cabling perhaps?
[09:42:55] <topranks>	 i.e. on FPC 5 are the two feeds in different sockets than on the rest of the devices?
[09:45:26] <wikibugs>	 (PS1) Ayounsi: Network MTU check, remove 9216 from allowlist [software/netbox-extras] - https://gerrit.wikimedia.org/r/838755 (https://phabricator.wikimedia.org/T315838)
[09:46:31] <XioNoX>	 could be too
[09:47:58] <icinga-wm>	 PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:49:15] <volans>	 also not all the hosts in the same rack alerted
[09:51:36] <hoo>	 !log Ran extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all of arwiki
[09:51:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:04] <hoo>	 !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all of ruwikinews
[09:52:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:07] <wikibugs>	 (PS10) Hnowlan: thumbor: new service chart [deployment-charts] - https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196)
[10:00:38] <wikibugs>	 (PS2) Btullis: Add a new production image for spark version 3.3.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730)
[10:01:32] <wikibugs>	 (PS3) Btullis: Add a new production image for spark version 3.3.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730)
[10:03:10] <wikibugs>	 (PS4) Btullis: Add a new production image for spark version 3.3.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730)
[10:08:43] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "The patch is correct, I just have a UX question but feel free to merge the patch and we can change behaviour later" [software/conftool] - https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: Jbond)
[10:09:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:09:19] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] reqconfig: Add a default for git_repo and ensure its a Path [software/conftool] - https://gerrit.wikimedia.org/r/805338 (owner: Jbond)
[10:10:49] <wikibugs>	 (CR) Ayounsi: [C: +2] "Self merge as trivial" [software/netbox-extras] - https://gerrit.wikimedia.org/r/838755 (https://phabricator.wikimedia.org/T315838) (owner: Ayounsi)
[10:11:48] <wikibugs>	 (Merged) jenkins-bot: Network MTU check, remove 9216 from allowlist [software/netbox-extras] - https://gerrit.wikimedia.org/r/838755 (https://phabricator.wikimedia.org/T315838) (owner: Ayounsi)
[10:11:51] <wikibugs>	 (Merged) jenkins-bot: reqconfig: Add a default for git_repo and ensure its a Path [software/conftool] - https://gerrit.wikimedia.org/r/805338 (owner: Jbond)
[10:17:52] <icinga-wm>	 RECOVERY - Host ripe-atlas-esams is UP: PING OK - Packet loss = 0%, RTA = 81.21 ms
[10:19:40] <wikibugs>	 SRE, Thumbor, Thumbor Migration, serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (hnowlan)
[10:20:53] <wikibugs>	 (CR) JMeybohm: [C: +1] admin: add thumbor namespace (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/824473 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[10:21:40] <icinga-wm>	 RECOVERY - Host ripe-atlas-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 81.74 ms
[10:23:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[10:25:20] <wikibugs>	 (PS1) Jbond: P:openstack::base::neutron: dont use the legacy naming [puppet] - https://gerrit.wikimedia.org/r/838761
[10:26:08] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37447/console" [puppet] - https://gerrit.wikimedia.org/r/838761 (owner: Jbond)
[10:27:48] <wikibugs>	 SRE, Traffic, Upstream: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 (Vgutierrez)  ATS is supposed to perform a cache_sync_dir every 60 seconds per the undocumented config setting `proxy.config.cache.dir.sync_...
[10:28:45] <wikibugs>	 (PS1) Hnowlan: changeprop: remove remaining blocklist entries [deployment-charts] - https://gerrit.wikimedia.org/r/838762 (https://phabricator.wikimedia.org/T274359)
[10:30:58] <wikibugs>	 (PS2) Jbond: P:openstack::base::neutron: dont use the legacy naming [puppet] - https://gerrit.wikimedia.org/r/838761
[10:31:53] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37448/console" [puppet] - https://gerrit.wikimedia.org/r/838761 (owner: Jbond)
[10:32:59] <wikibugs>	 SRE, Traffic, Performance-Team (Radar), Upstream: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 (Vgutierrez)
[10:34:12] <icinga-wm>	 PROBLEM - Host cp2036 is DOWN: PING CRITICAL - Packet loss = 100%
[10:34:44] <wikibugs>	 (PS1) Hoo man: Disable UnconnectedPagePagePropMigrationLegacyFormat for commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/838764
[10:35:11] <wikibugs>	 (PS3) Jbond: P:openstack::base::neutron: dont use the legacy naming [puppet] - https://gerrit.wikimedia.org/r/838761
[10:35:58] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37449/console" [puppet] - https://gerrit.wikimedia.org/r/838761 (owner: Jbond)
[10:36:23] <moritzm>	 !log installing gdk-pixbuf security updates
[10:36:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:30] <wikibugs>	 (PS4) Jbond: P:openstack::base::neutron: dont use the legacy naming [puppet] - https://gerrit.wikimedia.org/r/838761
[10:38:21] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37450/console" [puppet] - https://gerrit.wikimedia.org/r/838761 (owner: Jbond)
[10:38:32] <wikibugs>	 (CR) Hoo man: [C: +2] Disable UnconnectedPagePagePropMigrationLegacyFormat for commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/838764 (owner: Hoo man)
[10:39:49] <wikibugs>	 (Merged) jenkins-bot: Disable UnconnectedPagePagePropMigrationLegacyFormat for commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/838764 (owner: Hoo man)
[10:42:02] <wikibugs>	 SRE, Product-Infrastructure-Team-Backlog, WMDE-TechWish-Maintenance, serviceops, and 3 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (awight)
[10:43:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[10:44:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[10:44:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[10:44:17] <vgutierrez>	 uh... we lost cp2036?
[10:44:37] <logmsgbot>	 !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable UnconnectedPagePagePropMigrationLegacyFormat for commonswiki (duration: 03m 51s)
[10:44:50] <wikibugs>	 (PS5) Jbond: P:openstack::base::neutron: dont use the legacy naming [puppet] - https://gerrit.wikimedia.org/r/838761 (https://phabricator.wikimedia.org/T319300)
[10:45:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[10:45:23] <wikibugs>	 (CR) Hnowlan: admin: add thumbor namespace (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/824473 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[10:46:26] <hoo>	 !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for commonswiki
[10:46:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:52] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "genius" [puppet] - https://gerrit.wikimedia.org/r/838761 (https://phabricator.wikimedia.org/T319300) (owner: Jbond)
[10:48:29] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2036.codfw.wmnet
[10:50:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[10:51:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[10:51:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[10:51:10] <wikibugs>	 (PS4) Filippo Giunchedi: confd: export template status as Prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272)
[10:51:12] <wikibugs>	 (PS4) Filippo Giunchedi: confd: install and run confd_prometheus_metrics [puppet] - https://gerrit.wikimedia.org/r/838079 (https://phabricator.wikimedia.org/T319272)
[10:51:44] <wikibugs>	 (CR) CI reject: [V: -1] confd: export template status as Prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) (owner: Filippo Giunchedi)
[10:52:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[10:53:02] <vgutierrez>	 !log powercycle cp2036 - T319394
[10:53:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:06] <stashbot>	 T319394: cp2036 crashed on 2022-10-05 - https://phabricator.wikimedia.org/T319394
[10:53:57] <wikibugs>	 (CR) Jbond: [C: +2] P:openstack::base::neutron: dont use the legacy naming [puppet] - https://gerrit.wikimedia.org/r/838761 (https://phabricator.wikimedia.org/T319300) (owner: Jbond)
[10:56:02] <icinga-wm>	 RECOVERY - Host cp2036 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms
[10:59:00] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff)
[11:01:25] <vgutierrez>	 !log repool cp2036 - T319394
[11:01:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:30] <stashbot>	 T319394: cp2036 crashed on 2022-10-05 - https://phabricator.wikimedia.org/T319394
[11:04:13] <moritzm>	 !log running "gnt-cluster upgrade --to 3.0" for ganeti/eqiad T311687 
[11:04:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:18] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[11:04:50] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (cmooney) @Jclark-ctr there is a discrepancy with the port allocation here.  Apologies I'd been working on some input validation in Netbox to prevent thi...
[11:05:02] <XioNoX>	 if eqsin mgmt alerts, it's because of me
[11:06:24] <XioNoX>	 looks like it didn't :)
[11:06:35] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openstack: neutron: l3_agent: more support for new vlan naming [puppet] - https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300)
[11:06:55] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic, Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (Vgutierrez) @wiki_willy could you help us prioritizing the remaining work on eqiad? this needs to be fixed ASAP
[11:09:29] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: openstack: neutron: l3_agent: more support for new vlan naming [puppet] - https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300)
[11:09:54] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqiad_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:36] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/37453/" [puppet] - https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300) (owner: Arturo Borrero Gonzalez)
[11:10:52] <wikibugs>	 (CR) Jbond: [C: -1] "lgtm but small clean up still needed" [puppet] - https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300) (owner: Arturo Borrero Gonzalez)
[11:11:02] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:11:59] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: -1] "Actually, found a problem. The sysctl expects interface/vlan syntax rather than interface.vlan, so need an additional consideration for th" [puppet] - https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300) (owner: Arturo Borrero Gonzalez)
[11:12:02] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff)
[11:15:15] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[11:16:22] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:18:34] <icinga-wm>	 PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1006 is CRITICAL: CRITICAL: no netns defined? https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[11:20:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1029.eqiad.wmnet with OS bullseye
[11:21:46] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:22:05] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: openstack: neutron: l3_agent: more support for new vlan naming [puppet] - https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300)
[11:22:24] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:24:19] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1] "This PCC is better: https://puppet-compiler.wmflabs.org/pcc-worker1003/37454/" [puppet] - https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300) (owner: Arturo Borrero Gonzalez)
[11:26:13] <wikibugs>	 (PS13) Jbond: reqconfig: add ip validation for ipblocks [software/conftool] - https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825)
[11:27:20] <wikibugs>	 (CR) Arturo Borrero Gonzalez: P:terraform: add a new basic terraform module registry (2 comments) [puppet] - https://gerrit.wikimedia.org/r/834344 (https://phabricator.wikimedia.org/T317480) (owner: Majavah)
[11:28:03] <wikibugs>	 (CR) Jbond: "updated" [software/conftool] - https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: Jbond)
[11:28:33] <wikibugs>	 (CR) CI reject: [V: -1] reqconfig: add ip validation for ipblocks [software/conftool] - https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: Jbond)
[11:29:07] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300) (owner: Arturo Borrero Gonzalez)
[11:29:45] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] openstack: neutron: l3_agent: more support for new vlan naming [puppet] - https://gerrit.wikimedia.org/r/838771 (https://phabricator.wikimedia.org/T319300) (owner: Arturo Borrero Gonzalez)
[11:33:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1029.eqiad.wmnet with reason: host reimage
[11:33:37] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1005.eqiad.wmnet with OS bullseye
[11:33:38] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye
[11:33:41] <wikibugs>	 (PS14) Jbond: reqconfig: add ip validation for ipblocks [software/conftool] - https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825)
[11:37:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1029.eqiad.wmnet with reason: host reimage
[11:38:35] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudnet1006: don't use legacy naming for vlan NICs [puppet] - https://gerrit.wikimedia.org/r/838786 (https://phabricator.wikimedia.org/T319300)
[11:41:10] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/838786 (https://phabricator.wikimedia.org/T319300) (owner: Arturo Borrero Gonzalez)
[11:42:54] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudnet1006: don't use legacy naming for vlan NICs [puppet] - https://gerrit.wikimedia.org/r/838786 (https://phabricator.wikimedia.org/T319300) (owner: Arturo Borrero Gonzalez)
[11:47:44] <wikibugs>	 (PS7) Majavah: P:terraform: add a new basic terraform module registry [puppet] - https://gerrit.wikimedia.org/r/834344 (https://phabricator.wikimedia.org/T317480)
[11:49:08] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1005.eqiad.wmnet with reason: host reimage
[11:49:22] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage
[11:49:48] <icinga-wm>	 RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:49:48] <wikibugs>	 (CR) Majavah: P:terraform: add a new basic terraform module registry (2 comments) [puppet] - https://gerrit.wikimedia.org/r/834344 (https://phabricator.wikimedia.org/T317480) (owner: Majavah)
[11:50:37] <wikibugs>	 (PS30) Jbond: C:varnish: Rate limit hotlinking dry-run [puppet] - https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799)
[11:51:07] <wikibugs>	 (CR) CI reject: [V: -1] C:varnish: Rate limit hotlinking dry-run [puppet] - https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) (owner: Jbond)
[11:52:03] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1005.eqiad.wmnet with reason: host reimage
[11:52:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1029.eqiad.wmnet with OS bullseye
[11:53:23] <XioNoX>	 !log fix MTU between eqiad core routers and cloudsw - T315838
[11:53:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:27] <stashbot>	 T315838: Set consistent MTUs - https://phabricator.wikimedia.org/T315838
[11:54:44] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage
[11:57:16] <wikibugs>	 (PS5) Filippo Giunchedi: confd: export template status as Prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272)
[11:57:18] <wikibugs>	 (PS5) Filippo Giunchedi: confd: install and run confd_prometheus_metrics [puppet] - https://gerrit.wikimedia.org/r/838079 (https://phabricator.wikimedia.org/T319272)
[12:02:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1030.eqiad.wmnet with OS bullseye
[12:05:21] <wikibugs>	 SRE, Thumbor, Wikimedia-SVG-rendering: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (Aklapper)
[12:06:18] <wikibugs>	 SRE, Thumbor, Wikimedia-SVG-rendering: Update librsvg to ≥2.42.3 (2.44.10) - https://phabricator.wikimedia.org/T193352 (Aklapper)
[12:06:50] <wikibugs>	 SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (Volans) p:Triage→Medium
[12:07:28] <wikibugs>	 (PS1) Hoo man: Disable UnconnectedPagePagePropMigrationLegacyFormat for enwiki/zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/838788
[12:10:18] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for MHorsey - https://phabricator.wikimedia.org/T318729 (MHorsey-WMF)
[12:13:58] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet1005.eqiad.wmnet with OS bullseye
[12:15:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1030.eqiad.wmnet with reason: host reimage
[12:16:41] <logmsgbot>	 !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host cloudnet1006.eqiad.wmnet with OS bullseye
[12:18:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1030.eqiad.wmnet with reason: host reimage
[12:20:44] <wikibugs>	 (CR) Hoo man: [C: +2] Disable UnconnectedPagePagePropMigrationLegacyFormat for enwiki/zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/838788 (owner: Hoo man)
[12:21:28] <wikibugs>	 (Merged) jenkins-bot: Disable UnconnectedPagePagePropMigrationLegacyFormat for enwiki/zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/838788 (owner: Hoo man)
[12:28:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[12:28:49] <logmsgbot>	 !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable UnconnectedPagePagePropMigrationLegacyFormat for enwiki/zhwiki (duration: 03m 46s)
[12:30:04] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye
[12:31:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1030.eqiad.wmnet with OS bullseye
[12:32:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[12:32:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[12:33:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[12:37:46] <wikibugs>	 SRE, Infrastructure-Foundations: Create an IDM for Wikimedia developer accounts - https://phabricator.wikimedia.org/T319405 (MoritzMuehlenhoff)
[12:40:00] <wikibugs>	 SRE, Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (MoritzMuehlenhoff)
[12:41:34] <hoo>	 !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for enwiki
[12:41:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:02] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[12:43:34] <wikibugs>	 SRE, Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (MoritzMuehlenhoff)
[12:43:35] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudnet1003: decom host [puppet] - https://gerrit.wikimedia.org/r/838793 (https://phabricator.wikimedia.org/T316284)
[12:45:42] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage
[12:46:16] <wikibugs>	 SRE, Infrastructure-Foundations: Initial Django project setup - https://phabricator.wikimedia.org/T319410 (MoritzMuehlenhoff)
[12:46:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1031.eqiad.wmnet with OS bullseye
[12:47:00] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1031.eqiad.wmnet with OS bullseye
[12:47:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1031.eqiad.wmnet with OS bullseye
[12:48:39] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage
[12:50:30] <wikibugs>	 (CR) Samtar: [C: +2] "lgtm!" [deployment-charts] - https://gerrit.wikimedia.org/r/838762 (https://phabricator.wikimedia.org/T274359) (owner: Hnowlan)
[12:52:59] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM (though I haven't verified the same commits / system options are applied to opensearch and logstash). Adding Cole" [puppet] - https://gerrit.wikimedia.org/r/838253 (owner: Ryan Kemper)
[12:53:31] <TheresNoTime>	 hnowlan: https://gerrit.wikimedia.org/r/838762 will need manual deploying, right? Probably should have asked before +2ing..
[12:54:01] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM (not voting though as I'm not sure enough)" [puppet] - https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319021) (owner: Bking)
[12:54:10] <wikibugs>	 (Merged) jenkins-bot: changeprop: remove remaining blocklist entries [deployment-charts] - https://gerrit.wikimedia.org/r/838762 (https://phabricator.wikimedia.org/T274359) (owner: Hnowlan)
[12:59:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1031.eqiad.wmnet with reason: host reimage
[12:59:50] <vgutierrez>	 !log vgutierrez@apt1001:~$ sudo -i reprepro --component thirdparty/haproxy24 update buster-wikimedia # fetch HAProxy 2.4.19
[12:59:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:02] <wikibugs>	 SRE, Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (Volans) > A tentative initial name is Charon FYI It seems that's taken already in [[ https://pypi.org/search/?q=charon | PyPI ]] and there are similar ones in [[ https://packages.debian.org/search?keywor...
[13:00:04] <wikibugs>	 SRE, Infrastructure-Foundations: Evaluate Striker codebase - https://phabricator.wikimedia.org/T319415 (MoritzMuehlenhoff)
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221005T1300).
[13:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:17] <vgutierrez>	 !log test HAProxy 2.4.19 in cp4026 && cp4032
[13:00:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:58] <Lucas_WMDE>	 o/
[13:01:48] <wikibugs>	 (CR) Filippo Giunchedi: ats: Alert on high connection/request count (1 comment) [alerts] - https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[13:03:09] <wikibugs>	 (PS14) Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595)
[13:03:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1031.eqiad.wmnet with reason: host reimage
[13:03:30] <wikibugs>	 (CR) Vgutierrez: "please note that we are no longer using ATS 8.x in production" [alerts] - https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[13:04:08] <Lucas_WMDE>	 (looks like nothing to deploy indeed)
[13:04:15] <hoo>	 !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for zhwiki
[13:04:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:06] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Check annotations in alerting rules only [alerts] - 10https://gerrit.wikimedia.org/r/838797
[13:07:28] <moritzm>	 !log draining ganeti1012 T311687
[13:07:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:32] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[13:08:18] <wikibugs>	 (03PS1) 10David Caro: ceph.wait_for_cluster_healthy: add elapsed time too [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/838799 (https://phabricator.wikimedia.org/T315339)
[13:14:24] <wikibugs>	 (03PS1) 10Hoo man: Disable UnconnectedPagePagePropMigrationLegacyFormat for enwiktionary/frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838801
[13:14:46] <wikibugs>	 (03CR) 10Hoo man: [C: 03+2] Disable UnconnectedPagePagePropMigrationLegacyFormat for enwiktionary/frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838801 (owner: 10Hoo man)
[13:15:41] <wikibugs>	 (03Merged) 10jenkins-bot: Disable UnconnectedPagePagePropMigrationLegacyFormat for enwiktionary/frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838801 (owner: 10Hoo man)
[13:18:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1031.eqiad.wmnet with OS bullseye
[13:18:32] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:18:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:19:16] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki::php: use only php 7.4 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/838084 (https://phabricator.wikimedia.org/T318894)
[13:19:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:19:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:19:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mediawiki::php: use only php 7.4 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/838084 (https://phabricator.wikimedia.org/T318894) (owner: 10Giuseppe Lavagetto)
[13:20:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:21:17] <logmsgbot>	 !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable UnconnectedPagePagePropMigrationLegacyFormat for enwiktionary/frwiki (duration: 03m 38s)
[13:22:19] <SandraEbele>	 !log deploying fix for projectview dags on airflow
[13:22:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:17] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mediawiki::php: use only php 7.4 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/838084 (https://phabricator.wikimedia.org/T318894)
[13:23:42] <wikibugs>	 (03PS31) 10Jbond: C:varnish: Rate limit hotlinking dry-run [puppet] - 10https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799)
[13:23:44] <wikibugs>	 (03PS3) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/832621 (https://phabricator.wikimedia.org/T317799)
[13:24:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:varnish: Rate limit hotlinking dry-run [puppet] - 10https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond)
[13:24:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/832621 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond)
[13:25:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:26:09] <wikibugs>	 (03CR) 10Jbond: C:varnish: Rate limit hotlinking dry-run (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond)
[13:26:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10Jclark-ctr) @cmooney  sorry Dac was not seated completely. all good now
[13:26:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10fnegri)
[13:26:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:26:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:27:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:30:34] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (10fgiunchedi) > A tentative initial name is Charon, but we're happy to solicit further feedback via this task or the talk page of https://wikitech.wikimedia.org/wiki/Wikimedia_IDM  Agreed with @volans re:...
[13:32:38] <hnowlan>	 TheresNoTime: no worries, I can handle it soon :) 
[13:33:55] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 16): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37456/console" [puppet] - 10https://gerrit.wikimedia.org/r/838084 (https://phabricator.wikimedia.org/T318894) (owner: 10Giuseppe Lavagetto)
[13:36:54] <logmsgbot>	 !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@f7a68c2]: (no justification provided)
[13:37:06] <logmsgbot>	 !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@f7a68c2]: (no justification provided) (duration: 00m 12s)
[13:45:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (10SLyngshede-WMF) There's also a Norse version https://en.wikipedia.org/wiki/M%C3%B3%C3%B0gu%C3%B0r
[13:47:02] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (10Arnoldokoth) Hey @greg Yeah, Lisa G is your manager (confirmed on Namely). So will need approval from her (@Lgruwell-WMF ) as well as @Ottomata or @odimitrijevic
[13:47:56] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (10Arnoldokoth)
[13:49:12] <wikibugs>	 10SRE, 10SRE-Access-Requests: Remove old production ssh key for RelEng user - https://phabricator.wikimedia.org/T319274 (10Arnoldokoth) 05In progress→03Resolved
[13:52:08] <wikibugs>	 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) re: `MediaWiki EtcdConfig up-to-date` over the last 90d we got ~10 floods of varying intensity, ranging...
[13:52:11] <wikibugs>	 (03PS5) 10Btullis: Add a new production image for spark version 3.3.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730)
[13:52:14] <wikibugs>	 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi)
[13:52:34] <wikibugs>	 (03PS1) 10Jbond: P:bird::anycast: drop dependency [puppet] - 10https://gerrit.wikimedia.org/r/838804
[13:52:53] <wikibugs>	 (03CR) 10Btullis: "It's worth noting that the upstream Dockerfile, on which this is based, has some additional steps that I have not included here, relating " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[13:53:42] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37457/console" [puppet] - 10https://gerrit.wikimedia.org/r/838804 (owner: 10Jbond)
[13:55:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1032.eqiad.wmnet with OS bullseye
[13:56:33] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (10Ottomata) Approved.
[13:57:53] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (10Ottomata) Approved.
[14:02:39] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (10Arnoldokoth)
[14:03:41] <wikibugs>	 (03PS2) 10Jbond: P:bird::anycast: drop dependency [puppet] - 10https://gerrit.wikimedia.org/r/838804
[14:04:51] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37458/console" [puppet] - 10https://gerrit.wikimedia.org/r/838804 (owner: 10Jbond)
[14:05:38] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@7e16d2a]: Regular analytics weekly train [analytics/refinery@7e16d2a]
[14:06:50] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:07:02] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:08:02] <_joe_>	 jouncebot: now and next
[14:08:03] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 51 minute(s)
[14:08:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1032.eqiad.wmnet with reason: host reimage
[14:09:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:09:33] <sukhe>	 ^ yeah, known
[14:09:48] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::php: use only php 7.4 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/838084 (https://phabricator.wikimedia.org/T318894) (owner: 10Giuseppe Lavagetto)
[14:09:56] <wikibugs>	 (03CR) 10Ottomata: "Cool, ty!" [cookbooks] - 10https://gerrit.wikimedia.org/r/836181 (owner: 10Muehlenhoff)
[14:11:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1032.eqiad.wmnet with reason: host reimage
[14:13:51] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "Sorry 😇" [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[14:15:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr
[14:15:14] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet1006.eqiad.wmnet with OS bullseye
[14:15:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Jclark-ctr)
[14:16:05] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@7e16d2a]: Regular analytics weekly train [analytics/refinery@7e16d2a] (duration: 10m 27s)
[14:16:22] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@7e16d2a] (thin): Regular analytics weekly train THIN [analytics/refinery@7e16d2a]
[14:17:01] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] thumbor: new service chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[14:20:47] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@7e16d2a] (thin): Regular analytics weekly train THIN [analytics/refinery@7e16d2a] (duration: 04m 24s)
[14:23:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[14:23:33] <wikibugs>	 (03PS2) 10Andrew Bogott: Make cloudnet100[56] into cloudnet nodes [puppet] - 10https://gerrit.wikimedia.org/r/835657 (https://phabricator.wikimedia.org/T316284)
[14:23:37] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] thumbor: new service chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[14:26:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1032.eqiad.wmnet with OS bullseye
[14:30:23] <papaul>	 !log ongoing maintenance on msw1-eqiad
[14:30:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:14] <icinga-wm>	 PROBLEM - Check systemd state on mw1434 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:31:42] <volans>	 _joe_: ^^ can I assume this is a race condition?
[14:32:07] <volans>	 between the removal of the php7.2-fpm_check_restart and the icinga checks
[14:34:25] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 8359
[14:34:56] <icinga-wm>	 PROBLEM - Check systemd state on mw2290 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:35:12] <_joe_>	 volans: yes
[14:35:19] <_joe_>	 I hoped I would be fast enough to avoid that
[14:35:20] <wikibugs>	 (03PS3) 10Jbond: P:bird::anycast: drop dependency [puppet] - 10https://gerrit.wikimedia.org/r/838804
[14:35:36] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki::php: remove php 7.2 from the servers [puppet] - 10https://gerrit.wikimedia.org/r/838085 (https://phabricator.wikimedia.org/T318894)
[14:36:26] <_joe_>	 volans: no it's the timer that is still triggered even if you undeclare it
[14:36:31] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8359
[14:36:33] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37459/console" [puppet] - 10https://gerrit.wikimedia.org/r/838804 (owner: 10Jbond)
[14:36:38] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php: remove php 7.2 from the servers [puppet] - 10https://gerrit.wikimedia.org/r/838085 (https://phabricator.wikimedia.org/T318894) (owner: 10Giuseppe Lavagetto)
[14:36:49] <volans>	 ack, got it, thx
[14:36:53] <icinga-wm>	 PROBLEM - Host asw2-d-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[14:36:57] <icinga-wm>	 RECOVERY - Check nf_conntrack usage in neutron netns on cloudnet1005 is OK: OK: everything is apparently fine https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:37:35] <icinga-wm>	 PROBLEM - Host ps1-e4-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[14:37:35] <icinga-wm>	 PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[14:37:35] <icinga-wm>	 PROBLEM - Host ps1-e3-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[14:37:41] <icinga-wm>	 PROBLEM - Host ps1-e2-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[14:37:57] <volans>	 oh oh... XioNoX, topranks any work there related to these alerts? ^^^
[14:38:09] <papaul>	 volans: me
[14:38:11] <XioNoX>	 volans: yes
[14:38:15] <icinga-wm>	 PROBLEM - Host ps1-e1-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[14:38:30] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cloudnet1003.eqiad.wmnet with reason: decom
[14:38:44] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cloudnet1003.eqiad.wmnet with reason: decom
[14:38:51] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cloudnet1004.eqiad.wmnet with reason: decom
[14:38:53] <volans>	 yeah the ps I expected
[14:38:53] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:39:05] <volans>	 what I didn't expect were the asw2-d-eqiad / asw2-c-eqiad
[14:39:05] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cloudnet1004.eqiad.wmnet with reason: decom
[14:39:28] <XioNoX>	 volans: it's their mgmt interfaces, those devices are L2 only
[14:39:33] <icinga-wm>	 PROBLEM - Host asw2-b-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[14:39:33] <icinga-wm>	 PROBLEM - Host asw2-a-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[14:40:20] <volans>	 right, but it looked scarier than it is
[14:40:45] <icinga-wm>	 PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[16:38:01] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:39:42] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/838852 (owner: 10RLazarus)
[16:43:32] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/838852 (owner: 10RLazarus)
[16:44:02] <icinga-wm>	 RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:44:40] <wikibugs>	 (03CR) 10RLazarus: [V: 03+1 C: 03+2] cumin2002: Add an hourly httpbb run against mw2271 [puppet] - 10https://gerrit.wikimedia.org/r/838852 (owner: 10RLazarus)
[16:47:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10BTullis) Thanks for all your work on this @Andrew.  I'm going to do a fleet-wide check to see if anything still references t...
[16:47:50] <cjming>	 jouncebot: now
[16:47:50] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 12 minute(s)
[16:48:42] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+2] Enable Special:Contribute on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838281 (https://phabricator.wikimedia.org/T319240) (owner: 10Jdlrobson)
[16:49:28] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Special:Contribute on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838281 (https://phabricator.wikimedia.org/T319240) (owner: 10Jdlrobson)
[16:51:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838281 (https://phabricator.wikimedia.org/T319240) (owner: 10Jdlrobson)
[16:53:57] <cjming>	 !log deployed labs-only config
[16:54:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:55:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[16:55:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[16:55:59] <wikibugs>	 (03PS1) 10Btullis: Add a spark-on-k8s-operator production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730)
[16:56:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[16:57:36] <wikibugs>	 (03PS2) 10Btullis: Add a spark-on-k8s-operator production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730)
[17:00:36] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10dcaro)
[17:01:38] <wikibugs>	 (03PS3) 10Btullis: Add a spark-operator production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730)
[17:01:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10ArielGlenn) Note that labstore1006 has some html dumps that didn't make it around to the other boxes, so please don't reimag...
[17:03:09] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] ceph.wait_for_cluster_healthy: add elapsed time too [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/838799 (https://phabricator.wikimedia.org/T315339) (owner: 10David Caro)
[17:04:17] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10dcaro)
[17:04:39] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns4003 is OK: OK: UP (pid=23976) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[17:05:04] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) the 40G ports in cr1-eqiad to connect asw-c2 and asw-d2 are ready  `  papaul@re0.cr1-eqiad> show interfaces terse | match et-1/1/ et-1/1/0...
[17:06:33] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 261, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:07:02] <wikibugs>	 (03Merged) 10jenkins-bot: ceph.wait_for_cluster_healthy: add elapsed time too [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/838799 (https://phabricator.wikimedia.org/T315339) (owner: 10David Caro)
[17:12:38] <wikibugs>	 (03PS1) 10Dduvall: jwt_authorizer: Start service as configured owner/group [puppet] - 10https://gerrit.wikimedia.org/r/838859 (https://phabricator.wikimedia.org/T308501)
[17:12:55] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@7e16d2a] (thin): Regular analytics weekly train THIN [analytics/refinery@7e16d2a]
[17:14:43] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] jwt_authorizer: Start service as configured owner/group [puppet] - 10https://gerrit.wikimedia.org/r/838859 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall)
[17:17:19] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@7e16d2a] (thin): Regular analytics weekly train THIN [analytics/refinery@7e16d2a] (duration: 04m 24s)
[17:18:09] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@7e16d2a] (thin): Regular analytics weekly train THIN [analytics/refinery@7e16d2a]
[17:18:16] <jinxer-wm>	 (ThanosSidecarPrometheusDown) firing: Thanos Sidecar cannot connect to Prometheus - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarPrometheusDown
[17:18:28] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@7e16d2a] (thin): Regular analytics weekly train THIN [analytics/refinery@7e16d2a] (duration: 00m 18s)
[17:18:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[17:20:21] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@7e16d2a] (thin): Regular analytics weekly train THIN [analytics/refinery@7e16d2a]
[17:20:36] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@7e16d2a] (thin): Regular analytics weekly train THIN [analytics/refinery@7e16d2a] (duration: 00m 14s)
[17:22:11] <icinga-wm>	 RECOVERY - Check systemd state on dns4003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:23:16] <jinxer-wm>	 (ThanosSidecarPrometheusDown) resolved: Thanos Sidecar cannot connect to Prometheus - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarPrometheusDown
[17:28:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[17:29:21] <icinga-wm>	 RECOVERY - Host ps1-a6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms
[17:29:45] <icinga-wm>	 RECOVERY - Host ganeti1026.mgmt is UP: PING WARNING - Packet loss = 77%, RTA = 0.91 ms
[17:29:45] <icinga-wm>	 RECOVERY - Host ganeti1030.mgmt is UP: PING WARNING - Packet loss = 77%, RTA = 0.89 ms
[17:29:47] <icinga-wm>	 RECOVERY - Host ganeti1032.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 866.07 ms
[17:29:47] <icinga-wm>	 RECOVERY - Host ps1-a7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms
[17:30:01] <icinga-wm>	 RECOVERY - Host an-db1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 48.11 ms
[17:30:01] <icinga-wm>	 RECOVERY - Host an-master1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 56.04 ms
[17:30:07] <icinga-wm>	 RECOVERY - Host an-worker1082.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms
[17:30:07] <icinga-wm>	 RECOVERY - Host an-worker1081.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.93 ms
[17:30:08] <icinga-wm>	 RECOVERY - Host an-worker1103.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.25 ms
[17:30:08] <icinga-wm>	 RECOVERY - Host an-worker1122.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.49 ms
[17:30:09] <icinga-wm>	 RECOVERY - Host an-worker1123.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.32 ms
[17:30:10] <icinga-wm>	 RECOVERY - Host clouddb1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 14.03 ms
[17:30:15] <icinga-wm>	 RECOVERY - Host asw2-a-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
[17:30:25] <icinga-wm>	 RECOVERY - Host aqs1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.53 ms
[17:30:25] <icinga-wm>	 RECOVERY - Host backup1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.28 ms
[17:30:29] <icinga-wm>	 RECOVERY - Host cloudmetrics1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.87 ms
[17:30:29] <icinga-wm>	 RECOVERY - Host cloudmetrics1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.82 ms
[17:30:31] <icinga-wm>	 RECOVERY - Host cp1077.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.67 ms
[17:30:31] <icinga-wm>	 RECOVERY - Host cp1078.mgmt is UP: PING OK - Packet loss = 0%, RTA = 13.70 ms
[17:30:33] <icinga-wm>	 RECOVERY - Host ms-be1040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.57 ms
[17:30:33] <icinga-wm>	 RECOVERY - Host db1154.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.54 ms
[17:30:33] <icinga-wm>	 RECOVERY - Host db1159.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.27 ms
[17:30:33] <icinga-wm>	 RECOVERY - Host db1160.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.14 ms
[17:30:33] <icinga-wm>	 RECOVERY - Host elastic1070.mgmt is UP: PING OK - Packet loss = 0%, RTA = 8.81 ms
[17:30:33] <icinga-wm>	 RECOVERY - Host elastic1073.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.51 ms
[17:30:34] <icinga-wm>	 RECOVERY - Host elastic1071.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.52 ms
[17:30:34] <icinga-wm>	 RECOVERY - Host elastic1072.mgmt is UP: PING OK - Packet loss = 0%, RTA = 15.10 ms
[17:30:41] <icinga-wm>	 RECOVERY - Host ms-be1051.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms
[17:30:41] <icinga-wm>	 RECOVERY - Host ms-be1060.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[17:30:45] <icinga-wm>	 RECOVERY - Host ms-fe1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[17:30:45] <icinga-wm>	 RECOVERY - Host mw1309.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms
[17:30:45] <icinga-wm>	 RECOVERY - Host mw1307.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[17:30:45] <icinga-wm>	 RECOVERY - Host mw1308.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms
[17:30:45] <icinga-wm>	 RECOVERY - Host mw1310.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[17:30:45] <icinga-wm>	 RECOVERY - Host mw1312.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[17:30:46] <icinga-wm>	 RECOVERY - Host mw1311.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms
[17:30:47] <icinga-wm>	 RECOVERY - Host ores1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.79 ms
[17:30:47] <icinga-wm>	 RECOVERY - Host parse1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms
[17:30:47] <icinga-wm>	 RECOVERY - Host parse1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[17:30:55] <icinga-wm>	 RECOVERY - Host puppetmaster1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 6.98 ms
[17:30:55] <icinga-wm>	 RECOVERY - Host prometheus1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 8.73 ms
[17:30:55] <icinga-wm>	 RECOVERY - Host restbase-dev1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.30 ms
[17:30:57] <icinga-wm>	 RECOVERY - Host restbase1031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.80 ms
[17:30:57] <icinga-wm>	 RECOVERY - Host restbase1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.44 ms
[17:31:01] <icinga-wm>	 RECOVERY - Host stat1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 18.39 ms
[17:31:07] <icinga-wm>	 RECOVERY - Host thumbor1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.44 ms
[17:31:13] <icinga-wm>	 RECOVERY - Host wdqs1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.20 ms
[17:31:17] <icinga-wm>	 RECOVERY - Host kafka-main1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[17:32:05] <wikibugs>	 (03Abandoned) 10Ryan Kemper: Revert "elastic: reduce master-eligibles for codfw back down to 2" [puppet] - 10https://gerrit.wikimedia.org/r/830923 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking)
[17:33:01] <icinga-wm>	 RECOVERY - Host an-worker1139.mgmt is UP: PING OK - Packet loss = 0%, RTA = 6.60 ms
[17:33:53] <icinga-wm>	 RECOVERY - Host db1116.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms
[17:34:33] <icinga-wm>	 RECOVERY - Host krb1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms
[17:34:33] <icinga-wm>	 RECOVERY - Host kubernetes1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.99 ms
[17:34:45] <icinga-wm>	 RECOVERY - Host lvs1013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[17:34:45] <icinga-wm>	 RECOVERY - Host lvs1017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[17:34:55] <icinga-wm>	 RECOVERY - Host dbproxy1013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.54 ms
[17:34:55] <icinga-wm>	 RECOVERY - Host druid1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms
[17:34:55] <icinga-wm>	 RECOVERY - Host mc1037.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.42 ms
[17:34:55] <icinga-wm>	 RECOVERY - Host mc1038.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms
[17:35:03] <icinga-wm>	 RECOVERY - Host ganeti1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.49 ms
[17:35:15] <icinga-wm>	 RECOVERY - Host db1115.mgmt is UP: PING OK - Packet loss = 0%, RTA = 9.32 ms
[17:35:15] <icinga-wm>	 RECOVERY - Host db1096.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.70 ms
[17:35:19] <icinga-wm>	 RECOVERY - Host dbprov1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.37 ms
[17:35:55] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] "Resolving comment to get this out of "your turn" UI on gerrit" [cookbooks] - 10https://gerrit.wikimedia.org/r/823704 (https://phabricator.wikimedia.org/T315360) (owner: 10Ryan Kemper)
[17:36:26] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] ryankemper: add tmux, vim, zsh conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/834369 (owner: 10Ryan Kemper)
[17:40:15] <jinxer-wm>	 (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[17:41:20] <wikibugs>	 10SRE, 10observability, 10Patch-For-Review, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (10bd808)
[17:42:59] <icinga-wm>	 PROBLEM - SSH on mw1312.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:43:05] <icinga-wm>	 RECOVERY - AuthDNS-over-TLS Works on dns4003 is OK: OK: ns[012] kdig DoTLS check success https://wikitech.wikimedia.org/wiki/DNS
[17:43:29] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns4003.wikimedia.org with OS buster
[17:43:36] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns4003.wikimedia.org with OS buster
[17:45:15] <jinxer-wm>	 (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[17:46:53] <icinga-wm>	 PROBLEM - Host 2620:0:863:1:198:35:26:7 is DOWN: PING CRITICAL - Packet loss = 100%
[17:47:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (10cmooney) I will reboot this tomorrow morning, Oct 6th at 08:00 and we can take it from there.
[17:47:33] <sukhe>	 13:46:53 <+icinga-wm> PROBLEM - Host 2620:0:863:1:198:35:26:7 is DOWN: PING CRITICAL - Packet loss = 100%
[17:47:44] <sukhe>	 would have expected the cookbook to downtime it anyway, this is expected
[17:48:05] <volans>	 sukhe: that's a separate host in Icinga terms
[17:48:12] <volans>	 the hostname is '2620:0:863:1:198:35:26:7'
[17:48:16] <sukhe>	 ah! fair 
[17:48:23] <sukhe>	 but I don't remember seeing it last time
[17:48:26] <volans>	 you can though run the downtime cookbook with the option
[17:48:26] <sukhe>	 or maybe I didn't look close enough
[17:48:54] <volans>	 --force (see -h/--help for the explanation)
[17:49:10] <sukhe>	 volans: I am seeing double, can 100% be just me :P
[17:49:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:50:09] <bblack>	 yeah mostly this is an issue with how we define these things in icinga
[17:50:12] <sukhe>	 ^ expected
[17:50:32] <bblack>	 we do it on the IP for a reason, but we could also be creating some kind of dependency link so that downtiming the host affects it
[17:50:42] <bblack>	 (in some cases like this, anyways)
[17:51:41] <icinga-wm>	 PROBLEM - Recursive DNS on 198.35.26.7 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[17:51:49] <icinga-wm>	 PROBLEM - Host 2620:0:863:1:198:35:26:7 is DOWN: PING CRITICAL - Packet loss = 100%
[17:52:05] <icinga-wm>	 RECOVERY - Host cloudvirt1033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.15 ms
[17:52:05] <icinga-wm>	 RECOVERY - Host ps1-c8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.53 ms
[17:52:05] <icinga-wm>	 RECOVERY - Host cloudsw2-c8-eqiad.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[17:52:13] <icinga-wm>	 RECOVERY - Host cloudcephosd1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.14 ms
[17:52:13] <icinga-wm>	 RECOVERY - Host an-tool1010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.62 ms
[17:52:14] <icinga-wm>	 RECOVERY - Host cloudcephosd1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.05 ms
[17:52:19] <icinga-wm>	 RECOVERY - Host cloudgw1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.93 ms
[17:52:19] <icinga-wm>	 RECOVERY - Host cloudvirt1032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.37 ms
[17:52:25] <icinga-wm>	 RECOVERY - Host cloudsw1-c8-eqiad.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[17:53:51] <icinga-wm>	 RECOVERY - Host db1131.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.47 ms
[17:54:13] <icinga-wm>	 RECOVERY - Host deploy1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.16 ms
[17:54:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:55:35] <icinga-wm>	 RECOVERY - Host cloudvirt1031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms
[17:55:53] <icinga-wm>	 RECOVERY - Host cloudcephosd1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.65 ms
[17:55:58] <icinga-wm>	 RECOVERY - Host cloudvirt1035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 10.72 ms
[17:56:07] <icinga-wm>	 RECOVERY - Host cloudcephmon1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.35 ms
[17:56:08] <icinga-wm>	 RECOVERY - Host cloudcephosd1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[17:56:08] <icinga-wm>	 RECOVERY - Host cloudbackup1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 8.12 ms
[17:56:08] <icinga-wm>	 RECOVERY - Host cloudcephosd1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.54 ms
[17:56:09] <icinga-wm>	 RECOVERY - Host cloudcephosd1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.79 ms
[17:56:10] <icinga-wm>	 RECOVERY - Host elastic1059.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.70 ms
[17:56:11] <icinga-wm>	 RECOVERY - Host cloudcephosd1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 13.16 ms
[17:56:11] <icinga-wm>	 RECOVERY - Host cloudcephosd1016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.18 ms
[17:56:13] <icinga-wm>	 RECOVERY - Host cloudcephosd1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.86 ms
[17:56:14] <icinga-wm>	 RECOVERY - Host cloudcephosd1017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.93 ms
[17:56:15] <icinga-wm>	 RECOVERY - Host cloudcephosd1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[17:56:15] <icinga-wm>	 RECOVERY - Host cloudvirt1025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.86 ms
[17:56:16] <icinga-wm>	 RECOVERY - Host cloudvirt1027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 8.58 ms
[17:56:18] <icinga-wm>	 RECOVERY - Host cloudvirt1026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[17:56:18] <icinga-wm>	 RECOVERY - Host cloudvirt1034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.86 ms
[17:56:19] <icinga-wm>	 RECOVERY - Host cloudnet1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[17:56:23] <icinga-wm>	 RECOVERY - Host ganeti1012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[17:56:25] <icinga-wm>	 RECOVERY - Host mw1408.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[17:56:25] <icinga-wm>	 RECOVERY - Host mw1409.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms
[17:56:25] <icinga-wm>	 RECOVERY - Host mw1412.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[17:56:25] <icinga-wm>	 RECOVERY - Host mw1410.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[17:56:25] <icinga-wm>	 RECOVERY - Host mw1411.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[17:56:26] <icinga-wm>	 RECOVERY - Host mw1413.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms
[17:59:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:00:04] <jouncebot>	 ^demon and brennen: OwO what's this, a deployment window?? Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221005T1800). nyaa~
[18:00:04] <jouncebot>	 ^demon and brennen: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221005T1800).
[18:01:49] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4003.wikimedia.org with reason: host reimage
[18:05:19] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4003.wikimedia.org with reason: host reimage
[18:05:57] <icinga-wm>	 PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:07:52] <brennen>	 o/
[18:08:11] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs haproxy: prepare for IP and user-agent blocking [puppet] - 10https://gerrit.wikimedia.org/r/838265 (https://phabricator.wikimedia.org/T319313) (owner: 10Andrew Bogott)
[18:14:37] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (10greg) Approval came in via email:     > Quick approval needed for analytics-private data access >  > Lisa Seitz Gruwell <lgruwell@wikimedia.org> Wed, Oct 5, 2022 at...
[18:17:16] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10cmooney) @aborrero Just something I noticed, you may already be aware in which case ignore.    I was testing out an updated puppet to netbox import script...
[18:17:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Jclark-ctr) @Vgutierrez  Would these 2 changes work for what is needed?  If not we would have to order replacement cables longer lengths to r...
[18:17:34] <wikibugs>	 (03PS1) 10Andrew Bogott: haproxy: correct name of ip blocklist file [puppet] - 10https://gerrit.wikimedia.org/r/838867 (https://phabricator.wikimedia.org/T319313)
[18:18:07] <brennen>	 !log train 1.40.0-wmf.4 (T314193) no current blockers, rolling train to group1
[18:18:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:12] <stashbot>	 T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193
[18:18:17] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838868 (https://phabricator.wikimedia.org/T314193)
[18:18:18] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838868 (https://phabricator.wikimedia.org/T314193) (owner: 10TrainBranchBot)
[18:18:30] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] haproxy: correct name of ip blocklist file [puppet] - 10https://gerrit.wikimedia.org/r/838867 (https://phabricator.wikimedia.org/T319313) (owner: 10Andrew Bogott)
[18:19:05] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838868 (https://phabricator.wikimedia.org/T314193) (owner: 10TrainBranchBot)
[18:22:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:23:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[18:23:41] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.4  refs T314193
[18:23:44] <stashbot>	 T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193
[18:23:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:23:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:24:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:27:22] <logmsgbot>	 !log brennen@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.4  refs T314193 (duration: 03m 40s)
[18:29:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:30:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:30:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:31:01] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns4003.wikimedia.org with OS buster
[18:31:11] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4003.wikimedia.org with OS buster completed: - dns4003 (...
[18:31:27] <wikibugs>	 10SRE, 10MediaWiki-Core-HTTP-Cache, 10Performance-Team, 10Platform Engineering, 10Traffic-Icebox: Separate Cache-Control header for proxy and client - https://phabricator.wikimedia.org/T50835 (10Krinkle) p:05Medium→03Low
[18:31:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:35:37] <wikibugs>	 (03PS3) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832)
[18:43:20] <icinga-wm>	 RECOVERY - SSH on mw1312.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:43:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Vgutierrez) @jclark-ctr as long as both lvs1017 and lvs1020 don't get connectivity from the same switch on a single row is ok. So those look...
[18:45:44] <icinga-wm>	 RECOVERY - Host elastic1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.96 ms
[18:46:34] <icinga-wm>	 RECOVERY - Host an-presto1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.20 ms
[18:46:38] <icinga-wm>	 RECOVERY - Host elastic1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms
[18:47:38] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10serviceops-collab: Q2:rack/setup/install webperf1005.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Dzahn)
[18:52:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10RobH)
[18:52:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10RobH)
[18:56:04] <wikibugs>	 (03PS4) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832)
[19:00:00] <wikibugs>	 (03PS1) 10Andrew Bogott: haproxy add (commented-out) debug log line [puppet] - 10https://gerrit.wikimedia.org/r/838874 (https://phabricator.wikimedia.org/T319313)
[19:07:06] <icinga-wm>	 RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:14:15] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Commons, 10ConfirmEdit (CAPTCHA extension), and 5 others: Make SwiftFileBackend::doStoreInternal defer the opening of file handles to stay in the concurrency limit - https://phabricator.wikimedia.org/T230245 (10aaron) a:05aaron→03None
[19:19:16] <wikibugs>	 10SRE, 10API Platform: Block non-browser requests that use generic agents - https://phabricator.wikimedia.org/T319423 (10daniel)
[19:20:21] <wikibugs>	 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10Dzahn) team membership confirmed per https://www.mediawiki.org/wiki/Platform_Engineering_Team/Data_Value_Stream  ---  @xc...
[19:20:58] <wikibugs>	 10SRE, 10API Platform: Block non-browser requests that use generic agents - https://phabricator.wikimedia.org/T319423 (10daniel)
[19:25:08] <wikibugs>	 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10xcollazo) Thank you @Dzahn!  ( Side note: I have confirmed that we can make the list public if we choose to move it to Go...
[19:27:39] <wikibugs>	 (03PS1) 10AOkoth: admin: add mhorsey to ldap users [puppet] - 10https://gerrit.wikimedia.org/r/838881 (https://phabricator.wikimedia.org/T318729)
[19:27:43] <wikibugs>	 (03PS1) 10Ssingh: wikimedia.org: update CNAME for ntp.ulsfo to dns4003 [dns] - 10https://gerrit.wikimedia.org/r/838882 (https://phabricator.wikimedia.org/T317247)
[19:28:20] <wikibugs>	 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10Dzahn) @xcollazo ITS can create the group and then give admin ship to your team so that you can self-manage it.
[19:30:18] <wikibugs>	 (03PS5) 10BCornwall: ats: Alert on high connection/request count [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815)
[19:30:20] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm, confirmed in Namely, has manager approval, nitpick: add that it's for the wmf group and not other LDAP groups" [puppet] - 10https://gerrit.wikimedia.org/r/838881 (https://phabricator.wikimedia.org/T318729) (owner: 10AOkoth)
[19:30:29] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for MHorsey - https://phabricator.wikimedia.org/T318729 (10Arnoldokoth) 05Open→03In progress p:05Triage→03Medium
[19:30:44] <wikibugs>	 (03CR) 10AOkoth: [C: 03+2] admin: add mhorsey to ldap users [puppet] - 10https://gerrit.wikimedia.org/r/838881 (https://phabricator.wikimedia.org/T318729) (owner: 10AOkoth)
[19:32:05] <wikibugs>	 (03CR) 10BCornwall: [C: 03+1] wikimedia.org: update CNAME for ntp.ulsfo to dns4003 [dns] - 10https://gerrit.wikimedia.org/r/838882 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh)
[19:34:18] <wikibugs>	 (03PS6) 10BCornwall: ats: Alert on high connection/request count [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815)
[19:36:00] <wikibugs>	 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10xcollazo) @Dzahn: we discussed moving the list today and there was concern on whether we could make the content of the li...
[19:39:46] <wikibugs>	 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10Dzahn) @xcollazo There are 2 possible routes you can go. Both result in your team being able to self-manage the list.  a)...
[19:43:15] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] haproxy add (commented-out) debug log line [puppet] - 10https://gerrit.wikimedia.org/r/838874 (https://phabricator.wikimedia.org/T319313) (owner: 10Andrew Bogott)
[19:43:30] <wikibugs>	 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10xcollazo) Ack @Dzahn, thank you for the context and options! Will discuss with team and get back to you.
[19:46:53] <wikibugs>	 (03PS1) 10BCornwall: prometheus: Remove ATS 8-specific metrics [puppet] - 10https://gerrit.wikimedia.org/r/838886
[19:47:15] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for MHorsey - https://phabricator.wikimedia.org/T318729 (10Arnoldokoth) ` aokoth@mwmaint1002:~$ ldapsearch -x cn=wmf | grep "mhorsey" member: uid=mhorsey,ou=people,dc=wikimedia,dc=org `  This is now resolved. Feel free to close the ticket @MHorsey-WMF
[19:51:25] <wikibugs>	 (03CR) 10BCornwall: ats: Alert on high connection/request count (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[19:54:26] <wikibugs>	 (03PS1) 10Jdlrobson: Move horizontal padding from .mw-body to .mw-page-container, improve .mw-page-container styles [skins/Vector] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838818 (https://phabricator.wikimedia.org/T317573)
[19:54:48] <wikibugs>	 (03CR) 10Jdlrobson: [C: 04-1] Move horizontal padding from .mw-body to .mw-page-container, improve .mw-page-container styles [skins/Vector] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838818 (https://phabricator.wikimedia.org/T317573) (owner: 10Jdlrobson)
[19:55:08] <wikibugs>	 (03PS2) 10Jdlrobson: EXPECTED VISUAL CHANGES IN WMF.4 [skins/Vector] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838818 (https://phabricator.wikimedia.org/T317573)
[19:56:27] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37463/registry2004.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/838859 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall)
[19:56:55] <logmsgbot>	 !log reedy@deploy1002 Started deploy [integration/docroot@09eb565]: T319461 and cleanup
[19:56:58] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for "Stef Dunlap" - https://phabricator.wikimedia.org/T318626 (10Arnoldokoth) 05Open→03In progress p:05Triage→03Medium
[19:56:59] <stashbot>	 T319461: Add "last updated" timestamp to test coverage index pages - https://phabricator.wikimedia.org/T319461
[19:57:05] <logmsgbot>	 !log reedy@deploy1002 Finished deploy [integration/docroot@09eb565]: T319461 and cleanup (duration: 00m 10s)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, and TheresNoTime: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221005T2000).
[20:00:05] <jouncebot>	 danisztls and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:22] <wikibugs>	 (03PS1) 10AOkoth: admin: add kindrobot to wmf group [puppet] - 10https://gerrit.wikimedia.org/r/838894 (https://phabricator.wikimedia.org/T318626)
[20:00:38] <urbanecm>	 I can deploy!
[20:01:26] <urbanecm>	 i don't see danisztls here?
[20:01:28] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "[cumin2002:~] $ sudo cumin 'C:jwt_authorizer' 'date'" [puppet] - 10https://gerrit.wikimedia.org/r/838859 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall)
[20:02:19] <urbanecm>	 hi danisztls 
[20:02:19] <danisztls>	 o/
[20:02:27] <danisztls>	 urbanecm: hi
[20:03:18] <mutante>	 !log registry* (4 servers) - disabling puppet, deploying gerrit:838859 - T308501
[20:03:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:23] <stashbot>	 T308501: Authenticate trusted runners for registry access against GitLab using temporary JSON Web Token - https://phabricator.wikimedia.org/T308501
[20:03:32] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] jwt_authorizer: Start service as configured owner/group [puppet] - 10https://gerrit.wikimedia.org/r/838859 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall)
[20:03:48] <wikibugs>	 (03PS7) 10Urbanecm: Deploy Research Incentive survey on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834044 (https://phabricator.wikimedia.org/T318331) (owner: 10DDesouza)
[20:03:51] <dduvall>	 mutante: ty :)
[20:03:53] <wikibugs>	 (03PS3) 10DDesouza: Remove Research Incentive survey from arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837695 (https://phabricator.wikimedia.org/T318328)
[20:03:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834044 (https://phabricator.wikimedia.org/T318331) (owner: 10DDesouza)
[20:03:57] <dancy>	 yay!
[20:04:42] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy Research Incentive survey on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834044 (https://phabricator.wikimedia.org/T318331) (owner: 10DDesouza)
[20:05:05] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:834044|Deploy Research Incentive survey on eswiki (T318331)]]
[20:05:10] <stashbot>	 T318331: Deploy Research Incentive Survey on Spanish Wikipedia - https://phabricator.wikimedia.org/T318331
[20:05:13] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] wikimedia.org: update CNAME for ntp.ulsfo to dns4003 [dns] - 10https://gerrit.wikimedia.org/r/838882 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh)
[20:05:14] <mutante>	 dduvall: dancy: deployed on registry1003.. now others.. in progress
[20:05:31] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and dani: Backport for [[gerrit:834044|Deploy Research Incentive survey on eswiki (T318331)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[20:05:37] * dduvall holds breath
[20:05:43] * dancy twitches
[20:05:47] <urbanecm>	 danisztls: your patch is at mwdebug1001 (and others), can you check?
[20:05:53] <danisztls>	 urbanecm: yes
[20:06:42] <urbanecm>	 okay, waiting :)
[20:07:04] <danisztls>	 urbanecm: eswiki good, arwiki still seeing survey
[20:07:13] <urbanecm>	 so, sync? :)
[20:07:33] <danisztls>	 yes
[20:07:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:07:54] <urbanecm>	 okay, doing :)
[20:08:12] <mutante>	 dduvall: dancy:
[20:08:13] <mutante>	 (4) registry[2003-2004].codfw.wmnet,registry[1003-1004].eqiad.wmnet
[20:08:16] <mutante>	 ----- OUTPUT of 'ps aux | grep jw...| cut -f1 -d " "' -----
[20:08:19] <mutante>	 www-data
[20:08:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:08:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:08:42] <dduvall>	 mutante: nice. and the ownership of `/var/run/nginx-auth/jwt.sock`?
[20:08:48] <dancy>	 one step closer.
[20:09:12] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "after deploying the process runs as www-data on 4 hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/838859 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall)
[20:09:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] EXPECTED VISUAL CHANGES IN WMF.4 [skins/Vector] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838818 (https://phabricator.wikimedia.org/T317573) (owner: 10Jdlrobson)
[20:09:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:10:24] <mutante>	 dduvall: www-data www-data 
[20:10:46] <dancy>	 Let the buildings begin!!
[20:10:47] <dancy>	 maybe
[20:10:48] <dduvall>	 excellent! thank you
[20:10:56] <dduvall>	 i see new errors :)
[20:11:05] <mutante>	 that's always good :)
[20:11:10] <dancy>	 dzahn, https://meet.google.com/sut-zxhw-jqy ?
[20:11:12] <dduvall>	 i'll head over to #wikimedia-gitlab
[20:11:25] <wikibugs>	 (03PS4) 10Urbanecm: Remove Research Incentive survey from arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837695 (https://phabricator.wikimedia.org/T318328) (owner: 10DDesouza)
[20:11:30] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Remove Research Incentive survey from arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837695 (https://phabricator.wikimedia.org/T318328) (owner: 10DDesouza)
[20:11:57] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:834044|Deploy Research Incentive survey on eswiki (T318331)]] (duration: 06m 51s)
[20:12:02] <stashbot>	 T318331: Deploy Research Incentive Survey on Spanish Wikipedia - https://phabricator.wikimedia.org/T318331
[20:12:43] <urbanecm>	 first patch's deployed
[20:12:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837695 (https://phabricator.wikimedia.org/T318328) (owner: 10DDesouza)
[20:13:42] <wikibugs>	 (03Merged) 10jenkins-bot: Remove Research Incentive survey from arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837695 (https://phabricator.wikimedia.org/T318328) (owner: 10DDesouza)
[20:14:05] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:837695|Remove Research Incentive survey from arwiki (T318328)]]
[20:14:09] <stashbot>	 T318328: Deploy Research Incentive Survey on Arabic Wikipedia - https://phabricator.wikimedia.org/T318328
[20:14:28] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and dani: Backport for [[gerrit:837695|Remove Research Incentive survey from arwiki (T318328)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[20:14:40] <wikibugs>	 (CR) Dzahn: [C: +1] "looks good to me, confirmed in Namely" [puppet] - https://gerrit.wikimedia.org/r/838894 (https://phabricator.wikimedia.org/T318626) (owner: AOkoth)
[20:14:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:14:48] <urbanecm>	 danisztls: can you check the second patch please?
[20:14:53] <danisztls>	 urbanecm: sure
[20:15:03] <wikibugs>	 (CR) Jforrester: [C: +1] Add registry.gitlab.com/security-products/**/* as allowed images [puppet] - https://gerrit.wikimedia.org/r/838194 (https://phabricator.wikimedia.org/T312961) (owner: SBassett)
[20:15:07] <danisztls>	 urbanecm: good
[20:15:16] <urbanecm>	 okay, syncing
[20:15:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:15:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:16:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:19:16] <logmsgbot>	 !log reedy@deploy1002 Started deploy [integration/docroot@a136ce6]: More minor cleanup
[20:19:19] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:837695|Remove Research Incentive survey from arwiki (T318328)]] (duration: 05m 13s)
[20:19:24] <stashbot>	 T318328: Deploy Research Incentive Survey on Arabic Wikipedia - https://phabricator.wikimedia.org/T318328
[20:19:43] <urbanecm>	 danisztls: and the other patch's done!
[20:19:54] <danisztls>	 urbanecm: thanks!
[20:19:56] <urbanecm>	 ebernhardson: hi, if you want to self-service, feel free to go ahead!
[20:19:59] <logmsgbot>	 !log reedy@deploy1002 Finished deploy [integration/docroot@a136ce6]: More minor cleanup (duration: 00m 42s)
[20:20:04] <urbanecm>	 i can also deploy for you if you want me to.
[20:21:17] <Reedy>	 Did it though?
[20:21:31] <wikibugs>	 (CR) Ssingh: [C: +2] sites.yaml: add dns4003 to anycast_neighbors [homer/public] - https://gerrit.wikimedia.org/r/838239 (https://phabricator.wikimedia.org/T317247) (owner: Ssingh)
[20:21:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:22:06] <logmsgbot>	 !log reedy@deploy1002 Started deploy [integration/docroot@a136ce6]: More minor cleanup
[20:22:37] <logmsgbot>	 !log reedy@deploy1002 Finished deploy [integration/docroot@a136ce6]: More minor cleanup (duration: 00m 31s)
[20:22:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:22:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:22:50] <Reedy>	 How did it finish if it rolled back?
[20:22:55] <wikibugs>	 (Merged) jenkins-bot: sites.yaml: add dns4003 to anycast_neighbors [homer/public] - https://gerrit.wikimedia.org/r/838239 (https://phabricator.wikimedia.org/T317247) (owner: Ssingh)
[20:23:55] <logmsgbot>	 !log reedy@deploy1002 Started deploy [integration/docroot@a136ce6]: More minor cleanup
[20:24:01] <logmsgbot>	 !log reedy@deploy1002 Finished deploy [integration/docroot@a136ce6]: More minor cleanup (duration: 00m 06s)
[20:25:02] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:25:23] <sukhe>	 !log homer "cr*-ulsfo*" commit "Gerrit 838239: sites.yaml: add dns4003 to anycast_neighbors"
[20:25:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:53] <wikibugs>	 (CR) AOkoth: [C: +2] admin: add kindrobot to wmf group [puppet] - https://gerrit.wikimedia.org/r/838894 (https://phabricator.wikimedia.org/T318626) (owner: AOkoth)
[20:26:34] <logmsgbot>	 !log reedy@deploy1002 Started deploy [integration/docroot@a136ce6]: More minor cleanup
[20:26:36] <wikibugs>	 (CR) Ssingh: [C: +2] wikimedia.org: update CNAME for ntp.ulsfo to dns4003 [dns] - https://gerrit.wikimedia.org/r/838882 (https://phabricator.wikimedia.org/T317247) (owner: Ssingh)
[20:26:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:26:45] <logmsgbot>	 !log reedy@deploy1002 Finished deploy [integration/docroot@a136ce6]: More minor cleanup (duration: 00m 10s)
[20:27:22] <sukhe>	 !log running authdns-update for CR 838882
[20:27:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:20] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmf for "Stef Dunlap" - https://phabricator.wikimedia.org/T318626 (Arnoldokoth) ` aokoth@mwmaint1002:~$ ldapsearch -x cn=wmf | grep "kindrobot" member: uid=kindrobot,ou=people,dc=wikimedia,dc=org `  This is resolved now. Feel free to close...
[20:30:58] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:31:24] <icinga-wm>	 PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:31:46] <James_F>	 It's still on scap/sync/2022-08-19/0001
[20:32:09] <James_F>	 https://sal.toolforge.org/log/FLHtsYIBa_6PSCT9m3mW
[20:32:18] <logmsgbot>	 !log reedy@deploy1002 Started deploy [integration/docroot@a136ce6]: More minor cleanup
[20:32:31] <James_F>	 Bah, wrong channel.
[20:33:23] <logmsgbot>	 !log reedy@deploy1002 Finished deploy [integration/docroot@a136ce6]: More minor cleanup (duration: 01m 05s)
[20:34:35] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (Arnoldokoth)
[20:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[20:38:28] <wikibugs>	 (PS1) AOkoth: admin: add sstefanova to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/838900 (https://phabricator.wikimedia.org/T318807)
[20:38:55] <wikibugs>	 (PS2) AOkoth: admin: add sstefanova to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/838900 (https://phabricator.wikimedia.org/T318807)
[20:40:57] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (Arnoldokoth)
[20:41:35] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (Arnoldokoth)
[20:45:09] <wikibugs>	 (PS1) AOkoth: admin: add gjg to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/838902 (https://phabricator.wikimedia.org/T318873)
[20:45:47] <wikibugs>	 (CR) Dzahn: [C: +1] "lgtm, has approval on ticket from ottomata" [puppet] - https://gerrit.wikimedia.org/r/838900 (https://phabricator.wikimedia.org/T318807) (owner: AOkoth)
[20:46:02] <wikibugs>	 (CR) AOkoth: [C: +2] admin: add sstefanova to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/838900 (https://phabricator.wikimedia.org/T318807) (owner: AOkoth)
[20:48:11] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (Arnoldokoth) @Slst2020 This is resolved. Feel free to close the ticket if everything is good on your end.
[20:48:21] <wikibugs>	 (PS2) AOkoth: admin: add gjg to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/838902 (https://phabricator.wikimedia.org/T318873)
[20:49:34] <wikibugs>	 (CR) Dzahn: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/838902 (https://phabricator.wikimedia.org/T318873) (owner: AOkoth)
[21:01:25] <wikibugs>	 SRE, Infrastructure-Foundations, Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (Dzahn)
[21:02:30] <wikibugs>	 (PS1) Andrew Bogott: Openstack Keystone: Expose the Keystone public API [puppet] - https://gerrit.wikimedia.org/r/838903 (https://phabricator.wikimedia.org/T319312)
[21:02:32] <wikibugs>	 (PS1) Andrew Bogott: Openstack Nova: Expose the Nova public API [puppet] - https://gerrit.wikimedia.org/r/838904 (https://phabricator.wikimedia.org/T319312)
[21:02:34] <wikibugs>	 (PS1) Andrew Bogott: Openstack Glance: Expose the Glance public API [puppet] - https://gerrit.wikimedia.org/r/838905 (https://phabricator.wikimedia.org/T319312)
[21:02:36] <wikibugs>	 (PS1) Andrew Bogott: Openstack Cinder: Expose the Cinder public API [puppet] - https://gerrit.wikimedia.org/r/838906 (https://phabricator.wikimedia.org/T319312)
[21:02:38] <wikibugs>	 (PS1) Andrew Bogott: Openstack Neutron: Expose the Neutron public API [puppet] - https://gerrit.wikimedia.org/r/838907 (https://phabricator.wikimedia.org/T319312)
[21:02:42] <wikibugs>	 (PS1) Andrew Bogott: Openstack Designate: Expose the Designate public API [puppet] - https://gerrit.wikimedia.org/r/838908 (https://phabricator.wikimedia.org/T319312)
[21:03:07] <wikibugs>	 (CR) CI reject: [V: -1] Openstack Keystone: Expose the Keystone public API [puppet] - https://gerrit.wikimedia.org/r/838903 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[21:03:28] <wikibugs>	 (CR) CI reject: [V: -1] Openstack Nova: Expose the Nova public API [puppet] - https://gerrit.wikimedia.org/r/838904 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[21:06:42] <wikibugs>	 (PS2) Andrew Bogott: Openstack Keystone: Expose the Keystone public API [puppet] - https://gerrit.wikimedia.org/r/838903 (https://phabricator.wikimedia.org/T319312)
[21:06:44] <wikibugs>	 (PS2) Andrew Bogott: Openstack Nova: Expose the Nova public API [puppet] - https://gerrit.wikimedia.org/r/838904 (https://phabricator.wikimedia.org/T319312)
[21:06:46] <wikibugs>	 (PS2) Andrew Bogott: Openstack Glance: Expose the Glance public API [puppet] - https://gerrit.wikimedia.org/r/838905 (https://phabricator.wikimedia.org/T319312)
[21:06:48] <wikibugs>	 (PS2) Andrew Bogott: Openstack Cinder: Expose the Cinder public API [puppet] - https://gerrit.wikimedia.org/r/838906 (https://phabricator.wikimedia.org/T319312)
[21:06:50] <wikibugs>	 (PS2) Andrew Bogott: Openstack Neutron: Expose the Neutron public API [puppet] - https://gerrit.wikimedia.org/r/838907 (https://phabricator.wikimedia.org/T319312)
[21:06:52] <wikibugs>	 (PS2) Andrew Bogott: Openstack Designate: Expose the Designate public API [puppet] - https://gerrit.wikimedia.org/r/838908 (https://phabricator.wikimedia.org/T319312)
[21:11:18] <wikibugs>	 (CR) Dzahn: "@claime Would like to chat about this one" [puppet] - https://gerrit.wikimedia.org/r/825753 (https://phabricator.wikimedia.org/T241207) (owner: Dzahn)
[21:11:29] <wikibugs>	 (PS2) Dzahn: scap/dsh: remove parsoid service, replaced by parsoid-php [puppet] - https://gerrit.wikimedia.org/r/825753 (https://phabricator.wikimedia.org/T241207)
[21:13:16] <wikibugs>	 (CR) AOkoth: [C: +2] admin: add gjg to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/838902 (https://phabricator.wikimedia.org/T318873) (owner: AOkoth)
[21:15:06] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (Arnoldokoth) @greg This is resolved. Feel free to close the ticket if everything is good on your side.
[21:16:18] <wikibugs>	 (CR) Dzahn: [C: +2] gitlab_runner: enable unprivileged_userns_clone in WMCS [puppet] - https://gerrit.wikimedia.org/r/835162 (https://phabricator.wikimedia.org/T307810) (owner: Jelto)
[21:16:56] <wikibugs>	 (CR) Dzahn: [C: +2] "deployed since it's already cherry-picked and multiple +1s" [puppet] - https://gerrit.wikimedia.org/r/835162 (https://phabricator.wikimedia.org/T307810) (owner: Jelto)
[21:18:12] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (Arnoldokoth) Hey @Ottomata @karapayneWMDE Kindly approve.
[21:18:25] <wikibugs>	 (CR) Dzahn: "thanks @akosiaris!  @claime This is the one I would like to deploy first." [puppet] - https://gerrit.wikimedia.org/r/826884 (https://phabricator.wikimedia.org/T316296) (owner: Dzahn)
[21:19:35] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (Arnoldokoth)
[21:25:01] <wikibugs>	 (PS1) BCornwall: prometheus: Add records for ATS percent usage [puppet] - https://gerrit.wikimedia.org/r/838911 (https://phabricator.wikimedia.org/T292815)
[21:26:06] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:27:58] <wikibugs>	 (PS1) Dzahn: vrts: allow installing a local mariadb server in cloud [puppet] - https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059)
[21:28:33] <wikibugs>	 (CR) CI reject: [V: -1] vrts: allow installing a local mariadb server in cloud [puppet] - https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) (owner: Dzahn)
[21:29:50] <wikibugs>	 (PS2) Dzahn: vrts: allow installing a local mariadb server in cloud [puppet] - https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059)
[21:30:58] <wikibugs>	 (CR) CI reject: [V: -1] vrts: allow installing a local mariadb server in cloud [puppet] - https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) (owner: Dzahn)
[21:33:55] <wikibugs>	 (PS3) Dzahn: vrts: allow installing a local mariadb server in cloud [puppet] - https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059)
[21:39:36] <wikibugs>	 (PS1) Dzahn: lower TTL for gerrit,gerrit-replica from 600 to 300 [dns] - https://gerrit.wikimedia.org/r/838915 (https://phabricator.wikimedia.org/T315319)
[21:41:00] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.26.0" for 559 hosts
[21:41:02] <wikibugs>	 (PS1) Dzahn: lower TTL for phabricator from 600 to 300 [dns] - https://gerrit.wikimedia.org/r/838916 (https://phabricator.wikimedia.org/T315319)
[21:41:17] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.26.0" completed for 559 hosts
[21:45:28] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (Ottomata) Approved!
[21:47:50] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1002/37464/otrs1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) (owner: Dzahn)
[22:15:14] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:17:02] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.27.0" for 559 hosts
[22:17:19] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.27.0" completed for 559 hosts
[22:17:52] <logmsgbot>	 !log dancy@deploy1002 Started deploy [integration/docroot@a136ce6]: (no justification provided)
[22:18:03] <logmsgbot>	 !log dancy@deploy1002 Finished deploy [integration/docroot@a136ce6]: (no justification provided) (duration: 00m 10s)
[22:19:20] <logmsgbot>	 !log reedy@deploy1002 Started deploy [integration/docroot@a136ce6]: Cleanup and timestamps
[22:19:43] <logmsgbot>	 !log reedy@deploy1002 deploy aborted: Cleanup and timestamps (duration: 00m 22s)
[22:21:11] <logmsgbot>	 !log reedy@deploy1002 Started deploy [integration/docroot@a136ce6]: (no justification provided)
[22:21:17] <logmsgbot>	 !log reedy@deploy1002 Finished deploy [integration/docroot@a136ce6]: (no justification provided) (duration: 00m 06s)
[22:21:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:23:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[22:26:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:27:14] <logmsgbot>	 !log reedy@deploy1002 Started deploy [integration/docroot@a136ce6]: Cleanup and timestamps
[22:27:22] <logmsgbot>	 !log reedy@deploy1002 Finished deploy [integration/docroot@a136ce6]: Cleanup and timestamps (duration: 00m 07s)
[22:31:22] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:34:02] <icinga-wm>	 PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:41:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:41:16] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Greg Grossmeier - https://phabricator.wikimedia.org/T318873 (greg) Open→Resolved a:Arnoldokoth Looks to be working, thanks @Arnoldokoth !
[22:46:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:29:24] <icinga-wm>	 RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:48:24] <wikibugs>	 (PS1) BryanDavis: php74: add many TTF fonts [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/838939 (https://phabricator.wikimedia.org/T310435)
[23:59:46] <wikibugs>	 (CR) BryanDavis: [C: +2] "In local testing this change adds about 400MiB to the uncompressed image size. That's about a 50% increase in total size, but I think that" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/838939 (https://phabricator.wikimedia.org/T310435) (owner: BryanDavis)