[00:03:25] <icinga-wm>	 PROBLEM - Host cloudcephmon1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[00:03:55] <icinga-wm>	 PROBLEM - Host cloudvirt1022.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[00:04:07] <icinga-wm>	 PROBLEM - Host cp1081.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[00:06:19] <icinga-wm>	 PROBLEM - Host lvs1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[00:09:34] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[00:19:59] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:20:41] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:26:35] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:27:19] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.227 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:30:15] <icinga-wm>	 RECOVERY - Host cloudcephmon1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.08 ms
[00:33:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[00:37:33] <wikibugs>	 (CR) Ori: [C: +1] systemd::override: Add new helper define for overrides [puppet] - https://gerrit.wikimedia.org/r/841577 (owner: Jbond)
[00:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:39:05] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:45:59] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:36:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:38] <icinga-wm>	 PROBLEM - Host ores1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:46:42] <icinga-wm>	 PROBLEM - Host mw1316.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:46:44] <icinga-wm>	 PROBLEM - Host mw1314.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:46:44] <icinga-wm>	 PROBLEM - Host an-worker1098.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:46:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:52] <icinga-wm>	 PROBLEM - Host analytics1073.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:48:08] <icinga-wm>	 PROBLEM - Host ps1-b7-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[01:48:16] <icinga-wm>	 PROBLEM - Host ms-be1041.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:48:18] <icinga-wm>	 PROBLEM - Host ms-be1053.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:49:42] <icinga-wm>	 PROBLEM - Host clouddumps1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:49:43] <icinga-wm>	 PROBLEM - Host cloudcephosd1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:49:48] <icinga-wm>	 PROBLEM - Host cloudvirt1017.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:49:52] <icinga-wm>	 PROBLEM - Host cloudvirt1020.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:50:10] <icinga-wm>	 PROBLEM - Host cp1082.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:50:12] <icinga-wm>	 PROBLEM - Host elastic1086.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:50:12] <icinga-wm>	 PROBLEM - Host elastic1085.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:50:32] <icinga-wm>	 PROBLEM - Host dbprov1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:50:46] <icinga-wm>	 PROBLEM - Host restbase-dev1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:02] <icinga-wm>	 PROBLEM - Host an-worker1087.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:12] <icinga-wm>	 PROBLEM - Host an-worker1130.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:36] <icinga-wm>	 PROBLEM - Host kafka-main1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:56] <icinga-wm>	 PROBLEM - Host clouddb1016.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:52:08] <icinga-wm>	 PROBLEM - Host lvs1018.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:52:10] <icinga-wm>	 PROBLEM - Host mw1313.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:52:10] <icinga-wm>	 PROBLEM - Host mw1315.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:03:07] <icinga-wm>	 RECOVERY - Host kafka-main1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms
[02:03:37] <icinga-wm>	 RECOVERY - Host lvs1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.09 ms
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:21:52] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/840145 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[02:29:51] <icinga-wm>	 RECOVERY - Host cloudcephosd1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.07 ms
[02:30:25] <icinga-wm>	 RECOVERY - Host elastic1085.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.10 ms
[02:30:25] <icinga-wm>	 RECOVERY - Host elastic1086.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.12 ms
[02:30:31] <icinga-wm>	 RECOVERY - Host dbprov1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.10 ms
[02:30:55] <icinga-wm>	 RECOVERY - Host an-worker1130.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms
[02:31:13] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:04:07] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Openstack Nova: Expose the Nova public API [puppet] - https://gerrit.wikimedia.org/r/838904 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[03:04:24] <wikibugs>	 (PS3) Andrew Bogott: Openstack Nova: Expose the Nova public API [puppet] - https://gerrit.wikimedia.org/r/838904 (https://phabricator.wikimedia.org/T319312)
[03:20:59] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Openstack Glance: Expose the Glance public API [puppet] - https://gerrit.wikimedia.org/r/838905 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[03:21:07] <wikibugs>	 (PS3) Andrew Bogott: Openstack Glance: Expose the Glance public API [puppet] - https://gerrit.wikimedia.org/r/838905 (https://phabricator.wikimedia.org/T319312)
[03:21:22] <wikibugs>	 (PS3) Andrew Bogott: Openstack Cinder: Expose the Cinder public API [puppet] - https://gerrit.wikimedia.org/r/838906 (https://phabricator.wikimedia.org/T319312)
[03:24:32] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Openstack Cinder: Expose the Cinder public API [puppet] - https://gerrit.wikimedia.org/r/838906 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[04:02:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[04:07:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[04:09:34] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:33:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[04:36:05] <icinga-wm>	 PROBLEM - DNS on lvs1018.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.209 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:39:45] <icinga-wm>	 PROBLEM - DNS on elastic1086.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.223 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:39:45] <icinga-wm>	 PROBLEM - DNS on elastic1085.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.222 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:46:51] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:51:43] <icinga-wm>	 PROBLEM - DNS on an-worker1130.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.0.156 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:56:53] <icinga-wm>	 PROBLEM - DNS on kafka-main1002.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.3.130 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:04:23] <icinga-wm>	 PROBLEM - DNS on dbprov1002.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.3.18 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:34:57] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:16:51] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 4826
[06:17:57] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 4826
[06:36:07] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:57:47] <wikibugs>	 (PS2) Muehlenhoff: opensearch: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/838833 (https://phabricator.wikimedia.org/T308013)
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T0700). Please do the needful.
[07:00:05] <jouncebot>	 matthiasmullie: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:08] <matthiasmullie>	 o/
[07:00:14] <urbanecm>	 o/
[07:00:28] <urbanecm>	 i can deploy, unless matthiasmullie wants to self-serve?
[07:00:43] <matthiasmullie>	 either works for me :p
[07:01:31] <urbanecm>	 matthiasmullie: go ahead then :D
[07:01:39] <matthiasmullie>	 starting!
[07:01:55] <wikibugs>	 (CR) Muehlenhoff: [C: +2] opensearch: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/838833 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:01:57] <urbanecm>	 matthiasmullie: fyi, we've a new deployment tool. `scap backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/841515` will take care of everything for you
[07:01:59] <matthiasmullie>	 oh, right, new scap scripts!
[07:02:25] <urbanecm>	 yep yep
[07:02:39] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by mlitn@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841515 (https://phabricator.wikimedia.org/T320406) (owner: Matthias Mullie)
[07:03:19] <wikibugs>	 (PS2) Muehlenhoff: maps: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840139 (https://phabricator.wikimedia.org/T308013)
[07:05:08] <matthiasmullie>	 urbanecm: out of curiosity - I notice scap now handles merging the patch as well; what happens with patches that are already merged, or already +2ed and being merged soon?
[07:05:38] <wikibugs>	 SRE, ops-eqiad, Data-Engineering: Check analytics1086 mgmt's cable - https://phabricator.wikimedia.org/T320458 (elukey) Same thing this morning:  ` elukey@cumin1001:~$ sudo ipmitool -I lanplus -H "an-worker1086.mgmt.eqiad.wmnet" -U root -E chassis power status Unable to read password from environment...
[07:05:48] <matthiasmullie>	 Asking because some repos have CI that takes forever and I often +2ed half an hour in advance so it doesn't take up most of the deployment window
[07:09:07] <icinga-wm>	 PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:19:14] <wikibugs>	 (Merged) jenkins-bot: Rescale images based on width alone [core] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841515 (https://phabricator.wikimedia.org/T320406) (owner: Matthias Mullie)
[07:19:49] <logmsgbot>	 !log mlitn@deploy1002 Started scap: Backport for [[gerrit:841515|Rescale images based on width alone (T320406)]]
[07:19:54] <stashbot>	 T320406: Thumbnails on SpecialSearch may fail to load - https://phabricator.wikimedia.org/T320406
[07:20:19] <logmsgbot>	 !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:841515|Rescale images based on width alone (T320406)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[07:25:09] <logmsgbot>	 !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:841515|Rescale images based on width alone (T320406)]] (duration: 05m 19s)
[07:25:14] <stashbot>	 T320406: Thumbnails on SpecialSearch may fail to load - https://phabricator.wikimedia.org/T320406
[07:25:46] <matthiasmullie>	 !log UTC morning backports done
[07:25:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:38:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:40:58] <wikibugs>	 (CR) JMeybohm: [C: -1] Add a new production images for spark and spark-operator (9 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[07:46:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:51:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:55:05] <wikibugs>	 (CR) JMeybohm: [C: +1] "Sounds right." [docker-images/production-images] - https://gerrit.wikimedia.org/r/841477 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[07:59:10] <wikibugs>	 SRE, GitLab, Infrastructure-Foundations, serviceops-collab, CAS-SSO: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto)
[08:01:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of testvm2001.codfw.wmnet to plain
[08:02:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of testvm2001.codfw.wmnet to plain
[08:02:34] <wikibugs>	 (PS2) Muehlenhoff: logstash: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840145 (https://phabricator.wikimedia.org/T308013)
[08:04:57] <wikibugs>	 (CR) CI reject: [V: -1] logstash: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840145 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:07:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of testvm2001.codfw.wmnet to drbd
[08:08:09] <wikibugs>	 (PS1) Filippo Giunchedi: hieradata: clean up ganeti4001 references [puppet] - https://gerrit.wikimedia.org/r/841853
[08:09:34] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[08:11:07] <wikibugs>	 (CR) WMDE-Fisch: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/841854 (https://phabricator.wikimedia.org/T316782) (owner: WMDE-Fisch)
[08:16:05] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of testvm2001.codfw.wmnet to drbd
[08:18:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagetcd1004.eqiad.wmnet to drbd
[08:26:09] <wikibugs>	 (PS1) Vgutierrez: trafficserver: Partition cache in one server per DC and cluster #5 [puppet] - https://gerrit.wikimedia.org/r/841856 (https://phabricator.wikimedia.org/T317748)
[08:27:59] <wikibugs>	 (Abandoned) Muehlenhoff: Switch profile::base::linux510 to the new meta package [puppet] - https://gerrit.wikimedia.org/r/841134 (https://phabricator.wikimedia.org/T319067) (owner: Muehlenhoff)
[08:28:21] <wikibugs>	 (Abandoned) Muehlenhoff: Make ganeti1032 a ganeti node [puppet] - https://gerrit.wikimedia.org/r/841127 (https://phabricator.wikimedia.org/T299459) (owner: Muehlenhoff)
[08:28:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagetcd1004.eqiad.wmnet to drbd
[08:30:30] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM. 4003 will also be taken down soonish, but by then a replacement host in this rack should be present." [puppet] - https://gerrit.wikimedia.org/r/841853 (owner: Filippo Giunchedi)
[08:31:26] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] "*nod* thanks for the quick review!" [puppet] - https://gerrit.wikimedia.org/r/841853 (owner: Filippo Giunchedi)
[08:33:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[08:33:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to drbd
[08:34:26] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37512/console" [puppet] - https://gerrit.wikimedia.org/r/841856 (https://phabricator.wikimedia.org/T317748) (owner: Vgutierrez)
[08:36:57] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] trafficserver: Partition cache in one server per DC and cluster #5 [puppet] - https://gerrit.wikimedia.org/r/841856 (https://phabricator.wikimedia.org/T317748) (owner: Vgutierrez)
[08:37:27] <wikibugs>	 ops-eqiad, Infrastructure-Foundations, netops: Investigate issue with msw-b7-eqiad - https://phabricator.wikimedia.org/T320598 (cmooney) p:Triage→Medium
[08:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:38:48] <wikibugs>	 ops-eqiad, Infrastructure-Foundations, netops: Investigate issue with msw-b7-eqiad - https://phabricator.wikimedia.org/T320598 (cmooney)
[08:42:28] <wikibugs>	 (CR) Awight: [C: +1] Enable show nearby feature on a small group of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/841854 (https://phabricator.wikimedia.org/T316782) (owner: WMDE-Fisch)
[08:43:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to drbd
[08:43:41] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:43:59] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms
[08:48:57] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-etcd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:49:31] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (MoritzMuehlenhoff)
[08:50:22] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on restbase-dev1005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:50:04. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:50:22] <icinga-wm>	 ACKNOWLEDGEMENT - Host restbase-dev1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:50:04.
[08:50:22] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-b7-eqiad-infeed-load-tower-B-phase-Z on ps1-b7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:50:04. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:50:22] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-b7-eqiad-infeed-load-tower-B-phase-Y on ps1-b7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:50:04. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:50:22] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-b7-eqiad-infeed-load-tower-B-phase-X on ps1-b7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:50:04. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:50:22] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-b7-eqiad-infeed-load-tower-A-phase-Z on ps1-b7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:50:04. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:52:02] <vgutierrez>	 !log partitioning the ATS cache in cp[2033-2034], cp[6003,6011], cp[1081-1082], cp[5004,5010], cp[3056-3057], cp[4024,4028] - T317748
[08:52:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:07] <stashbot>	 T317748: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748
[08:52:11] <icinga-wm>	 ACKNOWLEDGEMENT - DNS on elastic1085.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.222 ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:51:48. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:52:11] <icinga-wm>	 ACKNOWLEDGEMENT - DNS on elastic1086.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.223 ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:51:48. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:52:11] <icinga-wm>	 ACKNOWLEDGEMENT - DNS on kafka-main1002.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.3.130 ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:51:48. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:52:11] <icinga-wm>	 ACKNOWLEDGEMENT - DNS on lvs1018.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.209 ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:51:48. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:52:11] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:51:48. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:53:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagetcd1004.eqiad.wmnet to plain
[08:54:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagetcd1004.eqiad.wmnet to plain
[08:54:45] <icinga-wm>	 ACKNOWLEDGEMENT - DNS on an-worker1130.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.0.156 ayounsi https://phabricator.wikimedia.org/T320598 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:54:45] <icinga-wm>	 ACKNOWLEDGEMENT - DNS on dbprov1002.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.3.18 ayounsi https://phabricator.wikimedia.org/T320598 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:56:26] <wikibugs>	 (CR) JMeybohm: [C: -1] thumbor: new service chart (20 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[08:56:48] <hoo>	 _joe_: From my side https://gerrit.wikimedia.org/r/c/operations/puppet/+/841148 is ready to be merged now
[08:57:02] <hoo>	 I'm not entirely sure my rebase is correct
[08:58:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to plain
[08:59:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to plain
[09:00:23] <_joe_>	 hoo: I'll take a look when I have a minute, thanks
[09:01:27] <hoo>	 Thanks :)
[09:02:15] <icinga-wm>	 RECOVERY - Check systemd state on dse-k8s-etcd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:05:03] <wikibugs>	 (PS2) Urbanecm: SVG resources: Run svgo [mediawiki-config] - https://gerrit.wikimedia.org/r/841187 (https://phabricator.wikimedia.org/T320447)
[09:05:25] <urbanecm>	 jouncebot: nowandnext
[09:05:25] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 54 minute(s)
[09:05:25] <jouncebot>	 In 3 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1300)
[09:05:28] <jayme>	 !log disabling puppet on all kubernetes masters (incl. ml & dse)
[09:05:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:53] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/841187 (https://phabricator.wikimedia.org/T320447) (owner: Urbanecm)
[09:06:05] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] kubernetes::master remove apiserver_count [puppet] - https://gerrit.wikimedia.org/r/841463 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[09:06:11] <wikibugs>	 (CR) JMeybohm: [C: +2] kubernetes::master fail if user tokens are not unique [puppet] - https://gerrit.wikimedia.org/r/841495 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[09:06:21] <wikibugs>	 (PS5) JMeybohm: kubernetes::master fail if user tokens are not unique [puppet] - https://gerrit.wikimedia.org/r/841495 (https://phabricator.wikimedia.org/T307943)
[09:06:38] <wikibugs>	 (Merged) jenkins-bot: SVG resources: Run svgo [mediawiki-config] - https://gerrit.wikimedia.org/r/841187 (https://phabricator.wikimedia.org/T320447) (owner: Urbanecm)
[09:07:00] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:841187|SVG resources: Run svgo (T320447)]]
[09:07:05] <stashbot>	 T320447: Run svgo for all SVG resources in operations/mediawiki-config - https://phabricator.wikimedia.org/T320447
[09:07:25] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:841187|SVG resources: Run svgo (T320447)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[09:07:54] <wikibugs>	 (PS9) Btullis: Add a new production images for spark and spark-operator [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730)
[09:08:21] <wikibugs>	 (CR) Btullis: Add a new production images for spark and spark-operator (9 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[09:11:25] <icinga-wm>	 RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:11:35] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:11:39] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:841187|SVG resources: Run svgo (T320447)]] (duration: 04m 38s)
[09:12:30] <jayme>	 !log re-enabled puppet on all kubernetes masters (incl. ml & dse)
[09:12:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:27] <wikibugs>	 (PS9) Urbanecm: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[09:17:41] <wikibugs>	 (CR) Urbanecm: [C: +2] logos: Cover wordmark/tagline in manage.py (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[09:18:29] <wikibugs>	 (Merged) jenkins-bot: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[09:19:52] <wikibugs>	 (PS2) Urbanecm: Replace wordmark/tagline with correct naming style [mediawiki-config] - https://gerrit.wikimedia.org/r/829561 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[09:20:14] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/829561 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[09:20:33] <nemo-yiannis>	 Hi, I don't see a dedicated deployment window for restbase. What would be a good time to push a deployment today?
[09:20:57] <wikibugs>	 (Merged) jenkins-bot: Replace wordmark/tagline with correct naming style [mediawiki-config] - https://gerrit.wikimedia.org/r/829561 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[09:21:21] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:829561|Replace wordmark/tagline with correct naming style (T307705)]]
[09:21:26] <stashbot>	 T307705: Extend mw-config's logos management system to also cover wordmarks (wmgSiteLogoWordmark) - https://phabricator.wikimedia.org/T307705
[09:21:44] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and stang: Backport for [[gerrit:829561|Replace wordmark/tagline with correct naming style (T307705)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[09:22:55] <wikibugs>	 (PS1) Daniel Kinzler: Beta: Switch VE on dewiki to direct mode [mediawiki-config] - https://gerrit.wikimedia.org/r/841858 (https://phabricator.wikimedia.org/T320531)
[09:22:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (6) High Kubernetes API latency (LIST deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:23:16] <wikibugs>	 SRE, Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (MoritzMuehlenhoff)
[09:24:14] <wikibugs>	 (CR) Hnowlan: [V: +2 C: +2] haproxy: fix apt repository path [docker-images/production-images] - https://gerrit.wikimedia.org/r/841477 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[09:24:21] <wikibugs>	 SRE, Infrastructure-Foundations: Decide on model for serving idm.wikimedia.org - https://phabricator.wikimedia.org/T320604 (MoritzMuehlenhoff)
[09:25:41] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:829561|Replace wordmark/tagline with correct naming style (T307705)]] (duration: 04m 20s)
[09:26:18] <moritzm>	 !log draining ganeti1017 T311687
[09:26:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:22] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[09:27:44] <wikibugs>	 (CR) D3r1ck01: Beta: Switch VE on dewiki to direct mode (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/841858 (https://phabricator.wikimedia.org/T320531) (owner: Daniel Kinzler)
[09:27:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (7) High Kubernetes API latency (LIST deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:28:43] <wikibugs>	 SRE, Infrastructure-Foundations: Figure out an HA setup for the IDM - https://phabricator.wikimedia.org/T320605 (MoritzMuehlenhoff)
[09:30:11] <wikibugs>	 (PS1) Daniel Kinzler: Beta: Enable parsoid cache warming. [mediawiki-config] - https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535)
[09:30:49] <wikibugs>	 (CR) CI reject: [V: -1] Beta: Enable parsoid cache warming. [mediawiki-config] - https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535) (owner: Daniel Kinzler)
[09:32:05] <wikibugs>	 (PS2) Daniel Kinzler: Beta: Enable parsoid cache warming. [mediawiki-config] - https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535)
[09:38:10] <wikibugs>	 (PS2) Daniel Kinzler: Beta: Switch VE on dewiki to direct mode [mediawiki-config] - https://gerrit.wikimedia.org/r/841858 (https://phabricator.wikimedia.org/T320531)
[09:39:57] <wikibugs>	 (CR) Kosta Harlan: "Hi Mohd and Santhosh, I made this patch as a follow-up from Id265f3ff87a80128c07e824b49f3b972df21e2d2; AIUI this code isn't called (?) cur" [extensions/ContentTranslation] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841509 (https://phabricator.wikimedia.org/T319327) (owner: Kosta Harlan)
[09:41:29] <wikibugs>	 (CR) Mabualruz: [C: +1] "Looks good to me" [extensions/ContentTranslation] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841509 (https://phabricator.wikimedia.org/T319327) (owner: Kosta Harlan)
[09:51:03] <wikibugs>	 SRE, Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Clement_Goubert) Hi, I'll be your SRE support for today, and will handle de/repooling, destroying th...
[09:52:57] <icinga-wm>	 RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:01:17] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply
[10:01:50] <wikibugs>	 (CR) Btullis: Add a new production images for spark and spark-operator (3 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[10:01:54] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[10:08:34] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] sre: issue confd per-template alerts [alerts] - https://gerrit.wikimedia.org/r/841549 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi)
[10:08:55] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] sre: issue confd per-template alerts [alerts] - https://gerrit.wikimedia.org/r/841549 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi)
[10:13:21] <wikibugs>	 SRE, Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Clement_Goubert) Destroy/apply done in staging: ` # helmfile -e staging status helmfile.yaml: basePa...
[10:16:11] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:19:39] <wikibugs>	 (PS1) Muehlenhoff: Update README.Debian to reflect latest changes for U2F/6.6/OIDC [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841864
[10:19:43] <wikibugs>	 (PS1) Jgiannelos: mobileapps: Bump to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/841865
[10:20:08] <wikibugs>	 (PS1) Filippo Giunchedi: confd: remove check_confd_template icinga check [puppet] - https://gerrit.wikimedia.org/r/841886 (https://phabricator.wikimedia.org/T314118)
[10:20:10] <wikibugs>	 (PS1) Filippo Giunchedi: WIP mediawiki: remove PHP7 icinga checks [puppet] - https://gerrit.wikimedia.org/r/841887 (https://phabricator.wikimedia.org/T314118)
[10:20:17] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 20115
[10:21:01] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 20115
[10:22:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye
[10:25:49] <wikibugs>	 (PS1) Volans: sre.hosts.provision: fix separator for boot order [cookbooks] - https://gerrit.wikimedia.org/r/841890
[10:26:50] <wikibugs>	 (CR) MVernon: [C: +1] "Once this is running in prod, I would like a test case adding so we can check we don't break it in future. But that (obviously) needn't bl" [puppet] - https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: MusikAnimal)
[10:29:00] <wikibugs>	 (PS1) Muehlenhoff: Remove Ganeti role from ganeti1005 [puppet] - https://gerrit.wikimedia.org/r/841892 (https://phabricator.wikimedia.org/T320419)
[10:29:17] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:29:35] <wikibugs>	 (PS1) Matthias Mullie: Enable NS_MAIN thumbnails only on wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/841893 (https://phabricator.wikimedia.org/T320510)
[10:30:25] <wikibugs>	 (Abandoned) Matthias Mullie: Explicitly set wgPageImagesNamespaces to none where disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/841133 (https://phabricator.wikimedia.org/T306883) (owner: Matthias Mullie)
[10:30:37] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[10:32:04] <wikibugs>	 (CR) Ladsgroup: "beta cluster is fine but before production, let's go through it together and make some optimizations. e.g. adding some logs would be nice." [mediawiki-config] - https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535) (owner: Daniel Kinzler)
[10:32:08] <wikibugs>	 (PS2) Matthias Mullie: Enable NS_MAIN thumbnails only on wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/841893 (https://phabricator.wikimedia.org/T320510)
[10:32:10] <wikibugs>	 (CR) Ladsgroup: [C: +1] Beta: Enable parsoid cache warming. [mediawiki-config] - https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535) (owner: Daniel Kinzler)
[10:32:45] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[10:33:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance
[10:33:11] <wikibugs>	 (CR) Cparle: [C: +1] Enable NS_MAIN thumbnails only on wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/841893 (https://phabricator.wikimedia.org/T320510) (owner: Matthias Mullie)
[10:33:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance
[10:33:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance
[10:33:24] <claime>	 !log depooling eventstreams in codfw - T310721
[10:33:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:29] <stashbot>	 T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721
[10:33:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance
[10:33:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T318955)', diff saved to https://phabricator.wikimedia.org/P35418 and previous config saved to /var/cache/conftool/dbconfig/20221012-103338-ladsgroup.json
[10:33:42] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[10:33:51] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventstreams,name=codfw
[10:35:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[10:36:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T318955)', diff saved to https://phabricator.wikimedia.org/P35419 and previous config saved to /var/cache/conftool/dbconfig/20221012-103604-ladsgroup.json
[10:39:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[10:39:57] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:41:34] <logmsgbot>	 !log jgiannelos@deploy1002 Started deploy [restbase/deploy@0474832]: Update restbase to 1a02cdfb
[10:48:15] <wikibugs>	 SRE, Traffic, observability: ATS Request Error Ratio SLI shows negative values - https://phabricator.wikimedia.org/T320615 (Vgutierrez)
[10:48:27] <wikibugs>	 SRE, Traffic, observability: ATS Request Error Ratio SLI shows negative values - https://phabricator.wikimedia.org/T320615 (Vgutierrez) p:Triage→Medium
[10:49:23] <moritzm>	 !log installing dbus security updates
[10:49:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:08] <wikibugs>	 (PS1) DDesouza: Remove Research Incentive survey from eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/841895 (https://phabricator.wikimedia.org/T318331)
[10:51:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P35420 and previous config saved to /var/cache/conftool/dbconfig/20221012-105111-ladsgroup.json
[10:55:07] <wikibugs>	 (Abandoned) Muehlenhoff: Remove ganeti role from ganeti4004 [puppet] - https://gerrit.wikimedia.org/r/841124 (owner: Muehlenhoff)
[10:55:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye
[10:57:07] <claime>	 !log redeploying eventstreams codfw - T310721
[10:57:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:12] <stashbot>	 T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721
[10:58:00] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[10:58:46] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[11:00:28] <wikibugs>	 (PS3) Ayounsi: admin: add hshaikh and ptiwary to private-data users [puppet] - https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: AOkoth)
[11:01:43] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventstreams,name=codfw
[11:02:07] <claime>	 !log repooled eventstreams in codfw - T310721
[11:02:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P35421 and previous config saved to /var/cache/conftool/dbconfig/20221012-110617-ladsgroup.json
[11:06:53] <wikibugs>	 (PS1) Vgutierrez: mtail::atsbackend: Make sure that sli_total is always incremented [puppet] - https://gerrit.wikimedia.org/r/841896 (https://phabricator.wikimedia.org/T320615)
[11:07:20] <wikibugs>	 (PS2) Vgutierrez: mtail::atsbackend: Ensure that sli_total is always incremented [puppet] - https://gerrit.wikimedia.org/r/841896 (https://phabricator.wikimedia.org/T320615)
[11:07:22] <logmsgbot>	 !log jgiannelos@deploy1002 Finished deploy [restbase/deploy@0474832]: Update restbase to 1a02cdfb (duration: 25m 48s)
[11:08:52] <wikibugs>	 (CR) Santhosh: [C: +2] AddContributeCardEntryPoint: Use RequestContext::getMain [extensions/ContentTranslation] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841509 (https://phabricator.wikimedia.org/T319327) (owner: Kosta Harlan)
[11:09:16] <wikibugs>	 (CR) Muehlenhoff: "One nit inline, but looks good in general" [puppet] - https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: AOkoth)
[11:10:20] <wikibugs>	 (PS4) Ayounsi: admin: add hshaikh and ptiwary to private-data users [puppet] - https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: AOkoth)
[11:11:11] <moritzm>	 !log installing bind9 security updates on buster (client side tools/libs)
[11:11:14] <wikibugs>	 (CR) CI reject: [V: -1] admin: add hshaikh and ptiwary to private-data users [puppet] - https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: AOkoth)
[11:11:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:34] <wikibugs>	 (PS1) Ladsgroup: Add rename_flaggedrevs_indexes_T318950.py [software/schema-changes] - https://gerrit.wikimedia.org/r/841899 (https://phabricator.wikimedia.org/T318950)
[11:11:41] <wikibugs>	 (PS5) Ayounsi: admin: add hshaikh and ptiwary to private-data users [puppet] - https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: AOkoth)
[11:11:50] <wikibugs>	 (CR) Ayounsi: "thanks!" [puppet] - https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: AOkoth)
[11:16:00] <wikibugs>	 SRE, Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (JArguello-WMF)  @Clement_Goubert Thank you so much! Please let us know if there is anything we need...
[11:20:56] <wikibugs>	 SRE, Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Clement_Goubert) `eventstream` redeployed in codfw.  @JArguello-WMF Apart from checking everything i...
[11:21:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T318955)', diff saved to https://phabricator.wikimedia.org/P35422 and previous config saved to /var/cache/conftool/dbconfig/20221012-112124-ladsgroup.json
[11:21:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance
[11:21:29] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[11:21:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance
[11:21:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35423 and previous config saved to /var/cache/conftool/dbconfig/20221012-112146-ladsgroup.json
[11:24:02] <claime>	 !log depooling eventstreams in eqiad - T310721
[11:24:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:06] <stashbot>	 T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721
[11:24:14] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventstreams,name=eqiad
[11:28:43] <wikibugs>	 (Merged) jenkins-bot: AddContributeCardEntryPoint: Use RequestContext::getMain [extensions/ContentTranslation] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841509 (https://phabricator.wikimedia.org/T319327) (owner: Kosta Harlan)
[11:44:06] <claime>	 !log redeploying eventstreams eqiad - T310721
[11:44:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:11] <stashbot>	 T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721
[11:45:59] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[11:46:23] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[11:46:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35424 and previous config saved to /var/cache/conftool/dbconfig/20221012-114642-ladsgroup.json
[11:46:47] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[11:48:20] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: AOkoth)
[11:50:12] <claime>	 !log repooling eventstreams in eqiad - T310721
[11:50:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:17] <stashbot>	 T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721
[11:51:15] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventstreams,name=eqiad
[11:51:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1005.eqiad.wmnet with reason: Remove from cluster for eventual decom
[11:51:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1005.eqiad.wmnet with reason: Remove from cluster for eventual decom
[11:51:54] <wikibugs>	 (CR) Ayounsi: [C: +2] admin: add hshaikh and ptiwary to private-data users [puppet] - https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: AOkoth)
[11:52:21] <wikibugs>	 SRE, Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Clement_Goubert) `eventstream` redeployed in eqiad
[11:59:56] <wikibugs>	 SRE, Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Clement_Goubert) Everything looks healthy from my end, both are getting traffic and not throwing err...
[12:00:45] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (ayounsi) In progress→Resolved a:ayounsi Users added to the WMF LDAP group, as well as #wmf-nda. And the private-data-users in h...
[12:01:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P35425 and previous config saved to /var/cache/conftool/dbconfig/20221012-120148-ladsgroup.json
[12:08:36] <wikibugs>	 (CR) Muehlenhoff: [C: +2] maps: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840139 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[12:09:34] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[12:11:17] <wikibugs>	 (PS2) Muehlenhoff: dns: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837098 (https://phabricator.wikimedia.org/T308013)
[12:12:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS buster
[12:13:05] <wikibugs>	 (PS2) WMDE-Fisch: Enable show nearby feature on a small group of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/841854 (https://phabricator.wikimedia.org/T316782)
[12:14:22] <wikibugs>	 (CR) Svantje Lilienthal: [C: +1] Enable show nearby feature on a small group of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/841854 (https://phabricator.wikimedia.org/T316782) (owner: WMDE-Fisch)
[12:15:45] <wikibugs>	 (PS5) Stang: Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705)
[12:16:53] <wikibugs>	 (Abandoned) Stang: Re-download and optimize wordmark/tagline svg file [mediawiki-config] - https://gerrit.wikimedia.org/r/829760 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[12:16:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P35426 and previous config saved to /var/cache/conftool/dbconfig/20221012-121655-ladsgroup.json
[12:17:59] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (ayounsi) Hi @KFrancis could you confirm that "User has a valid NDA on file with WMF legal" ?  Thanks!
[12:19:25] <wikibugs>	 (CR) Stang: "Also fix some merge conflict.." [mediawiki-config] - https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[12:19:48] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove Ganeti role from ganeti1005 [puppet] - https://gerrit.wikimedia.org/r/841892 (https://phabricator.wikimedia.org/T320419) (owner: Muehlenhoff)
[12:20:03] <wikibugs>	 SRE, serviceops-radar, SRE Observability (FY2022/2023-Q1), User-fgiunchedi: Alert on individual pybal backend hosts being down for a long time - https://phabricator.wikimedia.org/T320627 (fgiunchedi)
[12:25:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[12:28:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[12:28:20] <wikibugs>	 (CR) Vgutierrez: [C: +2] mtail::atsbackend: Ensure that sli_total is always incremented [puppet] - https://gerrit.wikimedia.org/r/841896 (https://phabricator.wikimedia.org/T320615) (owner: Vgutierrez)
[12:28:40] <wikibugs>	 (CR) Daniel Kinzler: Beta: Enable parsoid cache warming. (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535) (owner: Daniel Kinzler)
[12:32:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35427 and previous config saved to /var/cache/conftool/dbconfig/20221012-123201-ladsgroup.json
[12:32:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance
[12:32:07] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[12:32:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance
[12:32:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T318955)', diff saved to https://phabricator.wikimedia.org/P35428 and previous config saved to /var/cache/conftool/dbconfig/20221012-123223-ladsgroup.json
[12:32:25] <wikibugs>	 Puppet, Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (SLyngshede-WMF) >  all the affected hosts are on stretch, but of the ~375 hosts we still have on stretch those are the o...
[12:33:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[12:36:40] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (ayounsi) Nevermind, found the spreadsheet, NDA is there.  @odimitrijevic or @Ottomata I need your approval as the request is for `analytics-privatedata-users`
[12:37:02] <wikibugs>	 Puppet, Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (MoritzMuehlenhoff) >>! In T290984#8311170, @SLyngshede-WMF wrote: >>  all the affected hosts are on stretch, but of the...
[12:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[12:39:00] <wikibugs>	 Puppet, Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (SLyngshede-WMF) In progress→Resolved
[12:40:44] <wikibugs>	 SRE, Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Ottomata) > eventstreams-internal is still used?  I am not sure!  I'd imagine folks use it, as it is...
[12:41:17] <wikibugs>	 Puppet, Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (SLyngshede-WMF) Closed due to Stretch hosts having gone away.
[12:41:24] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:41:32] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (Ottomata) Approved.
[12:42:12] <wikibugs>	 (PS1) Filippo Giunchedi: sre: test warning on pybal backends being down for long [alerts] - https://gerrit.wikimedia.org/r/841905 (https://phabricator.wikimedia.org/T320627)
[12:42:33] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (ayounsi)
[12:43:30] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff)
[12:45:33] <wikibugs>	 (CR) Muehlenhoff: [C: +2] dns: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/837098 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[12:46:41] <wikibugs>	 (PS1) Ayounsi: admin: add manuel-wmde to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/841907 (https://phabricator.wikimedia.org/T320504)
[12:48:50] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Create a cookbook to switch an instance to DRBD/plain disk storage - https://phabricator.wikimedia.org/T312116 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff The cookbook has been created as  sre.ganeti.changedisk and works fine.
[12:48:52] <wikibugs>	 SRE, Ganeti: Cookbooks for Ganeti maintenance tasks - https://phabricator.wikimedia.org/T283319 (MoritzMuehlenhoff)
[12:49:03] <wikibugs>	 (PS1) JMeybohm: dragonfly::dfdaemon: Fix dummy ssl_paths object [puppet] - https://gerrit.wikimedia.org/r/841908
[12:53:32] <wikibugs>	 (CR) MVernon: [C: +1] "LGTM :)" [puppet] - https://gerrit.wikimedia.org/r/841907 (https://phabricator.wikimedia.org/T320504) (owner: Ayounsi)
[12:53:36] <wikibugs>	 (PS2) JMeybohm: dragonfly::dfdaemon: Fix dummy ssl_paths object [puppet] - https://gerrit.wikimedia.org/r/841908
[12:54:05] <wikibugs>	 (CR) Ayounsi: [C: +2] admin: add manuel-wmde to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/841907 (https://phabricator.wikimedia.org/T320504) (owner: Ayounsi)
[12:55:41] <wikibugs>	 (PS1) Jelto: gitlab_runner: restrict all internal traffic, not only TCP [puppet] - https://gerrit.wikimedia.org/r/841910 (https://phabricator.wikimedia.org/T295481)
[12:55:46] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:56:14] <wikibugs>	 (CR) CI reject: [V: -1] gitlab_runner: restrict all internal traffic, not only TCP [puppet] - https://gerrit.wikimedia.org/r/841910 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[12:56:36] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (ayounsi) Open→Resolved a:ayounsi Give it 30min for the change to propagate and you should be all set. Please re-open if there are any issues.
[12:56:45] <wikibugs>	 (PS2) Jelto: gitlab_runner: restrict all internal traffic, not only TCP [puppet] - https://gerrit.wikimedia.org/r/841910 (https://phabricator.wikimedia.org/T295481)
[12:57:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T318955)', diff saved to https://phabricator.wikimedia.org/P35429 and previous config saved to /var/cache/conftool/dbconfig/20221012-125725-ladsgroup.json
[12:57:31] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[12:58:50] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:58:52] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37513/console" [puppet] - https://gerrit.wikimedia.org/r/841910 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1300).
[13:00:04] <jouncebot>	 WMDE-Fisch, danisztls, and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:14] <urbanecm>	 i can deploy today!
[13:00:32] <WMDE-Fisch>	 o/
[13:00:36] <urbanecm>	 hi WMDE-Fisch!
[13:00:40] <koi>	 o/
[13:00:54] <WMDE-Fisch>	 Hi urbanecm. Would be great if you could deploy mine at least.
[13:01:12] <urbanecm>	 sure
[13:01:20] <wikibugs>	 (PS3) Urbanecm: Enable show nearby feature on a small group of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/841854 (https://phabricator.wikimedia.org/T316782) (owner: WMDE-Fisch)
[13:01:23] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/841854 (https://phabricator.wikimedia.org/T316782) (owner: WMDE-Fisch)
[13:02:11] <urbanecm>	 koi: hi! should we also do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/829764? or do you prefer to do it at a later date?
[13:02:33] <koi>	 urbanecm: I prefer a later date :)
[13:02:39] <urbanecm>	 sounds good
[13:04:26] <wikibugs>	 (CR) Herron: [C: +1] rsyslog::conf remove trailing newline logic [puppet] - https://gerrit.wikimedia.org/r/841583 (https://phabricator.wikimedia.org/T320569) (owner: JHathaway)
[13:04:27] <urbanecm>	 I'm pleasantly surprised how well scap backport handles things. got the "unexpected commits" screen
[13:04:58] <logmsgbot>	 !log urbanecm@deploy1002 Backport cancelled.
[13:04:59] <WMDE-Fisch>	 o.O
[13:05:05] <wikibugs>	 (PS1) Urbanecm: Revert "AddContributeCardEntryPoint: Use RequestContext::getMain" [extensions/ContentTranslation] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841872 (https://phabricator.wikimedia.org/T319327)
[13:05:14] <wikibugs>	 (CR) Urbanecm: [V: +2 C: +2] Revert "AddContributeCardEntryPoint: Use RequestContext::getMain" [extensions/ContentTranslation] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841872 (https://phabricator.wikimedia.org/T319327) (owner: Urbanecm)
[13:05:18] <Amir1>	 jouncebot: nowandnext
[13:05:18] <jouncebot>	 For the next 0 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1300)
[13:05:19] <jouncebot>	 In 4 hour(s) and 54 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1800)
[13:05:19] <jouncebot>	 In 4 hour(s) and 54 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1800)
[13:05:31] <Amir1>	 urbanecm: let me know once you're done
[13:05:33] <wikibugs>	 (CR) TrainBranchBot: "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/841854 (https://phabricator.wikimedia.org/T316782) (owner: WMDE-Fisch)
[13:05:41] <logmsgbot>	 !log urbanecm@deploy1002 backport aborted:  (duration: 00m 09s)
[13:05:45] <urbanecm>	 Amir1: will do
[13:05:49] <wikibugs>	 (PS1) Jelto: gitlab_runner: add webproxy to allowed_services [puppet] - https://gerrit.wikimedia.org/r/841912 (https://phabricator.wikimedia.org/T295481)
[13:05:54] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/841854 (https://phabricator.wikimedia.org/T316782) (owner: WMDE-Fisch)
[13:06:19] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:841854|Enable show nearby feature on a small group of wikis (T316782)]]
[13:06:23] <stashbot>	 T316782: Deploy Show Nearby feature to small group of wikis - https://phabricator.wikimedia.org/T316782
[13:06:43] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and wmde-fisch: Backport for [[gerrit:841854|Enable show nearby feature on a small group of wikis (T316782)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[13:06:57] <urbanecm>	 matthiasmullie: can you test at mwdebug1001, please?
[13:07:07] <wikibugs>	 (PS9) Ottomata: charts:eventgate bump common_templates and standardize labels [deployment-charts] - https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[13:08:10] <wikibugs>	 (CR) Ottomata: [C: +1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[13:08:11] <WMDE-Fisch>	 urbanecm: Doing that now ;-)
[13:08:15] <urbanecm>	 eh, sorry
[13:08:22] <urbanecm>	 thanks
[13:08:56] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37514/console" [puppet] - https://gerrit.wikimedia.org/r/841912 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[13:09:14] <WMDE-Fisch>	 urbanecm: Works like a charm. Go on please!
[13:09:18] <urbanecm>	 syncing!
[13:09:57] <moritzm>	 !log draining ganeti1007 T320419
[13:10:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:02] <stashbot>	 T320419: decommission ganeti1005/ganeti1006/ganeti1007/ganeti1008 - https://phabricator.wikimedia.org/T320419
[13:10:53] <wikibugs>	 (PS6) Urbanecm: Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[13:11:00] <wikibugs>	 (CR) Urbanecm: [C: +2] Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[13:11:56] <wikibugs>	 (PS1) Muehlenhoff: Remove ganeti role from ganeti1007 [puppet] - https://gerrit.wikimedia.org/r/841914 (https://phabricator.wikimedia.org/T320419)
[13:12:22] * urbanecm doesn't see danisztls, will skip their patch
[13:12:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P35430 and previous config saved to /var/cache/conftool/dbconfig/20221012-131232-ladsgroup.json
[13:13:22] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:841854|Enable show nearby feature on a small group of wikis (T316782)]] (duration: 07m 03s)
[13:13:27] <stashbot>	 T316782: Deploy Show Nearby feature to small group of wikis - https://phabricator.wikimedia.org/T316782
[13:13:30] <urbanecm>	 WMDE-Fisch: should be live!
[13:13:33] <wikibugs>	 (Merged) jenkins-bot: Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[13:13:39] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[13:14:00] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:829563|Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php (T307705)]]
[13:14:04] <stashbot>	 T307705: Extend mw-config's logos management system to also cover wordmarks (wmgSiteLogoWordmark) - https://phabricator.wikimedia.org/T307705
[13:14:23] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and stang: Backport for [[gerrit:829563|Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php (T307705)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[13:14:34] <urbanecm>	 koi: live at mwdebug1001, can you check please?
[13:14:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts d-i-test.eqiad.wmnet
[13:14:58] <koi>	 wondering how to check this... randomly pick some sites?
[13:15:26] <urbanecm>	 koi: yeah
[13:15:59] <wikibugs>	 (CR) Ladsgroup: [C: +1] Beta: Enable parsoid cache warming. (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535) (owner: Daniel Kinzler)
[13:16:05] <danisztls>	 Hi. I'm late. Is there still time?
[13:16:54] <koi>	 urbanecm: I checked nowikimedia, bnwikibooks, zhwiki, wikidatawiki, no issue found, so LGTM
[13:16:59] <urbanecm>	 great!
[13:17:05] <urbanecm>	 danisztls: yup yup
[13:17:12] <urbanecm>	 the https://gerrit.wikimedia.org/r/c/841895/, right?
[13:17:26] <danisztls>	 urbanecm: yes :)
[13:18:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[13:18:54] <wikibugs>	 (PS2) Urbanecm: Remove Research Incentive survey from eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/841895 (https://phabricator.wikimedia.org/T318331) (owner: DDesouza)
[13:19:04] <wikibugs>	 (CR) Urbanecm: [C: +2] Remove Research Incentive survey from eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/841895 (https://phabricator.wikimedia.org/T318331) (owner: DDesouza)
[13:20:35] <wikibugs>	 (Merged) jenkins-bot: Remove Research Incentive survey from eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/841895 (https://phabricator.wikimedia.org/T318331) (owner: DDesouza)
[13:21:07] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:829563|Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php (T307705)]] (duration: 07m 06s)
[13:21:11] <stashbot>	 T307705: Extend mw-config's logos management system to also cover wordmarks (wmgSiteLogoWordmark) - https://phabricator.wikimedia.org/T307705
[13:21:16] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/841895 (https://phabricator.wikimedia.org/T318331) (owner: DDesouza)
[13:21:26] <urbanecm>	 koi: and live!
[13:21:33] <koi>	 thanks!
[13:21:37] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:841895|Remove Research Incentive survey from eswiki (T318331)]]
[13:21:42] <stashbot>	 T318331: Deploy Research Incentive Survey on Spanish Wikipedia - https://phabricator.wikimedia.org/T318331
[13:21:52] <wikibugs>	 (CR) Jelto: [V: +1] "a little bit more context in T295481#8311437" [puppet] - https://gerrit.wikimedia.org/r/841910 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[13:22:01] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and dani: Backport for [[gerrit:841895|Remove Research Incentive survey from eswiki (T318331)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[13:22:01] <urbanecm>	 let me just purge the static resources cache too
[13:22:09] <urbanecm>	 danisztls: your patch is at mwdebug1001, please test
[13:22:24] <danisztls>	 urbanecm: lgtm
[13:22:40] <wikibugs>	 (PS1) Vgutierrez: mtail::atsbackend: Fix TTFB regex [puppet] - https://gerrit.wikimedia.org/r/841917 (https://phabricator.wikimedia.org/T320615)
[13:23:10] <urbanecm>	 great, syncing
[13:23:24] <urbanecm>	 purged
[13:24:10] <wikibugs>	 (PS2) Vgutierrez: mtail::atsbackend: Fix TTFB regex [puppet] - https://gerrit.wikimedia.org/r/841917 (https://phabricator.wikimedia.org/T320615)
[13:24:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:24:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts d-i-test.eqiad.wmnet
[13:24:34] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `d-i-test.eqiad.wmnet` - d-i-test.eqiad.wmnet (**WARN**)   - //Host not found on Icinga, una...
[13:25:45] <wikibugs>	 (PS1) Muehlenhoff: Remove d-i-test Puppet references [puppet] - https://gerrit.wikimedia.org/r/841918
[13:25:48] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:26:59] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:841895|Remove Research Incentive survey from eswiki (T318331)]] (duration: 05m 21s)
[13:27:03] <stashbot>	 T318331: Deploy Research Incentive Survey on Spanish Wikipedia - https://phabricator.wikimedia.org/T318331
[13:27:08] <urbanecm>	 danisztls: and live
[13:27:08] <wikibugs>	 (CR) Herron: [C: +1] "LGTM although would be nice to update ACLs eventually as well" [puppet] - https://gerrit.wikimedia.org/r/841542 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[13:27:16] <danisztls>	 urbanecm: thanks!
[13:27:18] <urbanecm>	 Amir1: I'm done, over to you :)
[13:27:29] <wikibugs>	 SRE, observability: Overlap between "check systemd state" alert and "check unit status of <unit>" - https://phabricator.wikimedia.org/T319304 (fgiunchedi)
[13:27:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P35431 and previous config saved to /var/cache/conftool/dbconfig/20221012-132738-ladsgroup.json
[13:27:52] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (fgiunchedi)
[13:28:03] <Amir1>	 thanks
[13:28:09] <Amir1>	 I'm going to mess with mwdebug1001
[13:30:12] <wikibugs>	 (PS1) Muehlenhoff: Remove d-i-test from special handling [software/netbox-extras] - https://gerrit.wikimedia.org/r/841919
[13:30:30] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove d-i-test Puppet references [puppet] - https://gerrit.wikimedia.org/r/841918 (owner: Muehlenhoff)
[13:34:29] <wikibugs>	 (PS1) Ladsgroup: rdbms: Instead of reconfiguring all of LB, just remove depooled db [core] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841873 (https://phabricator.wikimedia.org/T298485)
[13:35:23] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/841890 (owner: Volans)
[13:36:40] <wikibugs>	 (CR) Ladsgroup: [C: +2] "Deploying to mwdebug only to test depool, will revert afterwards." [core] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841873 (https://phabricator.wikimedia.org/T298485) (owner: Ladsgroup)
[13:36:43] <wikibugs>	 (CR) Volans: [C: +2] sre.hosts.provision: fix separator for boot order [cookbooks] - https://gerrit.wikimedia.org/r/841890 (owner: Volans)
[13:41:29] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.provision: fix separator for boot order [cookbooks] - https://gerrit.wikimedia.org/r/841890 (owner: Volans)
[13:42:10] <wikibugs>	 (CR) MVernon: [C: +2] deploy swift_ring_manager to deployment-prep [puppet] - https://gerrit.wikimedia.org/r/828664 (https://phabricator.wikimedia.org/T316845) (owner: Zabe)
[13:42:40] <wikibugs>	 SRE-OnFire, Observability-Alerting, Discovery-Search (Current work), Sustainability (Incident Followup): Improve Search team alerting for missing masters - https://phabricator.wikimedia.org/T313095 (bking) a:bking→None
[13:42:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T318955)', diff saved to https://phabricator.wikimedia.org/P35432 and previous config saved to /var/cache/conftool/dbconfig/20221012-134245-ladsgroup.json
[13:42:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance
[13:42:50] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[13:43:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance
[13:43:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35433 and previous config saved to /var/cache/conftool/dbconfig/20221012-134306-ladsgroup.json
[13:43:30] <wikibugs>	 (CR) Zabe: [C: +1] swift: Add deployment-prep_hosts.yaml [puppet] - https://gerrit.wikimedia.org/r/836953 (https://phabricator.wikimedia.org/T316845) (owner: Samtar)
[13:43:55] <wikibugs>	 (PS1) Ssingh: hiera: use Linux 5.10 on cp4045 (buster) [puppet] - https://gerrit.wikimedia.org/r/841923 (https://phabricator.wikimedia.org/T319067)
[13:44:50] <wikibugs>	 (CR) MVernon: [C: +2] swift: Add deployment-prep_hosts.yaml [puppet] - https://gerrit.wikimedia.org/r/836953 (https://phabricator.wikimedia.org/T316845) (owner: Samtar)
[13:44:57] <wikibugs>	 Puppet, Infrastructure-Foundations: sslcert::x509_to_pkcs12 fails to overwrite a valid output file when its contents should change - https://phabricator.wikimedia.org/T287869 (SLyngshede-WMF) Open→Resolved Closed, @BTullis has submitted a patch and this hasn't been an issue since. We'll reopen th...
[13:46:42] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/841923 (https://phabricator.wikimedia.org/T319067) (owner: Ssingh)
[13:47:06] <wikibugs>	 (CR) Ssingh: [C: +2] hiera: use Linux 5.10 on cp4045 (buster) [puppet] - https://gerrit.wikimedia.org/r/841923 (https://phabricator.wikimedia.org/T319067) (owner: Ssingh)
[13:47:28] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED
[13:47:56] <wikibugs>	 (PS2) Stang: Drop unused wordmark/tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/829764 (https://phabricator.wikimedia.org/T307705)
[13:48:23] <wikibugs>	 (PS3) Stang: Drop unused wordmark/tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/829764 (https://phabricator.wikimedia.org/T307705)
[13:48:59] <wikibugs>	 (PS1) Filippo Giunchedi: systemd: drop timer-specific alert in favor of generic alert [puppet] - https://gerrit.wikimedia.org/r/841924 (https://phabricator.wikimedia.org/T303253)
[13:49:37] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS buster
[13:49:38] <wikibugs>	 SRE, ops-codfw, Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (Gehel)
[13:50:46] <wikibugs>	 (PS4) Stang: Drop unused wordmark/tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/829764 (https://phabricator.wikimedia.org/T307705)
[13:52:24] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841873 (https://phabricator.wikimedia.org/T298485) (owner: Ladsgroup)
[13:52:40] <wikibugs>	 (CR) Filippo Giunchedi: "Mostly a proposal, let me know what you think!" [puppet] - https://gerrit.wikimedia.org/r/841924 (https://phabricator.wikimedia.org/T303253) (owner: Filippo Giunchedi)
[13:52:47] <wikibugs>	 (Merged) jenkins-bot: rdbms: Instead of reconfiguring all of LB, just remove depooled db [core] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841873 (https://phabricator.wikimedia.org/T298485) (owner: Ladsgroup)
[13:53:12] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:841873|rdbms: Instead of reconfiguring all of LB, just remove depooled db (T298485)]]
[13:53:17] <stashbot>	 T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485
[13:53:36] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:841873|rdbms: Instead of reconfiguring all of LB, just remove depooled db (T298485)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[13:54:06] <icinga-wm>	 RECOVERY - Host ps1-b7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms
[13:54:48] <icinga-wm>	 RECOVERY - Host an-worker1098.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.53 ms
[13:54:54] <icinga-wm>	 RECOVERY - Host clouddumps1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.75 ms
[13:55:44] <icinga-wm>	 RECOVERY - Host cp1081.mgmt is UP: PING OK - Packet loss = 0%, RTA = 10.72 ms
[13:55:46] <icinga-wm>	 RECOVERY - Host ms-be1041.mgmt is UP: PING WARNING - Packet loss = 60%, RTA = 1.07 ms
[13:56:24] <icinga-wm>	 RECOVERY - Host cp1082.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.60 ms
[13:56:34] <icinga-wm>	 RECOVERY - Host ms-be1053.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.54 ms
[13:56:36] <icinga-wm>	 RECOVERY - Host restbase-dev1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 6.45 ms
[13:56:44] <icinga-wm>	 RECOVERY - Host an-worker1087.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms
[13:57:14] <icinga-wm>	 RECOVERY - Host clouddb1016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms
[13:57:24] <icinga-wm>	 RECOVERY - Host mw1313.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.11 ms
[13:57:24] <icinga-wm>	 RECOVERY - Host mw1315.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.97 ms
[13:59:26] <icinga-wm>	 RECOVERY - Host an-worker1086.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[13:59:48] <icinga-wm>	 RECOVERY - Host mw1314.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
[13:59:48] <icinga-wm>	 RECOVERY - Host mw1316.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms
[14:00:10] <icinga-wm>	 RECOVERY - Host ores1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms
[14:01:24] <icinga-wm>	 RECOVERY - Host analytics1073.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms
[14:01:24] <icinga-wm>	 RECOVERY - Host cloudvirt1017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms
[14:01:24] <icinga-wm>	 RECOVERY - Host cloudvirt1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
[14:01:25] <icinga-wm>	 RECOVERY - Host cloudvirt1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[14:02:58] <icinga-wm>	 RECOVERY - Host lvs1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.49 ms
[14:03:12] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, thx!" [software/netbox-extras] - https://gerrit.wikimedia.org/r/841919 (owner: Muehlenhoff)
[14:04:47] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED
[14:06:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1175', diff saved to https://phabricator.wikimedia.org/P35434 and previous config saved to /var/cache/conftool/dbconfig/20221012-140626-ladsgroup.json
[14:07:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool db1175', diff saved to https://phabricator.wikimedia.org/P35435 and previous config saved to /var/cache/conftool/dbconfig/20221012-140746-ladsgroup.json
[14:08:14] <logmsgbot>	 !log ladsgroup@deploy1002 Sync cancelled.
[14:08:32] <wikibugs>	 (PS1) Ladsgroup: Revert "rdbms: Instead of reconfiguring all of LB, just remove depooled db" [core] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841874
[14:08:38] <wikibugs>	 (CR) Ladsgroup: [C: +2] Revert "rdbms: Instead of reconfiguring all of LB, just remove depooled db" [core] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841874 (owner: Ladsgroup)
[14:09:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35436 and previous config saved to /var/cache/conftool/dbconfig/20221012-140903-ladsgroup.json
[14:09:08] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[14:10:57] <wikibugs>	 (PS3) Ori: add profile::docker::gvisor [puppet] - https://gerrit.wikimedia.org/r/841575 (https://phabricator.wikimedia.org/T316706)
[14:13:32] <wikibugs>	 (PS1) David Caro: wmcs.toolforge.grid: get also the job logs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/841930
[14:15:57] <Amir1>	 duesen: finally the live depool works ^^
[14:16:10] <Amir1>	 basically this is needed https://gerrit.wikimedia.org/r/c/mediawiki/core/+/828577/
[14:17:11] <wikibugs>	 (CR) CI reject: [V: -1] wmcs.toolforge.grid: get also the job logs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/841930 (owner: David Caro)
[14:18:07] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4045.ulsfo.wmnet with OS buster
[14:18:36] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: Investigate issue with msw-b7-eqiad - https://phabricator.wikimedia.org/T320598 (cmooney) Open→Resolved a:cmooney @Jclark-ctr has replaced the switch and devices are now back online: `lines=10 cmooney@msw1-eqiad> show ethernet-switch...
[14:19:23] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS buster
[14:24:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P35438 and previous config saved to /var/cache/conftool/dbconfig/20221012-142410-ladsgroup.json
[14:29:07] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "rdbms: Instead of reconfiguring all of LB, just remove depooled db" [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841874 (owner: 10Ladsgroup)
[14:30:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841874 (owner: 10Ladsgroup)
[14:30:41] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:841874|Revert "rdbms: Instead of reconfiguring all of LB, just remove depooled db"]]
[14:31:04] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:841874|Revert "rdbms: Instead of reconfiguring all of LB, just remove depooled db"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[14:34:15] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] mtail::atsbackend: Fix TTFB regex [puppet] - 10https://gerrit.wikimedia.org/r/841917 (https://phabricator.wikimedia.org/T320615) (owner: 10Vgutierrez)
[14:35:18] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:841874|Revert "rdbms: Instead of reconfiguring all of LB, just remove depooled db"]] (duration: 04m 37s)
[14:39:04] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[14:39:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P35439 and previous config saved to /var/cache/conftool/dbconfig/20221012-143917-ladsgroup.json
[14:39:29] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[14:41:17] <wikibugs>	 (03PS2) 10DLynch: Register the editattempt_block schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833442 (https://phabricator.wikimedia.org/T310390)
[14:42:50] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T320645 (10phaultfinder)
[14:45:39] <icinga-wm>	 RECOVERY - DNS on lvs1018.mgmt is OK: DNS OK: 0.017 seconds response time. lvs1018.mgmt.eqiad.wmnet returns 10.65.1.209 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:49:55] <icinga-wm>	 RECOVERY - DNS on elastic1085.mgmt is OK: DNS OK: 0.010 seconds response time. elastic1085.mgmt.eqiad.wmnet returns 10.65.1.222 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:49:55] <icinga-wm>	 RECOVERY - DNS on elastic1086.mgmt is OK: DNS OK: 0.010 seconds response time. elastic1086.mgmt.eqiad.wmnet returns 10.65.1.223 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:50:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Install 5.10 in late_setup.sh for next Gen PowerEdges [puppet] - 10https://gerrit.wikimedia.org/r/841936 (https://phabricator.wikimedia.org/T319067)
[14:50:39] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (10Manuel) Thank you all!
[14:53:45] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] Install 5.10 in late_setup.sh for next Gen PowerEdges [puppet] - 10https://gerrit.wikimedia.org/r/841936 (https://phabricator.wikimedia.org/T319067) (owner: 10Muehlenhoff)
[14:54:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35440 and previous config saved to /var/cache/conftool/dbconfig/20221012-145423-ladsgroup.json
[14:54:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance
[14:54:29] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[14:54:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance
[14:54:43] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Install 5.10 in late_setup.sh for next Gen PowerEdges (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/841936 (https://phabricator.wikimedia.org/T319067) (owner: 10Muehlenhoff)
[14:54:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T318955)', diff saved to https://phabricator.wikimedia.org/P35441 and previous config saved to /var/cache/conftool/dbconfig/20221012-145445-ladsgroup.json
[14:56:11] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4045.ulsfo.wmnet with OS buster
[14:56:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Install 5.10 in late_setup.sh for next Gen PowerEdges [puppet] - 10https://gerrit.wikimedia.org/r/841936 (https://phabricator.wikimedia.org/T319067) (owner: 10Muehlenhoff)
[14:57:06] <claime>	 !log depooling eventstreams-internal codfw - T310721
[14:57:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:11] <stashbot>	 T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721
[14:57:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T318955)', diff saved to https://phabricator.wikimedia.org/P35442 and previous config saved to /var/cache/conftool/dbconfig/20221012-145711-ladsgroup.json
[14:57:21] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventstreams-internal,name=codfw
[14:57:51] <icinga-wm>	 RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:00:39] <wikibugs>	 (03PS4) 10Elukey: istio: reduce Envoy logspam [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/841527 (https://phabricator.wikimedia.org/T320468)
[15:00:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Install 5.10 in late_setup.sh for next Gen PowerEdges (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/841936 (https://phabricator.wikimedia.org/T319067) (owner: 10Muehlenhoff)
[15:03:01] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS buster
[15:03:21] <sukhe>	 ^ third time's a charm
[15:03:27] <icinga-wm>	 RECOVERY - DNS on an-worker1130.mgmt is OK: DNS OK: 0.016 seconds response time. an-worker1130.mgmt.eqiad.wmnet returns 10.65.0.156 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:07:03] <claime>	 !log redeploying eventstreams-internal codfw - T310721
[15:07:03] <wikibugs>	 (03PS23) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[15:07:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:07] <stashbot>	 T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721
[15:07:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Please let me know what you think! The background/context is reducing per-host IRC alert spam, while at the same time keep the alerts rele" [alerts] - 10https://gerrit.wikimedia.org/r/841905 (https://phabricator.wikimedia.org/T320627) (owner: 10Filippo Giunchedi)
[15:07:23] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply
[15:07:45] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply
[15:08:01] <icinga-wm>	 RECOVERY - DNS on kafka-main1002.mgmt is OK: DNS OK: 0.010 seconds response time. kafka-main1002.mgmt.eqiad.wmnet returns 10.65.3.130 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:08:13] <wikibugs>	 (03PS1) 10Volans: sre.hosts.provision: make errors more explicit [cookbooks] - 10https://gerrit.wikimedia.org/r/841938
[15:09:20] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventstreams-internal,name=codfw
[15:09:34] <claime>	 !log repooled eventstreams-internal codfw - T310721
[15:09:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P35443 and previous config saved to /var/cache/conftool/dbconfig/20221012-151217-ladsgroup.json
[15:13:09] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/841527 (https://phabricator.wikimedia.org/T320468) (owner: 10Elukey)
[15:14:37] <wikibugs>	 (03PS3) 10Matthias Mullie: Enable NS_MAIN thumbnails only on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841893 (https://phabricator.wikimedia.org/T320510)
[15:15:26] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [restbase/deploy@2d002b3]: Add ig,bcl,bn,tl wikiquote, ig wiktionary T314641
[15:15:31] <stashbot>	 T314641: Add igwikiquote to RESTBase - https://phabricator.wikimedia.org/T314641
[15:15:51] <icinga-wm>	 RECOVERY - DNS on dbprov1002.mgmt is OK: DNS OK: 0.011 seconds response time. dbprov1002.mgmt.eqiad.wmnet returns 10.65.3.18 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:16:48] <claime>	 !log depooling eventstreams-internal eqiad - T310721
[15:16:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:53] <stashbot>	 T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721
[15:16:55] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventstreams-internal,name=eqiad
[15:18:51] <wikibugs>	 (03PS13) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196)
[15:19:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[15:22:12] <wikibugs>	 (03PS2) 10Cathal Mooney: Add section for PIC config of QFX5120-48Y port block speeds [homer/public] - 10https://gerrit.wikimedia.org/r/840105 (https://phabricator.wikimedia.org/T303529)
[15:23:04] <wikibugs>	 (03PS14) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196)
[15:23:29] <claime>	 !log redeploying eventstreams-internal eqiad - T310721
[15:23:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:34] <stashbot>	 T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721
[15:23:47] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[15:23:54] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[15:24:14] <wikibugs>	 (03CR) 10Cathal Mooney: Add section for PIC config of QFX5120-48Y port block speeds (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/840105 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney)
[15:24:36] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[15:24:41] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[15:25:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (one typo inline)" [puppet] - 10https://gerrit.wikimedia.org/r/841577 (owner: 10Jbond)
[15:25:07] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4045.ulsfo.wmnet with reason: host reimage
[15:26:33] <ottomata>	 !log remove materialized .json files from schemas/event/primary - this should be a no-op as no clients should actually be using the json files. - T315674
[15:26:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:37] <stashbot>	 T315674: Remove materialized .json files from event schema repositories - https://phabricator.wikimedia.org/T315674
[15:26:37] <wikibugs>	 (03PS1) 10Urbanecm: eswiki: Deploy mentorship to only 15% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841939 (https://phabricator.wikimedia.org/T285235)
[15:26:40] <urbanecm>	 jouncebot: nowandnext
[15:26:41] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 33 minute(s)
[15:26:41] <jouncebot>	 In 2 hour(s) and 33 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1800)
[15:26:41] <jouncebot>	 In 2 hour(s) and 33 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1800)
[15:26:55] <wikibugs>	 (03CR) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney)
[15:27:01] <urbanecm>	 ^^going to ship the above, it's time-sensitive for Growth^^
[15:27:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P35444 and previous config saved to /var/cache/conftool/dbconfig/20221012-152724-ladsgroup.json
[15:27:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841939 (https://phabricator.wikimedia.org/T285235) (owner: 10Urbanecm)
[15:27:54] <wikibugs>	 (03CR) 10Hnowlan: thumbor: new service chart (0318 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[15:28:27] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams-internal_4992: Servers kubernetes1008.eqiad.wmnet, kubernetes1020.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/Py
[15:28:35] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4045.ulsfo.wmnet with reason: host reimage
[15:29:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job swagger_check_eventstreams_internal_cluster_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:30:26] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[15:30:41] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[15:31:15] <wikibugs>	 (03PS1) 10Vgutierrez: trafficserver: Partition cache in one server per DC and cluster #6 [puppet] - 10https://gerrit.wikimedia.org/r/841941 (https://phabricator.wikimedia.org/T317748)
[15:31:28] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [restbase/deploy@2d002b3]: Add ig,bcl,bn,tl wikiquote, ig wiktionary T314641 (duration: 16m 02s)
[15:31:32] <stashbot>	 T314641: Add igwikiquote to RESTBase - https://phabricator.wikimedia.org/T314641
[15:32:46] <wikibugs>	 (03Merged) 10jenkins-bot: eswiki: Deploy mentorship to only 15% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841939 (https://phabricator.wikimedia.org/T285235) (owner: 10Urbanecm)
[15:32:49] <urbanecm>	 finally
[15:33:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Bump version in Chart.yaml too, otherwise the changes will not be deployable." [deployment-charts] - 10https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto)
[15:33:09] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:841939|eswiki: Deploy mentorship to only 15% of users (T285235)]]
[15:33:15] <stashbot>	 T285235: Activate Growth mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T285235
[15:33:27] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:33:32] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:841939|eswiki: Deploy mentorship to only 15% of users (T285235)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[15:33:33] <wikibugs>	 (03PS1) 10Stang: Re-download and optimize wordmark/tagline svg file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841942 (https://phabricator.wikimedia.org/T307705)
[15:33:50] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventstreams-internal,name=eqiad
[15:34:13] <claime>	 Sorry for the eventstreams-internal alarms
[15:34:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job swagger_check_eventstreams_internal_cluster_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:37:33] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:841939|eswiki: Deploy mentorship to only 15% of users (T285235)]] (duration: 04m 23s)
[15:39:11] <wikibugs>	 10SRE, 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10Clement_Goubert) `eventstreams-internal` fully redeployed, this task can probably be closed now.
[15:39:19] <jinxer-wm>	 (CertAlmostExpired) resolved: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:39:48] <jinxer-wm>	 (CertAlmostExpired) firing: Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:40:08] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37516/console" [puppet] - 10https://gerrit.wikimedia.org/r/841941 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez)
[15:42:24] <wikibugs>	 10SRE, 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10Ottomata) Thank you so much!
[15:42:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T318955)', diff saved to https://phabricator.wikimedia.org/P35445 and previous config saved to /var/cache/conftool/dbconfig/20221012-154230-ladsgroup.json
[15:42:32] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] istio: reduce Envoy logspam [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/841527 (https://phabricator.wikimedia.org/T320468) (owner: 10Elukey)
[15:42:36] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[15:43:39] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:44:48] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:45:21] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Partition cache in one server per DC and cluster #6 [puppet] - 10https://gerrit.wikimedia.org/r/841941 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez)
[15:45:45] <vgutierrez>	 !log partitioning the ATS cache in cp[2031-2032], cp[6002,6010], cp[1079-1080], cp[5003,5009], cp[3054-3055], cp[4023,4032] - T317748
[15:45:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:51] <stashbot>	 T317748: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748
[15:46:16] <wikibugs>	 (03Abandoned) 10Jdlrobson: EXPECTED VISUAL CHANGES IN WMF.4 [skins/Vector] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838818 (https://phabricator.wikimedia.org/T317573) (owner: 10Jdlrobson)
[15:46:50] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host cp4045.ulsfo.wmnet with OS buster
[15:47:08] <sukhe>	 eh
[15:47:28] <vgutierrez>	 :)
[15:47:33] <vgutierrez>	 stop breaking things sukhe ;P
[15:47:43] <volans>	 ll
[15:47:44] <volans>	 lol
[15:48:02] <sukhe>	 vgutierrez: next time I will be more careful!
[15:48:43] <vgutierrez>	 I think you missed the step where you're supposed to sing a lullaby to the newly installed server
[15:49:07] <sukhe>	 vgutierrez: I am going to sing this https://www.youtube.com/watch?v=dQw4w9WgXcQ
[15:49:19] <vgutierrez>	 ahhahaha
[15:50:27] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) @Jclark-ctr @Cmjohnson I am planning on moving all the links on cr[1-2]-eqiad from fpc4 to fpc3 for the once in both cr1-eqiad from FPC4 to FPC3 and cr2...
[15:55:10] <wikibugs>	 (03PS1) 10Volans: sre.hosts.reimage: increase Netbox polling [cookbooks] - 10https://gerrit.wikimedia.org/r/841943
[15:55:46] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Thank you for the quick patch!" [cookbooks] - 10https://gerrit.wikimedia.org/r/841943 (owner: 10Volans)
[15:55:50] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/840583
[16:00:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10Andrew) Let's back off of this plan for OSDs. The two nics on hypervisors are control plane and data plane, whereas on the OSDs they're both dataplane (on...
[16:01:13] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/840584
[16:01:45] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "Mostly UI changes, self-merging" [cookbooks] - 10https://gerrit.wikimedia.org/r/841938 (owner: 10Volans)
[16:01:52] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: increase Netbox polling [cookbooks] - 10https://gerrit.wikimedia.org/r/841943 (owner: 10Volans)
[16:03:31] <wikibugs>	 (03PS2) 10David Caro: wmcs.toolforge.grid: get also the job logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841930
[16:05:03] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.provision: make errors more explicit [cookbooks] - 10https://gerrit.wikimedia.org/r/841938 (owner: 10Volans)
[16:05:26] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.reimage: increase Netbox polling [cookbooks] - 10https://gerrit.wikimedia.org/r/841943 (owner: 10Volans)
[16:05:28] <wikibugs>	 (03PS1) 10Elukey: Deploy Istio 1.9.5-6 Docker images to the ML clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/841944 (https://phabricator.wikimedia.org/T320468)
[16:10:06] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/840585
[16:12:00] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Deploy Istio 1.9.5-6 Docker images to the ML clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/841944 (https://phabricator.wikimedia.org/T320468) (owner: 10Elukey)
[16:19:04] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED
[16:22:24] <wikibugs>	 (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/840584 (owner: 10PipelineBot)
[16:22:37] <wikibugs>	 (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/840583 (owner: 10PipelineBot)
[16:28:28] <wikibugs>	 (03PS1) 10Elukey: ml-services: update Docker images after code refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/841947 (https://phabricator.wikimedia.org/T320374)
[16:33:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[16:33:59] <wikibugs>	 10SRE, 10Traffic, 10observability: ATS Request Error Ratio SLI shows negative values - https://phabricator.wikimedia.org/T320615 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez {F35564906}
[16:34:42] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841966
[16:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:49:44] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH)
[16:51:36] <wikibugs>	 (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/840585 (owner: 10PipelineBot)
[16:52:03] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841966 (owner: 10PipelineBot)
[16:55:05] <logmsgbot>	 !log dduvall@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply
[16:55:07] <logmsgbot>	 !log dduvall@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply
[16:55:22] <wikibugs>	 (03PS1) 10JHathaway: otrs_aliases.py: add postfix support [puppet] - 10https://gerrit.wikimedia.org/r/841950
[16:55:37] <logmsgbot>	 !log dduvall@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply
[16:55:39] <logmsgbot>	 !log dduvall@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply
[16:55:53] <wikibugs>	 (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/841950 (owner: 10JHathaway)
[16:56:00] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/841950 (owner: 10JHathaway)
[16:57:30] <wikibugs>	 (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841966 (owner: 10PipelineBot)
[17:00:22] <logmsgbot>	 !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED
[17:00:34] <logmsgbot>	 !log dduvall@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply
[17:02:15] <logmsgbot>	 !log dduvall@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply
[17:03:06] <wikibugs>	 (03PS4) 10Dzahn: vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059)
[17:03:30] <wikibugs>	 (03PS1) 10Ssingh: cp4045: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/841952 (https://phabricator.wikimedia.org/T319067)
[17:05:29] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "Class[Profile::Mariadb::Generic_server]: has no parameter named 'ensure'" [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) (owner: 10Dzahn)
[17:05:34] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ssingh) Thanks to @MoritzMuehlenhoff and @Volans for their help in resolving the buster Linux 5.10 issue!  ` sukhe@cp4045:~$ uname -r 5.10.0-0.deb10.17-a...
[17:06:32] <logmsgbot>	 !log dduvall@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply
[17:06:55] <logmsgbot>	 !log dduvall@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply
[17:07:08] <logmsgbot>	 !log dduvall@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply
[17:07:20] <wikibugs>	 (03PS10) 10Btullis: Add a new production images for spark and spark-operator [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730)
[17:07:47] <logmsgbot>	 !log dduvall@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply
[17:08:36] <wikibugs>	 (03CR) 10Btullis: Add a new production images for spark and spark-operator (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[17:08:41] <wikibugs>	 (03PS5) 10Dzahn: vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059)
[17:09:01] <wikibugs>	 (03PS2) 10Ssingh: cp4045: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/841952 (https://phabricator.wikimedia.org/T319067)
[17:09:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) (owner: 10Dzahn)
[17:14:08] <wikibugs>	 10SRE, 10Discovery-Search, 10serviceops, 10serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Clement_Goubert) a:05Dzahn→03Clement_Goubert Just for clarification, we are talking about the service named `apple-search` in service discovery...
[17:14:33] <wikibugs>	 (03PS6) 10Dzahn: vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059)
[17:16:59] <wikibugs>	 (03PS7) 10Dzahn: vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059)
[17:17:13] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37520/otrs1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) (owner: 10Dzahn)
[17:17:36] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) (owner: 10Dzahn)
[17:26:11] <wikibugs>	 (03PS3) 10Ssingh: cp4045: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/841952 (https://phabricator.wikimedia.org/T319067)
[17:30:17] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] cp4045: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/841952 (https://phabricator.wikimedia.org/T319067) (owner: 10Ssingh)
[17:31:27] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37522/gitlab-runner1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/841910 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto)
[17:32:39] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "ferm refreshed on gitlab-runner1003, no issues" [puppet] - 10https://gerrit.wikimedia.org/r/841910 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto)
[17:35:31] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Openstack Neutron: Expose the Neutron public API [puppet] - 10https://gerrit.wikimedia.org/r/838907 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott)
[17:35:37] <wikibugs>	 (03PS3) 10Andrew Bogott: Openstack Neutron: Expose the Neutron public API [puppet] - 10https://gerrit.wikimedia.org/r/838907 (https://phabricator.wikimedia.org/T319312)
[17:35:42] <wikibugs>	 (03PS3) 10Andrew Bogott: Openstack Designate: Expose the Designate public API [puppet] - 10https://gerrit.wikimedia.org/r/838908 (https://phabricator.wikimedia.org/T319312)
[17:39:39] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] alerts.downtime_host: attempt to match alert hostnames with :<port> [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott)
[17:39:55] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Openstack Designate: Expose the Designate public API [puppet] - 10https://gerrit.wikimedia.org/r/838908 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott)
[17:46:24] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37523/gitlab-runner1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/841912 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto)
[17:49:33] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] cp4045: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/841952 (https://phabricator.wikimedia.org/T319067) (owner: 10Ssingh)
[17:53:50] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "/etc/ferm/conf.d/18_docker-allow-webproxy-codw-http  and others have been created, ferm was refreshed, saw no issues. on gitlab-runner1003" [puppet] - 10https://gerrit.wikimedia.org/r/841912 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto)
[17:57:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:57:46] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2012 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:00:04] <jouncebot>	 dduvall and ^demon: Dear deployers, time to do the Train log triage with CPT deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1800).
[18:00:05] <jouncebot>	 dduvall and ^demon: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1800).
[18:02:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:02:40] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2012 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:03:45] <logmsgbot>	 !log dduvall@deploy1002 deploy-promote aborted:  (duration: 00m 07s)
[18:03:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[18:03:55] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841957 (https://phabricator.wikimedia.org/T314194)
[18:03:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841957 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot)
[18:04:43] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841957 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot)
[18:08:28] <icinga-wm>	 PROBLEM - Host mw1314.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:08:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[18:09:05] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.5  refs T314194
[18:09:10] <stashbot>	 T314194: 1.40.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T314194
[18:12:25] <wikibugs>	 (03PS1) 10Cwhite: logstash: drop noisy envoy deprecation warning [puppet] - 10https://gerrit.wikimedia.org/r/841967 (https://phabricator.wikimedia.org/T320468)
[18:12:43] <logmsgbot>	 !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.5  refs T314194 (duration: 03m 38s)
[18:16:33] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: drop noisy envoy deprecation warning [puppet] - 10https://gerrit.wikimedia.org/r/841967 (https://phabricator.wikimedia.org/T320468) (owner: 10Cwhite)
[18:22:40] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:22:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[18:24:55] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Engineering: Check analytics1086 mgmt's cable - https://phabricator.wikimedia.org/T320458 (10Jclark-ctr) a:03Jclark-ctr
[18:25:07] <wikibugs>	 (03PS1) 10Zabe: Set $wgSitename for bnwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841961 (https://phabricator.wikimedia.org/T319183)
[18:25:45] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Engineering: Check analytics1086 mgmt's cable - https://phabricator.wikimedia.org/T320458 (10Jclark-ctr) @BTullis @elukey The management switch failed today and was replaced. Can you verify whether it is still not working for you?
[18:26:54] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS buster
[18:27:03] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster
[18:29:21] <wikibugs>	 (03CR) 10Yahya: [C: 03+1] Set $wgSitename for bnwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841961 (https://phabricator.wikimedia.org/T319183) (owner: 10Zabe)
[18:32:42] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2012 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:40:26] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4045.ulsfo.wmnet with OS buster
[18:40:33] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster executed with errors: - cp...
[18:40:48] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:41:52] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS buster
[18:42:01] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster
[18:47:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[18:49:25] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 04-1] "nits:" [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[18:52:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[18:54:20] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:55:10] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (10KFrancis) @ayounsi I am confirming Manuel Merz has an NDA on file.  Please proceed with the access request.  Thanks!
[18:55:24] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:59:28] <wikibugs>	 (03PS1) 10Cwhite: logstash: expand filter to drop more envoy deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/841968 (https://phabricator.wikimedia.org/T320468)
[19:00:07] <wikibugs>	 (03CR) 10Hashar: Send events to Wikimedia EventGate (036 comments) [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar)
[19:02:32] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: expand filter to drop more envoy deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/841968 (https://phabricator.wikimedia.org/T320468) (owner: 10Cwhite)
[19:02:37] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:04:14] <rzl>	 ^ looking
[19:04:53] <rzl>	 change in fundraising redirect behavior, I'll make sure it's intended and then update the tests
[19:06:09] <sukhe>	 thanks! seemed like a simple test failure and hence I didn't bother to ping
[19:06:21] <wikibugs>	 (03PS2) 10Stang: Re-download and optimize wordmark/tagline svg file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841942 (https://phabricator.wikimedia.org/T307705)
[19:07:31] <rzl>	 yeah for sure, no urgency
[19:12:41] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 04-1] "A few preliminary comments, nothing major!" [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[19:15:46] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4045.ulsfo.wmnet with OS buster
[19:15:54] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster executed with errors: - cp...
[19:16:15] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS buster
[19:16:23] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster
[19:21:21] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841969
[19:21:22] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841970
[19:21:59] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:23:47] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:25:42] <wikibugs>	 (03PS1) 10Stang: yiwiktionary: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841992 (https://phabricator.wikimedia.org/T310961)
[19:29:01] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4045.ulsfo.wmnet with OS buster
[19:29:09] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster executed with errors: - cp...
[19:31:07] <wikibugs>	 (03CR) 10Hashar: Send events to Wikimedia EventGate (033 comments) [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar)
[19:37:03] <icinga-wm>	 PROBLEM - SSH on mw1328.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:38:12] <rzl>	 frtech confirms those httpbb failures are catching an expected change that just went out with 1.40.0-wmf.5, so I'll update the asserts
[19:47:45] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:49:34] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:50:52] <wikibugs>	 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-35), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10Eevans) >>! In T317417#8280934, @MusikAnimal wrote: >>>! In T317417#8280822, @Eevans wro...
[19:51:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10serviceops-collab: Q2:rack/setup/install webperf1005.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Dzahn)
[19:51:34] <wikibugs>	 10SRE, 10serviceops: service implementation tracking: webperf1005.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10Dzahn) 05Open→03Stalled
[19:52:18] <wikibugs>	 10SRE, 10serviceops: service implementation tracking: webperf2005.codfw.wmnet - https://phabricator.wikimedia.org/T319429 (10Dzahn) 05Open→03Stalled
[19:52:22] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install webperf2005.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Dzahn)
[19:53:24] <wikibugs>	 (03PS3) 10Samtar: Register the editattempt_block schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833442 (https://phabricator.wikimedia.org/T310390) (owner: 10DLynch)
[19:58:29] <TheresNoTime>	 (a little early but) I can deploy! :D
[19:59:19] * bd808 looks at clock, looks at TheresNoTime, looks at timezone map, looks away ;)
[19:59:42] <Kemayo>	 TheresNoTime: It's not one I can test anything about, so if it merges and doesn't immediately cause errors it can go out.
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T2000).
[20:00:05] <jouncebot>	 kemayo, zabe, duesen, and koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:07] <TheresNoTime>	 Kemayo: awesome, I'll wait for the window to start proper but will do yours first
[20:00:09] <TheresNoTime>	 oh, there :D
[20:00:16] <zabe>	 o/
[20:00:21] <koi>	 o/
[20:00:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833442 (https://phabricator.wikimedia.org/T310390) (owner: 10DLynch)
[20:00:47] <TheresNoTime>	 o/
[20:01:21] <duesen>	 o/
[20:02:01] <duesen>	 Can someone clarify whether config deployments for beta need scap? 
[20:02:18] <wikibugs>	 (03Merged) 10jenkins-bot: Register the editattempt_block schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833442 (https://phabricator.wikimedia.org/T310390) (owner: 10DLynch)
[20:02:48] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:833442|Register the editattempt_block schema (T310390)]]
[20:02:53] <stashbot>	 T310390: Instrument blocked edit attempts - https://phabricator.wikimedia.org/T310390
[20:03:11] <logmsgbot>	 !log samtar@deploy1002 samtar and kemayo: Backport for [[gerrit:833442|Register the editattempt_block schema (T310390)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[20:03:16] <TheresNoTime>	 duesen: not really, they end up on the beta cluster automagically after they're +2'd
[20:03:56] <dancy>	 duesen: You should use `scap backport` on beta-only config changes to ensure that they get pulled down to the deploy server (to avoid an alert). They won't be synced.
[20:03:57] <duesen>	 Ok. But I guess it's still good to scap, since otherwise, files get out of whack on the prod servers, even if they aren't used there...
[20:04:02] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10Jclark-ctr) 05Open→03Resolved
[20:04:17] <TheresNoTime>	 Kemayo: syncing 833442, nothing broken afaics
[20:04:26] <Kemayo>	 TheresNoTime: great, thanks!
[20:04:36] <duesen>	 dancy: oh, they won't be synced? are they excluded somehow?
[20:05:07] <dancy>	 well, they will eventually be synced during a subsequent sync that someone else might run
[20:05:23] <dancy>	 but `scap backport` will skip a needless sync if it detects a beta-only change.
[20:05:33] <duesen>	 magic...
[20:05:54] <TheresNoTime>	 zabe: your patch will be next FYI
[20:06:16] <wikibugs>	 (03PS2) 10Samtar: Set $wgSitename for bnwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841961 (https://phabricator.wikimedia.org/T319183) (owner: 10Zabe)
[20:07:49] <TheresNoTime>	 duesen: are you wanting to self-deploy? :)
[20:08:20] <duesen>	 TheresNoTime: yea, I want to try the new magic thingy :)
[20:08:30] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:833442|Register the editattempt_block schema (T310390)]] (duration: 05m 42s)
[20:08:35] <stashbot>	 T310390: Instrument blocked edit attempts - https://phabricator.wikimedia.org/T310390
[20:08:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841961 (https://phabricator.wikimedia.org/T319183) (owner: 10Zabe)
[20:09:10] <TheresNoTime>	 duesen: :D I'll just get the ones ahead of you done then it'll be all yours
[20:09:28] <wikibugs>	 (03Merged) 10jenkins-bot: Set $wgSitename for bnwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841961 (https://phabricator.wikimedia.org/T319183) (owner: 10Zabe)
[20:09:31] <duesen>	 TheresNoTime: ok, let me know. 
[20:09:36] <TheresNoTime>	 will do
[20:09:55] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:841961|Set $wgSitename for bnwikiquote (T319183)]]
[20:09:59] <stashbot>	 T319183: Create Wikiquote Bengali - https://phabricator.wikimedia.org/T319183
[20:10:18] <logmsgbot>	 !log samtar@deploy1002 samtar and zabe: Backport for [[gerrit:841961|Set $wgSitename for bnwikiquote (T319183)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[20:10:20] <TheresNoTime>	 zabe: live on mwdebug, can you test?
[20:10:31] <zabe>	 TheresNoTime, lgtm
[20:10:34] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841971
[20:10:40] <TheresNoTime>	 syncin'
[20:11:03] <duesen>	 TheresNoTime: I still need to be in the correct directory when doing the scap, right?
[20:11:27] <TheresNoTime>	 duesen: I don't think so, but I change to it out of habit anyway
[20:11:51] <duesen>	 i see
[20:12:16] <TheresNoTime>	 (a helpful answer, I know!)
[20:14:36] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:841961|Set $wgSitename for bnwikiquote (T319183)]] (duration: 04m 40s)
[20:14:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829764 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang)
[20:14:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841942 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang)
[20:15:11] <TheresNoTime>	 koi: just doing yours now :)
[20:15:49] <wikibugs>	 (03Merged) 10jenkins-bot: Drop unused wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829764 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang)
[20:16:09] <wikibugs>	 (03Merged) 10jenkins-bot: Re-download and optimize wordmark/tagline svg file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841942 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang)
[20:16:24] <koi>	 TheresNoTime: I thought these two patches don't need to be tested, so you could sync directly
[20:16:34] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:829764|Drop unused wordmark/tagline (T307705)]], [[gerrit:841942|Re-download and optimize wordmark/tagline svg file (T307705)]]
[20:16:39] <stashbot>	 T307705: Extend mw-config's logos management system to also cover wordmarks (wmgSiteLogoWordmark) - https://phabricator.wikimedia.org/T307705
[20:16:39] <zabe>	 Thanks sammy :)
[20:16:57] <logmsgbot>	 !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:829764|Drop unused wordmark/tagline (T307705)]], [[gerrit:841942|Re-download and optimize wordmark/tagline svg file (T307705)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[20:16:58] <TheresNoTime>	 koi: okay :)
[20:19:16] <wikibugs>	 (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841969 (owner: 10PipelineBot)
[20:19:21] <wikibugs>	 (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841970 (owner: 10PipelineBot)
[20:19:26] <wikibugs>	 (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841971 (owner: 10PipelineBot)
[20:20:36] <wikibugs>	 (03PS1) 10JHathaway: add dummy mysql password for postfix [labs/private] - 10https://gerrit.wikimedia.org/r/842002
[20:21:27] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:829764|Drop unused wordmark/tagline (T307705)]], [[gerrit:841942|Re-download and optimize wordmark/tagline svg file (T307705)]] (duration: 04m 53s)
[20:21:35] <TheresNoTime>	 koi: all done :)
[20:21:38] <TheresNoTime>	 duesen: all yours!
[20:22:05] <duesen>	 ok, let me see...
[20:23:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535) (owner: 10Daniel Kinzler)
[20:23:27] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] add dummy mysql password for postfix [labs/private] - 10https://gerrit.wikimedia.org/r/842002 (owner: 10JHathaway)
[20:23:30] <wikibugs>	 (03CR) 10JHathaway: [V: 03+2 C: 03+2] add dummy mysql password for postfix [labs/private] - 10https://gerrit.wikimedia.org/r/842002 (owner: 10JHathaway)
[20:23:47] <wikibugs>	 (03Merged) 10jenkins-bot: Beta: Enable parsoid cache warming. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535) (owner: 10Daniel Kinzler)
[20:26:28] <duesen>	 checking that beta didn't explode...
[20:27:31] <TheresNoTime>	 T320535 looks pretty interesting..
[20:27:32] <stashbot>	 T320535: Put Parsoid output into the ParserCache on the beta cluster and testwiki - https://phabricator.wikimedia.org/T320535
[20:28:03] <duesen>	 ok, looking good.
[20:28:07] <duesen>	 moving on to the next one
[20:28:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841858 (https://phabricator.wikimedia.org/T320531) (owner: 10Daniel Kinzler)
[20:28:43] <duesen>	 hrrmm... Gerrit could not merge the change '841858' as is and could require a rebase
[20:28:59] <wikibugs>	 (03PS3) 10Daniel Kinzler: Beta: Switch VE on dewiki to direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841858 (https://phabricator.wikimedia.org/T320531)
[20:29:08] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841858 (https://phabricator.wikimedia.org/T320531) (owner: 10Daniel Kinzler)
[20:29:48] <wikibugs>	 (03Merged) 10jenkins-bot: Beta: Switch VE on dewiki to direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841858 (https://phabricator.wikimedia.org/T320531) (owner: 10Daniel Kinzler)
[20:30:25] <duesen>	 Testing VE on dewiki beta
[20:31:10] <koi>	 TheresNoTime: I posted one more patch, could you please deploy that? thanks
[20:31:15] <TheresNoTime>	 duesen: ah, you'll need to wait for https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/413018/console (and the associated `beta-scap-sync-world` job) to finish
[20:31:30] <TheresNoTime>	 koi: sure, will do it after ^ :)
[20:32:13] <duesen>	 Looking good.
[20:32:17] <duesen>	 ok, all done! Thank you!
[20:32:42] <TheresNoTime>	 koi: which patch? :)
[20:32:58] <TheresNoTime>	 ah, 841992
[20:33:04] <koi>	 yep
[20:33:04] <wikibugs>	 (03PS3) 10Samtar: yiwiktionary: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841992 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang)
[20:33:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[20:34:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841992 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang)
[20:34:35] <TheresNoTime>	 koi: assume you will be able to test this one?
[20:34:44] <koi>	 yeah, I'll test this one
[20:34:48] <wikibugs>	 (03Merged) 10jenkins-bot: yiwiktionary: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841992 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang)
[20:35:14] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:841992|yiwiktionary: Adjust width-height ratio of logo to fix display issue (T310961)]]
[20:35:19] <stashbot>	 T310961: Site logo cropped/not fully displayed on some projects - https://phabricator.wikimedia.org/T310961
[20:35:38] <logmsgbot>	 !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:841992|yiwiktionary: Adjust width-height ratio of logo to fix display issue (T310961)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[20:35:39] <TheresNoTime>	 koi: live on mwdebug :)
[20:36:23] <koi>	 TheresNoTime: new logo LGTM
[20:36:31] <TheresNoTime>	 syncin'
[20:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[20:38:17] <icinga-wm>	 RECOVERY - SSH on mw1328.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:40:31] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:841992|yiwiktionary: Adjust width-height ratio of logo to fix display issue (T310961)]] (duration: 05m 17s)
[20:40:36] <stashbot>	 T310961: Site logo cropped/not fully displayed on some projects - https://phabricator.wikimedia.org/T310961
[20:40:38] <TheresNoTime>	 all done
[20:41:09] <TheresNoTime>	 !log closing UTC late backport window
[20:41:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:10] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10KFrancis) Hi all, I just received this request.  Arian Bozorg does not yet have an NDA on file.  I will work on the agreement and let you know when it's complete.  Thanks!
[20:45:05] <wikibugs>	 SRE, LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (KFrancis) @Arian_Bozorg Please send me your WMDE email address to kfrancis@wikimedia.org as soon as possible.  Thanks!
[20:54:34] <jinxer-wm>	 (CertAlmostExpired) resolved: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[20:59:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[21:02:26] <wikibugs>	 (PS2) Cwhite: logstash: heavily sample k8s proxy/httpd logs [puppet] - https://gerrit.wikimedia.org/r/831626 (https://phabricator.wikimedia.org/T313099)
[21:06:10] <cwhite>	 !log clean up old db backups on grafana2001
[21:06:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:50] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: heavily sample k8s proxy/httpd logs [puppet] - https://gerrit.wikimedia.org/r/831626 (https://phabricator.wikimedia.org/T313099) (owner: Cwhite)
[21:27:19] <wikibugs>	 (PS1) Stang: logos: Document how to update wordmark/tagline via manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/842010 (https://phabricator.wikimedia.org/T307705)
[21:28:16] <wikibugs>	 (PS2) Stang: logos: Document how to update wordmark/tagline via manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/842010 (https://phabricator.wikimedia.org/T307705)
[21:45:39] <wikibugs>	 (PS1) RLazarus: httpbb: Update Special:FundraiserRedirector tests for new behavior [puppet] - https://gerrit.wikimedia.org/r/842013
[21:48:05] <wikibugs>	 (CR) RLazarus: "Tested:" [puppet] - https://gerrit.wikimedia.org/r/842013 (owner: RLazarus)
[21:52:51] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[21:57:23] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[22:49:42] <wikibugs>	 (PS1) Tim Starling: Migrate to PHP 7.4 case mapping, but retain Georgian overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/842019 (https://phabricator.wikimedia.org/T292552)
[22:51:33] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:59:53] <icinga-wm>	 PROBLEM - SSH on ms-be1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:00:44] <wikibugs>	 (PS1) BryanDavis: buster: Fix image build failures found on 2022-10-12 [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/842020
[23:07:48] <wikibugs>	 (CR) BryanDavis: [C: +2] buster: Fix image build failures found on 2022-10-12 [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/842020 (owner: BryanDavis)
[23:08:09] <wikibugs>	 (PS2) BryanDavis: mono68-sssd: New image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/840327 (https://phabricator.wikimedia.org/T311466) (owner: Majavah)
[23:08:27] <wikibugs>	 (Merged) jenkins-bot: buster: Fix image build failures found on 2022-10-12 [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/842020 (owner: BryanDavis)
[23:11:16] <wikibugs>	 (CR) BryanDavis: [C: +2] mono68-sssd: New image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/840327 (https://phabricator.wikimedia.org/T311466) (owner: Majavah)
[23:12:22] <wikibugs>	 (Merged) jenkins-bot: mono68-sssd: New image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/840327 (https://phabricator.wikimedia.org/T311466) (owner: Majavah)
[23:15:14] <wikibugs>	 (PS2) BryanDavis: toollabs-images: refresh toolforge repository URL [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/675823 (https://phabricator.wikimedia.org/T278436) (owner: Arturo Borrero Gonzalez)
[23:17:42] <wikibugs>	 (CR) BryanDavis: [C: +2] toollabs-images: refresh toolforge repository URL [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/675823 (https://phabricator.wikimedia.org/T278436) (owner: Arturo Borrero Gonzalez)
[23:18:18] <wikibugs>	 (Merged) jenkins-bot: toollabs-images: refresh toolforge repository URL [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/675823 (https://phabricator.wikimedia.org/T278436) (owner: Arturo Borrero Gonzalez)
[23:28:31] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:30:13] <wikibugs>	 (Abandoned) BryanDavis: [WIP] Install yj in buster0 stack [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/637199 (https://phabricator.wikimedia.org/T266716) (owner: Legoktm)
[23:52:43] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook