[00:03:25] PROBLEM - Host cloudcephmon1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[00:03:55] PROBLEM - Host cloudvirt1022.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[00:04:07] PROBLEM - Host cp1081.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[00:06:19] PROBLEM - Host lvs1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[00:09:34] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[00:19:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:20:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:26:35] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:27:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.227 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:30:15] RECOVERY - Host cloudcephmon1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.08 ms
[00:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[00:37:33] (CR) Ori: [C: +1] systemd::override: Add new helper define for overrides [puppet] - https://gerrit.wikimedia.org/r/841577 (owner: Jbond)
[00:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:39:05] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:45:59] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:36:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:38] PROBLEM - Host ores1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:46:42] PROBLEM - Host mw1316.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:46:44] PROBLEM - Host mw1314.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:46:44] PROBLEM - Host an-worker1098.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:46:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:52] PROBLEM - Host analytics1073.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:48:08] PROBLEM - Host ps1-b7-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[01:48:16] PROBLEM - Host ms-be1041.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:48:18] PROBLEM - Host ms-be1053.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:49:42] PROBLEM - Host clouddumps1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:49:43] PROBLEM - Host cloudcephosd1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:49:48] PROBLEM - Host cloudvirt1017.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:49:52] PROBLEM - Host cloudvirt1020.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:50:10] PROBLEM - Host cp1082.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:50:12] PROBLEM - Host elastic1086.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:50:12] PROBLEM - Host elastic1085.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:50:32] PROBLEM - Host dbprov1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:50:46] PROBLEM - Host restbase-dev1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:02] PROBLEM - Host an-worker1087.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:12] PROBLEM - Host an-worker1130.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:36] PROBLEM - Host kafka-main1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:56] PROBLEM - Host clouddb1016.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:52:08] PROBLEM - Host lvs1018.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:52:10] PROBLEM - Host mw1313.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:52:10] PROBLEM - Host mw1315.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:03:07] RECOVERY - Host kafka-main1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms
[02:03:37] RECOVERY - Host lvs1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.09 ms
[02:06:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:21:52] (CR) Cwhite: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/840145 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[02:29:51] RECOVERY - Host cloudcephosd1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.07 ms
[02:30:25] RECOVERY - Host elastic1085.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.10 ms
[02:30:25] RECOVERY - Host elastic1086.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.12 ms
[02:30:31] RECOVERY - Host dbprov1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.10 ms
[02:30:55] RECOVERY - Host an-worker1130.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms
[02:31:13] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:04:07] (CR) Andrew Bogott: [C: +2] Openstack Nova: Expose the Nova public API [puppet] - https://gerrit.wikimedia.org/r/838904 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[03:04:24] (PS3) Andrew Bogott: Openstack Nova: Expose the Nova public API [puppet] - https://gerrit.wikimedia.org/r/838904 (https://phabricator.wikimedia.org/T319312)
[03:20:59] (CR) Andrew Bogott: [C: +2] Openstack Glance: Expose the Glance public API [puppet] - https://gerrit.wikimedia.org/r/838905 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[03:21:07] (PS3) Andrew Bogott: Openstack Glance: Expose the Glance public API [puppet] - https://gerrit.wikimedia.org/r/838905 (https://phabricator.wikimedia.org/T319312)
[03:21:22] (PS3) Andrew Bogott: Openstack Cinder: Expose the Cinder public API [puppet] - https://gerrit.wikimedia.org/r/838906 (https://phabricator.wikimedia.org/T319312)
[03:24:32] (CR) Andrew Bogott: [C: +2] Openstack Cinder: Expose the Cinder public API [puppet] - https://gerrit.wikimedia.org/r/838906 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[04:02:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[04:07:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[04:09:34] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[04:36:05] PROBLEM - DNS on lvs1018.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.209 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:39:45] PROBLEM - DNS on elastic1086.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.223 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:39:45] PROBLEM - DNS on elastic1085.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.222 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:46:51] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:51:43] PROBLEM - DNS on an-worker1130.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.0.156 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:56:53] PROBLEM - DNS on kafka-main1002.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.3.130 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:04:23] PROBLEM - DNS on dbprov1002.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.3.18 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:34:57] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:16:51] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 4826
[06:17:57] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 4826
[06:36:07] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:57:47] (PS2) Muehlenhoff: opensearch: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/838833 (https://phabricator.wikimedia.org/T308013)
[07:00:05] Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T0700). Please do the needful.
[07:00:05] matthiasmullie: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:08] o/
[07:00:14] o/
[07:00:28] i can deploy, unless matthiasmullie wants to self-serve?
[07:00:43] either works for me :p
[07:01:31] matthiasmullie: go ahead then :D
[07:01:39] starting!
[07:01:55] (CR) Muehlenhoff: [C: +2] opensearch: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/838833 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:01:57] matthiasmullie: fyi, we've a new deployment tool. `scap backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/841515` will take care of everything for you
[07:01:59] oh, right, new scap scripts!
[07:02:25] yep yep
[07:02:39] (CR) TrainBranchBot: [C: +2] "Approved by mlitn@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841515 (https://phabricator.wikimedia.org/T320406) (owner: Matthias Mullie)
[07:03:19] (PS2) Muehlenhoff: maps: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840139 (https://phabricator.wikimedia.org/T308013)
[07:05:08] urbanecm: out of curiosity - I notice scap now handles merging the patch as well; what happens with patches that are already merged, or already +2ed and being merged soon?
[07:05:38] SRE, ops-eqiad, Data-Engineering: Check analytics1086 mgmt's cable - https://phabricator.wikimedia.org/T320458 (elukey) Same thing this morning: ` elukey@cumin1001:~$ sudo ipmitool -I lanplus -H "an-worker1086.mgmt.eqiad.wmnet" -U root -E chassis power status Unable to read password from environment...
[07:05:48] Asking because some repos have CI that takes forever and I often +2ed half an hour in advance so it doesn't take up most of the deployment window
[07:09:07] PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:19:14] (Merged) jenkins-bot: Rescale images based on width alone [core] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841515 (https://phabricator.wikimedia.org/T320406) (owner: Matthias Mullie)
[07:19:49] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:841515|Rescale images based on width alone (T320406)]]
[07:19:54] T320406: Thumbnails on SpecialSearch may fail to load - https://phabricator.wikimedia.org/T320406
[07:20:19] !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:841515|Rescale images based on width alone (T320406)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[07:25:09] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:841515|Rescale images based on width alone (T320406)]] (duration: 05m 19s)
[07:25:14] T320406: Thumbnails on SpecialSearch may fail to load - https://phabricator.wikimedia.org/T320406
[07:25:46] !log UTC morning backports done
[07:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:38:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:40:58] (CR) JMeybohm: [C: -1] Add a new production images for spark and spark-operator (9 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[07:46:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:51:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:55:05] (CR) JMeybohm: [C: +1] "Sounds right." [docker-images/production-images] - https://gerrit.wikimedia.org/r/841477 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[07:59:10] SRE, GitLab, Infrastructure-Foundations, serviceops-collab, CAS-SSO: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto)
[08:01:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of testvm2001.codfw.wmnet to plain
[08:02:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of testvm2001.codfw.wmnet to plain
[08:02:34] (PS2) Muehlenhoff: logstash: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840145 (https://phabricator.wikimedia.org/T308013)
[08:04:57] (CR) CI reject: [V: -1] logstash: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840145 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:07:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of testvm2001.codfw.wmnet to drbd
[08:08:09] (PS1) Filippo Giunchedi: hieradata: clean up ganeti4001 references [puppet] - https://gerrit.wikimedia.org/r/841853
[08:09:34] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[08:11:07] (CR) WMDE-Fisch: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/841854 (https://phabricator.wikimedia.org/T316782) (owner: WMDE-Fisch)
[08:16:05] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of testvm2001.codfw.wmnet to drbd
[08:18:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagetcd1004.eqiad.wmnet to drbd
[08:26:09] (PS1) Vgutierrez: trafficserver: Partition cache in one server per DC and cluster #5 [puppet] - https://gerrit.wikimedia.org/r/841856 (https://phabricator.wikimedia.org/T317748)
[08:27:59] (Abandoned) Muehlenhoff: Switch profile::base::linux510 to the new meta package [puppet] - https://gerrit.wikimedia.org/r/841134 (https://phabricator.wikimedia.org/T319067) (owner: Muehlenhoff)
[08:28:21] (Abandoned) Muehlenhoff: Make ganeti1032 a ganeti node [puppet] - https://gerrit.wikimedia.org/r/841127 (https://phabricator.wikimedia.org/T299459) (owner: Muehlenhoff)
[08:28:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagetcd1004.eqiad.wmnet to drbd
[08:30:30] (CR) Muehlenhoff: [C: +1] "LGTM. 4003 will also be taken down soonish, but by then a replacement host in this rack should be present." [puppet] - https://gerrit.wikimedia.org/r/841853 (owner: Filippo Giunchedi)
[08:31:26] (CR) Filippo Giunchedi: [C: +2] "*nod* thanks for the quick review!" [puppet] - https://gerrit.wikimedia.org/r/841853 (owner: Filippo Giunchedi)
[08:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[08:33:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to drbd
[08:34:26] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37512/console" [puppet] - https://gerrit.wikimedia.org/r/841856 (https://phabricator.wikimedia.org/T317748) (owner: Vgutierrez)
[08:36:57] (CR) Vgutierrez: [V: +1 C: +2] trafficserver: Partition cache in one server per DC and cluster #5 [puppet] - https://gerrit.wikimedia.org/r/841856 (https://phabricator.wikimedia.org/T317748) (owner: Vgutierrez)
[08:37:27] ops-eqiad, Infrastructure-Foundations, netops: Investigate issue with msw-b7-eqiad - https://phabricator.wikimedia.org/T320598 (cmooney) p:Triage→Medium
[08:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:38:48] ops-eqiad, Infrastructure-Foundations, netops: Investigate issue with msw-b7-eqiad - https://phabricator.wikimedia.org/T320598 (cmooney)
[08:42:28] (CR) Awight: [C: +1] Enable show nearby feature on a small group of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/841854 (https://phabricator.wikimedia.org/T316782) (owner: WMDE-Fisch)
[08:43:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to drbd
[08:43:41] PROBLEM - Host dse-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:43:59] RECOVERY - Host dse-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms
[08:48:57] PROBLEM - Check systemd state on dse-k8s-etcd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:49:31] SRE, Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (MoritzMuehlenhoff)
[08:50:22] ACKNOWLEDGEMENT - SSH on restbase-dev1005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:50:04. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:50:22] ACKNOWLEDGEMENT - Host restbase-dev1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:50:04.
[08:50:22] ACKNOWLEDGEMENT - ps1-b7-eqiad-infeed-load-tower-B-phase-Z on ps1-b7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:50:04. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:50:22] ACKNOWLEDGEMENT - ps1-b7-eqiad-infeed-load-tower-B-phase-Y on ps1-b7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:50:04. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:50:22] ACKNOWLEDGEMENT - ps1-b7-eqiad-infeed-load-tower-B-phase-X on ps1-b7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:50:04. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:50:22] ACKNOWLEDGEMENT - ps1-b7-eqiad-infeed-load-tower-A-phase-Z on ps1-b7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:50:04. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:52:02] !log partitioning the ATS cache in cp[2033-2034], cp[6003,6011], cp[1081-1082], cp[5004,5010], cp[3056-3057], cp[4024,4028] - T317748
[08:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:07] T317748: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748
[08:52:11] ACKNOWLEDGEMENT - DNS on elastic1085.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.222 ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:51:48. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:52:11] ACKNOWLEDGEMENT - DNS on elastic1086.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.223 ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:51:48. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:52:11] ACKNOWLEDGEMENT - DNS on kafka-main1002.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.3.130 ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:51:48. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:52:11] ACKNOWLEDGEMENT - DNS on lvs1018.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.209 ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:51:48. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:52:11] ACKNOWLEDGEMENT - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T320598 - The acknowledgement expires at: 2022-10-14 08:51:48. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:53:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagetcd1004.eqiad.wmnet to plain
[08:54:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagetcd1004.eqiad.wmnet to plain
[08:54:45] ACKNOWLEDGEMENT - DNS on an-worker1130.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.0.156 ayounsi https://phabricator.wikimedia.org/T320598 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:54:45] ACKNOWLEDGEMENT - DNS on dbprov1002.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.3.18 ayounsi https://phabricator.wikimedia.org/T320598 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:56:26] (CR) JMeybohm: [C: -1] thumbor: new service chart (20 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[08:56:48] _joe_: From my side https://gerrit.wikimedia.org/r/c/operations/puppet/+/841148 is ready to be merged now
[08:57:02] I'm not entirely sure my rebase is correct
[08:58:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to plain
[08:59:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to plain
[09:00:23] <_joe_> hoo: I'll take a look when I have a minute, thanks
[09:01:27] Thanks :)
[09:02:15] RECOVERY - Check systemd state on dse-k8s-etcd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:05:03] (PS2) Urbanecm: SVG resources: Run svgo [mediawiki-config] - https://gerrit.wikimedia.org/r/841187 (https://phabricator.wikimedia.org/T320447)
[09:05:25] jouncebot: nowandnext
[09:05:25] No deployments scheduled for the next 3 hour(s) and 54 minute(s)
[09:05:25] In 3 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1300)
[09:05:28] !log disabling puppet on all kubernetes masters (incl. ml & dse)
[09:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:53] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/841187 (https://phabricator.wikimedia.org/T320447) (owner: Urbanecm)
[09:06:05] (CR) JMeybohm: [V: +1 C: +2] kubernetes::master remove apiserver_count [puppet] - https://gerrit.wikimedia.org/r/841463 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[09:06:11] (CR) JMeybohm: [C: +2] kubernetes::master fail if user tokens are not unique [puppet] - https://gerrit.wikimedia.org/r/841495 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[09:06:21] (PS5) JMeybohm: kubernetes::master fail if user tokens are not unique [puppet] - https://gerrit.wikimedia.org/r/841495 (https://phabricator.wikimedia.org/T307943)
[09:06:38] (Merged) jenkins-bot: SVG resources: Run svgo [mediawiki-config] - https://gerrit.wikimedia.org/r/841187 (https://phabricator.wikimedia.org/T320447) (owner: Urbanecm)
[09:07:00] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:841187|SVG resources: Run svgo (T320447)]]
[09:07:05] T320447: Run svgo for all SVG resources in operations/mediawiki-config - https://phabricator.wikimedia.org/T320447
[09:07:25] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:841187|SVG resources: Run svgo (T320447)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[09:07:54] (PS9) Btullis: Add a new production images for spark and spark-operator [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730)
[09:08:21] (CR) Btullis: Add a new production images for spark and spark-operator (9 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[09:11:25] RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:11:35] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:11:39] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:841187|SVG resources: Run svgo (T320447)]] (duration: 04m 38s)
[09:12:30] !log re-enabled puppet on all kubernetes masters (incl. ml & dse)
[09:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:27] (PS9) Urbanecm: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[09:17:41] (CR) Urbanecm: [C: +2] logos: Cover wordmark/tagline in manage.py (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[09:18:29] (Merged) jenkins-bot: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[09:19:52] (PS2) Urbanecm: Replace wordmark/tagline with correct naming style [mediawiki-config] - https://gerrit.wikimedia.org/r/829561 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[09:20:14] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/829561 (https://phabricator.wikimedia.org/T307705) (owner: Stang)
[09:20:33] Hi, i don't see a dedicated deployment window for restbase. What would be a good time to push a deployment today ?
[09:20:57] (03Merged) 10jenkins-bot: Replace wordmark/tagline with correct naming style [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829561 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [09:21:21] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:829561|Replace wordmark/tagline with correct naming style (T307705)]] [09:21:26] T307705: Extend mw-config's logos management system to also cover wordmarks (wmgSiteLogoWordmark) - https://phabricator.wikimedia.org/T307705 [09:21:44] !log urbanecm@deploy1002 urbanecm and stang: Backport for [[gerrit:829561|Replace wordmark/tagline with correct naming style (T307705)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [09:22:55] (03PS1) 10Daniel Kinzler: Beta: Switch VE on dewiki to direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841858 (https://phabricator.wikimedia.org/T320531) [09:22:59] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (LIST deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:23:16] 10SRE, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10MoritzMuehlenhoff) [09:24:14] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] haproxy: fix apt repository path [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/841477 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [09:24:21] 10SRE, 10Infrastructure-Foundations: Decide on model for serving idm.wikimedia.org - https://phabricator.wikimedia.org/T320604 (10MoritzMuehlenhoff) [09:25:41] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:829561|Replace wordmark/tagline with correct naming style (T307705)]] (duration: 04m 20s) [09:26:18] !log 
draining ganeti1017 T311687 [09:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:22] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [09:27:44] (03CR) 10D3r1ck01: Beta: Switch VE on dewiki to direct mode (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841858 (https://phabricator.wikimedia.org/T320531) (owner: 10Daniel Kinzler) [09:27:58] (KubernetesAPILatency) resolved: (7) High Kubernetes API latency (LIST deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:28:43] 10SRE, 10Infrastructure-Foundations: Figure out an HA setup for the IDM - https://phabricator.wikimedia.org/T320605 (10MoritzMuehlenhoff) [09:30:11] (03PS1) 10Daniel Kinzler: Beta: Enable parsoid cache warming. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535) [09:30:49] (03CR) 10CI reject: [V: 04-1] Beta: Enable parsoid cache warming. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535) (owner: 10Daniel Kinzler) [09:32:05] (03PS2) 10Daniel Kinzler: Beta: Enable parsoid cache warming. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535) [09:38:10] (03PS2) 10Daniel Kinzler: Beta: Switch VE on dewiki to direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841858 (https://phabricator.wikimedia.org/T320531) [09:39:57] (03CR) 10Kosta Harlan: "Hi Mohd and Santhosh, I made this patch as a follow-up from Id265f3ff87a80128c07e824b49f3b972df21e2d2; AIUI this code isn't called (?) 
cur" [extensions/ContentTranslation] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841509 (https://phabricator.wikimedia.org/T319327) (owner: 10Kosta Harlan) [09:41:29] (03CR) 10Mabualruz: [C: 03+1] "Looks good to me" [extensions/ContentTranslation] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841509 (https://phabricator.wikimedia.org/T319327) (owner: 10Kosta Harlan) [09:51:03] 10SRE, 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10Clement_Goubert) Hi, I'll be your SRE support for today, and will handle de/repooling, destroying th... [09:52:57] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:01:17] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [10:01:50] (03CR) 10Btullis: Add a new production images for spark and spark-operator (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [10:01:54] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [10:08:34] (03CR) 10Giuseppe Lavagetto: [C: 03+1] sre: issue confd per-template alerts [alerts] - 10https://gerrit.wikimedia.org/r/841549 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi) [10:08:55] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: issue confd per-template alerts [alerts] - 10https://gerrit.wikimedia.org/r/841549 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi) [10:13:21] 10SRE, 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10Clement_Goubert) Destroy/apply done in staging: 
` # helmfile -e staging status helmfile.yaml: basePa... [10:16:11] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:39] (03PS1) 10Muehlenhoff: Update README.Debian to reflect latest changes for U2F/6.6/OIDC [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841864 [10:19:43] (03PS1) 10Jgiannelos: mobileapps: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/841865 [10:20:08] (03PS1) 10Filippo Giunchedi: confd: remove check_confd_template icinga check [puppet] - 10https://gerrit.wikimedia.org/r/841886 (https://phabricator.wikimedia.org/T314118) [10:20:10] (03PS1) 10Filippo Giunchedi: WIP mediawiki: remove PHP7 icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/841887 (https://phabricator.wikimedia.org/T314118) [10:20:17] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 20115 [10:21:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 20115 [10:22:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [10:25:49] (03PS1) 10Volans: sre.hosts.provision: fix separator for boot order [cookbooks] - 10https://gerrit.wikimedia.org/r/841890 [10:26:50] (03CR) 10MVernon: [C: 03+1] "Once this is running in prod, I would like a test case adding so we can check we don't break it in future. 
But that (obviously) needn't bl" [puppet] - 10https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: 10MusikAnimal) [10:29:00] (03PS1) 10Muehlenhoff: Remove Ganeti role from ganeti1005 [puppet] - 10https://gerrit.wikimedia.org/r/841892 (https://phabricator.wikimedia.org/T320419) [10:29:17] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:35] (03PS1) 10Matthias Mullie: Enable NS_MAIN thumbnails only on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841893 (https://phabricator.wikimedia.org/T320510) [10:30:25] (03Abandoned) 10Matthias Mullie: Explicitly set wgPageImagesNamespaces to none where disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841133 (https://phabricator.wikimedia.org/T306883) (owner: 10Matthias Mullie) [10:30:37] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [10:32:04] (03CR) 10Ladsgroup: "beta cluster is fine but before production, let's go through it together and make some optimizations. e.g. adding some logs would be nice." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535) (owner: 10Daniel Kinzler) [10:32:08] (03PS2) 10Matthias Mullie: Enable NS_MAIN thumbnails only on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841893 (https://phabricator.wikimedia.org/T320510) [10:32:10] (03CR) 10Ladsgroup: [C: 03+1] Beta: Enable parsoid cache warming. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535) (owner: 10Daniel Kinzler) [10:32:45] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:33:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance [10:33:11] (03CR) 10Cparle: [C: 03+1] Enable NS_MAIN thumbnails only on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841893 (https://phabricator.wikimedia.org/T320510) (owner: 10Matthias Mullie) [10:33:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance [10:33:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [10:33:24] !log depooling eventstreams in codfw - T310721 [10:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:29] T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 [10:33:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [10:33:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T318955)', diff saved to https://phabricator.wikimedia.org/P35418 and previous config saved to /var/cache/conftool/dbconfig/20221012-103338-ladsgroup.json [10:33:42] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [10:33:51] !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventstreams,name=codfw [10:35:38] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [10:36:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db2128 (T318955)', diff saved to https://phabricator.wikimedia.org/P35419 and previous config saved to /var/cache/conftool/dbconfig/20221012-103604-ladsgroup.json [10:39:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [10:39:57] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:41:34] !log jgiannelos@deploy1002 Started deploy [restbase/deploy@0474832]: Update restbase to 1a02cdfb [10:48:15] 10SRE, 10Traffic, 10observability: ATS Request Error Ratio SLI shows negative values - https://phabricator.wikimedia.org/T320615 (10Vgutierrez) [10:48:27] 10SRE, 10Traffic, 10observability: ATS Request Error Ratio SLI shows negative values - https://phabricator.wikimedia.org/T320615 (10Vgutierrez) p:05Triage→03Medium [10:49:23] !log installing dbus security updates [10:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:08] (03PS1) 10DDesouza: Remove Research Incentive survey from eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841895 (https://phabricator.wikimedia.org/T318331) [10:51:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P35420 and previous config saved to /var/cache/conftool/dbconfig/20221012-105111-ladsgroup.json [10:55:07] (03Abandoned) 10Muehlenhoff: Remove ganeti role from ganeti4004 [puppet] - 10https://gerrit.wikimedia.org/r/841124 (owner: 10Muehlenhoff) [10:55:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye [10:57:07] !log redeploying eventstreams codfw - T310721 [10:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:12] T310721: eventstreams chart should use latest common_templates 
- https://phabricator.wikimedia.org/T310721 [10:58:00] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [10:58:46] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [11:00:28] (03PS3) 10Ayounsi: admin: add hshaikh and ptiwary to private-data users [puppet] - 10https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: 10AOkoth) [11:01:43] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventstreams,name=codfw [11:02:07] !log repooled eventstreams in codfw - T310721 [11:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P35421 and previous config saved to /var/cache/conftool/dbconfig/20221012-110617-ladsgroup.json [11:06:53] (03PS1) 10Vgutierrez: mtail::atsbackend: Make sure that sli_total is always incremented [puppet] - 10https://gerrit.wikimedia.org/r/841896 (https://phabricator.wikimedia.org/T320615) [11:07:20] (03PS2) 10Vgutierrez: mtail::atsbackend: Ensure that sli_total is always incremented [puppet] - 10https://gerrit.wikimedia.org/r/841896 (https://phabricator.wikimedia.org/T320615) [11:07:22] !log jgiannelos@deploy1002 Finished deploy [restbase/deploy@0474832]: Update restbase to 1a02cdfb (duration: 25m 48s) [11:08:52] (03CR) 10Santhosh: [C: 03+2] AddContributeCardEntryPoint: Use RequestContext::getMain [extensions/ContentTranslation] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841509 (https://phabricator.wikimedia.org/T319327) (owner: 10Kosta Harlan) [11:09:16] (03CR) 10Muehlenhoff: "One nit inline, but looks good in general" [puppet] - 10https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: 10AOkoth) [11:10:20] (03PS4) 10Ayounsi: admin: add hshaikh and ptiwary to private-data users [puppet] - 
10https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: 10AOkoth) [11:11:11] !log installing bind9 security updates on buster (client side tools/libs) [11:11:14] (03CR) 10CI reject: [V: 04-1] admin: add hshaikh and ptiwary to private-data users [puppet] - 10https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: 10AOkoth) [11:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:34] (03PS1) 10Ladsgroup: Add rename_flaggedrevs_indexes_T318950.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/841899 (https://phabricator.wikimedia.org/T318950) [11:11:41] (03PS5) 10Ayounsi: admin: add hshaikh and ptiwary to private-data users [puppet] - 10https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: 10AOkoth) [11:11:50] (03CR) 10Ayounsi: "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: 10AOkoth) [11:16:00] 10SRE, 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10JArguello-WMF) @Clement_Goubert Thank you so much! Please let us know if there is anything we need... [11:20:56] 10SRE, 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10Clement_Goubert) `eventstream` redeployed in codfw. @JArguello-WMF Apart from checking everything i... 
[11:21:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T318955)', diff saved to https://phabricator.wikimedia.org/P35422 and previous config saved to /var/cache/conftool/dbconfig/20221012-112124-ladsgroup.json [11:21:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance [11:21:29] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:21:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance [11:21:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35423 and previous config saved to /var/cache/conftool/dbconfig/20221012-112146-ladsgroup.json [11:24:02] !log depooling eventstreams in eqiad - T310721 [11:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:06] T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 [11:24:14] !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventstreams,name=eqiad [11:28:43] (03Merged) 10jenkins-bot: AddContributeCardEntryPoint: Use RequestContext::getMain [extensions/ContentTranslation] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841509 (https://phabricator.wikimedia.org/T319327) (owner: 10Kosta Harlan) [11:44:06] !log redeploying eventstreams eqiad - T310721 [11:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:11] T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 [11:45:59] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [11:46:23] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [11:46:42] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35424 and previous config saved to /var/cache/conftool/dbconfig/20221012-114642-ladsgroup.json [11:46:47] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:48:20] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: 10AOkoth) [11:50:12] !log repooling eventstreams in eqiad - T310721 [11:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:17] T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 [11:51:15] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventstreams,name=eqiad [11:51:17] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1005.eqiad.wmnet with reason: Remove from cluster for eventual decom [11:51:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1005.eqiad.wmnet with reason: Remove from cluster for eventual decom [11:51:54] (03CR) 10Ayounsi: [C: 03+2] admin: add hshaikh and ptiwary to private-data users [puppet] - 10https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) (owner: 10AOkoth) [11:52:21] 10SRE, 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10Clement_Goubert) `eventstream` redeployed in eqiad [11:59:56] 10SRE, 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10Clement_Goubert) Everything looks healthy from my 
end, both are getting traffic and not throwing err... [12:00:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (10ayounsi) 05In progress→03Resolved a:03ayounsi Users added to the WMF LDAP group, as well as #wmf-nda. And the private-data-users in h... [12:01:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P35425 and previous config saved to /var/cache/conftool/dbconfig/20221012-120148-ladsgroup.json [12:08:36] (03CR) 10Muehlenhoff: [C: 03+2] maps: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840139 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:09:34] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:11:17] (03PS2) 10Muehlenhoff: dns: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837098 (https://phabricator.wikimedia.org/T308013) [12:12:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS buster [12:13:05] (03PS2) 10WMDE-Fisch: Enable show nearby feature on a small group of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841854 (https://phabricator.wikimedia.org/T316782) [12:14:22] (03CR) 10Svantje Lilienthal: [C: 03+1] Enable show nearby feature on a small group of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841854 (https://phabricator.wikimedia.org/T316782) (owner: 10WMDE-Fisch) [12:15:45] (03PS5) 10Stang: Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705) [12:16:53] (03Abandoned) 10Stang: 
Re-download and optimize wordmark/tagline svg file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829760 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [12:16:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P35426 and previous config saved to /var/cache/conftool/dbconfig/20221012-121655-ladsgroup.json [12:17:59] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (10ayounsi) Hi @KFrancis could you confirm that "User has a valid NDA on file with WMF legal" ? Thanks! [12:19:25] (03CR) 10Stang: "Also fix some merge conflict.." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [12:19:48] (03CR) 10Muehlenhoff: [C: 03+2] Remove Ganeti role from ganeti1005 [puppet] - 10https://gerrit.wikimedia.org/r/841892 (https://phabricator.wikimedia.org/T320419) (owner: 10Muehlenhoff) [12:20:03] 10SRE, 10serviceops-radar, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Alert on individual pybal backend hosts being down for a long time - https://phabricator.wikimedia.org/T320627 (10fgiunchedi) [12:25:21] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [12:28:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [12:28:20] (03CR) 10Vgutierrez: [C: 03+2] mtail::atsbackend: Ensure that sli_total is always incremented [puppet] - 10https://gerrit.wikimedia.org/r/841896 (https://phabricator.wikimedia.org/T320615) (owner: 10Vgutierrez) [12:28:40] (03CR) 10Daniel Kinzler: Beta: Enable parsoid cache warming. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535) (owner: 10Daniel Kinzler) [12:32:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35427 and previous config saved to /var/cache/conftool/dbconfig/20221012-123201-ladsgroup.json [12:32:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance [12:32:07] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:32:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance [12:32:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T318955)', diff saved to https://phabricator.wikimedia.org/P35428 and previous config saved to /var/cache/conftool/dbconfig/20221012-123223-ladsgroup.json [12:32:25] 10Puppet, 10Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10SLyngshede-WMF) > all the affected hosts are on stretch, but of the ~375 hosts we still have on stretch those are the o... [12:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [12:36:40] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (10ayounsi) Nevermind, found the spreadsheet, NDA is there. 
@odimitrijevic or @Ottomata I need your approval as the request is for `analytics-privatedata-users` [12:37:02] 10Puppet, 10Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10MoritzMuehlenhoff) >>! In T290984#8311170, @SLyngshede-WMF wrote: >> all the affected hosts are on stretch, but of the... [12:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:39:00] 10Puppet, 10Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10SLyngshede-WMF) 05In progress→03Resolved [12:40:44] 10SRE, 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10Ottomata) > eventstreams-internal is still used? I am not sure! I'd imagine folks use it, as it is... [12:41:17] 10Puppet, 10Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10SLyngshede-WMF) Closed due to Stretch hosts having gone away. [12:41:24] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:41:32] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (10Ottomata) Approved. 
[12:42:12] (03PS1) 10Filippo Giunchedi: sre: test warning on pybal backends being down for long [alerts] - 10https://gerrit.wikimedia.org/r/841905 (https://phabricator.wikimedia.org/T320627) [12:42:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (10ayounsi) [12:43:30] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [12:45:33] (03CR) 10Muehlenhoff: [C: 03+2] dns: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837098 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:46:41] (03PS1) 10Ayounsi: admin: add manuel-wmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/841907 (https://phabricator.wikimedia.org/T320504) [12:48:50] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Create a cookbook to switch an instance to DRBD/plain disk storage - https://phabricator.wikimedia.org/T312116 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff The cookbook has been created as sre.ganeti.changedisk and works fine. 
[12:48:52] 10SRE, 10Ganeti: Cookbooks for Ganeti maintenance tasks - https://phabricator.wikimedia.org/T283319 (10MoritzMuehlenhoff) [12:49:03] (03PS1) 10JMeybohm: dragonfly::dfdaemon: Fix dummy ssl_paths object [puppet] - 10https://gerrit.wikimedia.org/r/841908 [12:53:32] (03CR) 10MVernon: [C: 03+1] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/841907 (https://phabricator.wikimedia.org/T320504) (owner: 10Ayounsi) [12:53:36] (03PS2) 10JMeybohm: dragonfly::dfdaemon: Fix dummy ssl_paths object [puppet] - 10https://gerrit.wikimedia.org/r/841908 [12:54:05] (03CR) 10Ayounsi: [C: 03+2] admin: add manuel-wmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/841907 (https://phabricator.wikimedia.org/T320504) (owner: 10Ayounsi) [12:55:41] (03PS1) 10Jelto: gitlab_runner: restrict all internal traffic, not only TCP [puppet] - 10https://gerrit.wikimedia.org/r/841910 (https://phabricator.wikimedia.org/T295481) [12:55:46] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:56:14] (03CR) 10CI reject: [V: 04-1] gitlab_runner: restrict all internal traffic, not only TCP [puppet] - 10https://gerrit.wikimedia.org/r/841910 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [12:56:36] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (10ayounsi) 05Open→03Resolved a:03ayounsi Give it 30min for the change to propagate and you should be all set. Please re-open if there are any issues. 
[12:56:45] (03PS2) 10Jelto: gitlab_runner: restrict all internal traffic, not only TCP [puppet] - 10https://gerrit.wikimedia.org/r/841910 (https://phabricator.wikimedia.org/T295481) [12:57:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T318955)', diff saved to https://phabricator.wikimedia.org/P35429 and previous config saved to /var/cache/conftool/dbconfig/20221012-125725-ladsgroup.json [12:57:31] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:58:50] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:52] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37513/console" [puppet] - 10https://gerrit.wikimedia.org/r/841910 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1300). [13:00:04] WMDE-Fisch, danisztls, and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] i can deploy today! [13:00:32] o/ [13:00:36] hi WMDE-Fisch! [13:00:40] o/ [13:00:54] Hi urbanecm. Would be great if you could deploy mine at least. 
[13:01:12] sure [13:01:20] (03PS3) 10Urbanecm: Enable show nearby feature on a small group of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841854 (https://phabricator.wikimedia.org/T316782) (owner: 10WMDE-Fisch) [13:01:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841854 (https://phabricator.wikimedia.org/T316782) (owner: 10WMDE-Fisch) [13:02:11] koi: hi! should we also do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/829764? or do you prefer to do it at a later date? [13:02:33] urbanecm: I prefer a later date :) [13:02:39] sounds good [13:04:26] (03CR) 10Herron: [C: 03+1] rsyslog::conf remove trailing newline logic [puppet] - 10https://gerrit.wikimedia.org/r/841583 (https://phabricator.wikimedia.org/T320569) (owner: 10JHathaway) [13:04:27] I'm pleasantly surprised how well scap backport handles things. got the "unexpected commits" screen [13:04:58] !log urbanecm@deploy1002 Backport cancelled. 
[13:04:59] o.O [13:05:05] (03PS1) 10Urbanecm: Revert "AddContributeCardEntryPoint: Use RequestContext::getMain" [extensions/ContentTranslation] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841872 (https://phabricator.wikimedia.org/T319327) [13:05:14] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "AddContributeCardEntryPoint: Use RequestContext::getMain" [extensions/ContentTranslation] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841872 (https://phabricator.wikimedia.org/T319327) (owner: 10Urbanecm) [13:05:18] jouncebot: nowandnext [13:05:18] For the next 0 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1300) [13:05:19] In 4 hour(s) and 54 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1800) [13:05:19] In 4 hour(s) and 54 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1800) [13:05:31] urbanecm: let me know once you're done [13:05:33] (03CR) 10TrainBranchBot: "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841854 (https://phabricator.wikimedia.org/T316782) (owner: 10WMDE-Fisch) [13:05:41] !log urbanecm@deploy1002 backport aborted: (duration: 00m 09s) [13:05:45] Amir1: will do [13:05:49] (03PS1) 10Jelto: gitlab_runner: add webproxy to allowed_services [puppet] - 10https://gerrit.wikimedia.org/r/841912 (https://phabricator.wikimedia.org/T295481) [13:05:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841854 (https://phabricator.wikimedia.org/T316782) (owner: 10WMDE-Fisch) [13:06:19] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:841854|Enable show nearby feature on a small group of wikis (T316782)]] [13:06:23] T316782: Deploy Show Nearby feature to small group of wikis - 
https://phabricator.wikimedia.org/T316782 [13:06:43] !log urbanecm@deploy1002 urbanecm and wmde-fisch: Backport for [[gerrit:841854|Enable show nearby feature on a small group of wikis (T316782)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:06:57] matthiasmullie: can you test at mwdebug1001, please? [13:07:07] (03PS9) 10Ottomata: charts:eventgate bump common_templates and standardize labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [13:08:10] (03CR) 10Ottomata: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [13:08:11] urbanecm: Doing that now ;-) [13:08:15] eh, sorry [13:08:22] thanks [13:08:56] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37514/console" [puppet] - 10https://gerrit.wikimedia.org/r/841912 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:09:14] urbanecm: Works like a charm. Go on please! [13:09:18] syncing! 
[13:09:57] !log draining ganeti1007 T320419 [13:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:02] T320419: decommission ganeti1005/ganeti1006/ganeti1007/ganeti1008 - https://phabricator.wikimedia.org/T320419 [13:10:53] (03PS6) 10Urbanecm: Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [13:11:00] (03CR) 10Urbanecm: [C: 03+2] Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [13:11:56] (03PS1) 10Muehlenhoff: Remove ganeti role from ganeti1007 [puppet] - 10https://gerrit.wikimedia.org/r/841914 (https://phabricator.wikimedia.org/T320419) [13:12:22] * urbanecm doesn't see danisztls, will skip their patch [13:12:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P35430 and previous config saved to /var/cache/conftool/dbconfig/20221012-131232-ladsgroup.json [13:13:22] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:841854|Enable show nearby feature on a small group of wikis (T316782)]] (duration: 07m 03s) [13:13:27] T316782: Deploy Show Nearby feature to small group of wikis - https://phabricator.wikimedia.org/T316782 [13:13:30] WMDE-Fisch: should be live! 
[13:13:33] (03Merged) 10jenkins-bot: Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [13:13:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [13:14:00] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:829563|Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php (T307705)]] [13:14:04] T307705: Extend mw-config's logos management system to also cover wordmarks (wmgSiteLogoWordmark) - https://phabricator.wikimedia.org/T307705 [13:14:23] !log urbanecm@deploy1002 urbanecm and stang: Backport for [[gerrit:829563|Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php (T307705)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:14:34] koi: live at mwdebug1001, can you check please? [13:14:43] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts d-i-test.eqiad.wmnet [13:14:58] wondering how to check this... randomly pick some sites? [13:15:26] koi: yeah [13:15:59] (03CR) 10Ladsgroup: [C: 03+1] Beta: Enable parsoid cache warming. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535) (owner: 10Daniel Kinzler) [13:16:05] Hi. I'm late. Is there still time? [13:16:54] urbanecm: I checked nowikimedia, bnwikibooks, zhwiki, wikidatawiki, no issue found, so LGTM [13:16:59] great! [13:17:05] danisztls: yup yup [13:17:12] the https://gerrit.wikimedia.org/r/c/841895/, right? 
[13:17:26] urbanecm: yes :) [13:18:39] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:18:54] (03PS2) 10Urbanecm: Remove Research Incentive survey from eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841895 (https://phabricator.wikimedia.org/T318331) (owner: 10DDesouza) [13:19:04] (03CR) 10Urbanecm: [C: 03+2] Remove Research Incentive survey from eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841895 (https://phabricator.wikimedia.org/T318331) (owner: 10DDesouza) [13:20:35] (03Merged) 10jenkins-bot: Remove Research Incentive survey from eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841895 (https://phabricator.wikimedia.org/T318331) (owner: 10DDesouza) [13:21:07] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:829563|Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php (T307705)]] (duration: 07m 06s) [13:21:11] T307705: Extend mw-config's logos management system to also cover wordmarks (wmgSiteLogoWordmark) - https://phabricator.wikimedia.org/T307705 [13:21:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841895 (https://phabricator.wikimedia.org/T318331) (owner: 10DDesouza) [13:21:26] koi: and live! [13:21:33] thanks! 
[13:21:37] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:841895|Remove Research Incentive survey from eswiki (T318331)]] [13:21:42] T318331: Deploy Research Incentive Survey on Spanish Wikipedia - https://phabricator.wikimedia.org/T318331 [13:21:52] (03CR) 10Jelto: [V: 03+1] "a little bit more context in T295481#8311437" [puppet] - 10https://gerrit.wikimedia.org/r/841910 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:22:01] !log urbanecm@deploy1002 urbanecm and dani: Backport for [[gerrit:841895|Remove Research Incentive survey from eswiki (T318331)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:22:01] let me just purge the static resources cache too [13:22:09] danisztls: your patch is at mwdebug1001, please test [13:22:24] urbanecm: lgtm [13:22:40] (03PS1) 10Vgutierrez: mtail::atsbackend: Fix TTFB regex [puppet] - 10https://gerrit.wikimedia.org/r/841917 (https://phabricator.wikimedia.org/T320615) [13:23:10] great, syncing [13:23:24] purged [13:24:10] (03PS2) 10Vgutierrez: mtail::atsbackend: Fix TTFB regex [puppet] - 10https://gerrit.wikimedia.org/r/841917 (https://phabricator.wikimedia.org/T320615) [13:24:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:24:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts d-i-test.eqiad.wmnet [13:24:34] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `d-i-test.eqiad.wmnet` - d-i-test.eqiad.wmnet (**WARN**) - //Host not found on Icinga, una... 
[13:25:45] (03PS1) 10Muehlenhoff: Remove d-i-test Puppet references [puppet] - 10https://gerrit.wikimedia.org/r/841918 [13:25:48] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:59] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:841895|Remove Research Incentive survey from eswiki (T318331)]] (duration: 05m 21s) [13:27:03] T318331: Deploy Research Incentive Survey on Spanish Wikipedia - https://phabricator.wikimedia.org/T318331 [13:27:08] danisztls: and live [13:27:08] (03CR) 10Herron: [C: 03+1] "LGTM although would be nice to update ACLs eventually as well" [puppet] - 10https://gerrit.wikimedia.org/r/841542 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [13:27:16] urbanecm: thanks! [13:27:18] Amir1: I'm done, over to you :) [13:27:29] 10SRE, 10observability: Overlap between "check systemd state" alert and "check unit status of " - https://phabricator.wikimedia.org/T319304 (10fgiunchedi) [13:27:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P35431 and previous config saved to /var/cache/conftool/dbconfig/20221012-132738-ladsgroup.json [13:27:52] 10Puppet, 10SRE, 10Infrastructure-Foundations: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (10fgiunchedi) [13:28:03] thanks [13:28:09] I'm going to mess with mwdebug1001 [13:30:12] (03PS1) 10Muehlenhoff: Remove d-i-test from special handling [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/841919 [13:30:30] (03CR) 10Muehlenhoff: [C: 03+2] Remove d-i-test Puppet references [puppet] - 10https://gerrit.wikimedia.org/r/841918 (owner: 10Muehlenhoff) [13:34:29] (03PS1) 10Ladsgroup: rdbms: Instead of reconfiguring all of LB, just remove depooled db [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841873 
(https://phabricator.wikimedia.org/T298485) [13:35:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/841890 (owner: 10Volans) [13:36:40] (03CR) 10Ladsgroup: [C: 03+2] "Deploying to mwdebug only to test depool, will revert afterwards." [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841873 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [13:36:43] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: fix separator for boot order [cookbooks] - 10https://gerrit.wikimedia.org/r/841890 (owner: 10Volans) [13:41:29] (03Merged) 10jenkins-bot: sre.hosts.provision: fix separator for boot order [cookbooks] - 10https://gerrit.wikimedia.org/r/841890 (owner: 10Volans) [13:42:10] (03CR) 10MVernon: [C: 03+2] deploy swift_ring_manager to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/828664 (https://phabricator.wikimedia.org/T316845) (owner: 10Zabe) [13:42:40] 10SRE-OnFire, 10Observability-Alerting, 10Discovery-Search (Current work), 10Sustainability (Incident Followup): Improve Search team alerting for missing masters - https://phabricator.wikimedia.org/T313095 (10bking) a:05bking→03None [13:42:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T318955)', diff saved to https://phabricator.wikimedia.org/P35432 and previous config saved to /var/cache/conftool/dbconfig/20221012-134245-ladsgroup.json [13:42:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [13:42:50] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [13:43:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [13:43:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35433 and previous config saved to 
/var/cache/conftool/dbconfig/20221012-134306-ladsgroup.json [13:43:30] (03CR) 10Zabe: [C: 03+1] swift: Add deployment-prep_hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/836953 (https://phabricator.wikimedia.org/T316845) (owner: 10Samtar) [13:43:55] (03PS1) 10Ssingh: hiera: use Linux 5.10 on cp4045 (buster) [puppet] - 10https://gerrit.wikimedia.org/r/841923 (https://phabricator.wikimedia.org/T319067) [13:44:50] (03CR) 10MVernon: [C: 03+2] swift: Add deployment-prep_hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/836953 (https://phabricator.wikimedia.org/T316845) (owner: 10Samtar) [13:44:57] 10Puppet, 10Infrastructure-Foundations: sslcert::x509_to_pkcs12 fails to overwrite a valid output file when its contents should change - https://phabricator.wikimedia.org/T287869 (10SLyngshede-WMF) 05Open→03Resolved Closed, @BTullis has submitted a patch and this hasn't been an issue since. We'll reopen th... [13:46:42] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/841923 (https://phabricator.wikimedia.org/T319067) (owner: 10Ssingh) [13:47:06] (03CR) 10Ssingh: [C: 03+2] hiera: use Linux 5.10 on cp4045 (buster) [puppet] - 10https://gerrit.wikimedia.org/r/841923 (https://phabricator.wikimedia.org/T319067) (owner: 10Ssingh) [13:47:28] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [13:47:56] (03PS2) 10Stang: Drop unused wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829764 (https://phabricator.wikimedia.org/T307705) [13:48:23] (03PS3) 10Stang: Drop unused wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829764 (https://phabricator.wikimedia.org/T307705) [13:48:59] (03PS1) 10Filippo Giunchedi: systemd: drop timer-specific alert in favor of generic alert [puppet] - 10https://gerrit.wikimedia.org/r/841924 (https://phabricator.wikimedia.org/T303253) [13:49:37] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for 
host cp4045.ulsfo.wmnet with OS buster [13:49:38] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10Gehel) [13:50:46] (03PS4) 10Stang: Drop unused wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829764 (https://phabricator.wikimedia.org/T307705) [13:52:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841873 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [13:52:40] (03CR) 10Filippo Giunchedi: "Mostly a proposal, let me know what you think!" [puppet] - 10https://gerrit.wikimedia.org/r/841924 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [13:52:47] (03Merged) 10jenkins-bot: rdbms: Instead of reconfiguring all of LB, just remove depooled db [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841873 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [13:53:12] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:841873|rdbms: Instead of reconfiguring all of LB, just remove depooled db (T298485)]] [13:53:17] T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485 [13:53:36] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:841873|rdbms: Instead of reconfiguring all of LB, just remove depooled db (T298485)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [13:54:06] RECOVERY - Host ps1-b7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms [13:54:48] RECOVERY - Host an-worker1098.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.53 ms [13:54:54] RECOVERY - Host clouddumps1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.75 ms [13:55:44] RECOVERY - Host cp1081.mgmt is UP: PING OK - Packet loss = 0%, RTA = 10.72 ms [13:55:46] RECOVERY - Host 
ms-be1041.mgmt is UP: PING WARNING - Packet loss = 60%, RTA = 1.07 ms [13:56:24] RECOVERY - Host cp1082.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.60 ms [13:56:34] RECOVERY - Host ms-be1053.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.54 ms [13:56:36] RECOVERY - Host restbase-dev1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 6.45 ms [13:56:44] RECOVERY - Host an-worker1087.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [13:57:14] RECOVERY - Host clouddb1016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [13:57:24] RECOVERY - Host mw1313.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.11 ms [13:57:24] RECOVERY - Host mw1315.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.97 ms [13:59:26] RECOVERY - Host an-worker1086.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [13:59:48] RECOVERY - Host mw1314.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [13:59:48] RECOVERY - Host mw1316.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [14:00:10] RECOVERY - Host ores1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [14:01:24] RECOVERY - Host analytics1073.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms [14:01:24] RECOVERY - Host cloudvirt1017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms [14:01:24] RECOVERY - Host cloudvirt1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [14:01:25] RECOVERY - Host cloudvirt1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [14:02:58] RECOVERY - Host lvs1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.49 ms [14:03:12] (03CR) 10Volans: [C: 03+1] "LGTM, thx!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/841919 (owner: 10Muehlenhoff) [14:04:47] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [14:06:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1175', diff saved to https://phabricator.wikimedia.org/P35434 and previous config saved to /var/cache/conftool/dbconfig/20221012-140626-ladsgroup.json [14:07:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool db1175', diff saved to https://phabricator.wikimedia.org/P35435 and previous config saved to /var/cache/conftool/dbconfig/20221012-140746-ladsgroup.json [14:08:14] !log ladsgroup@deploy1002 Sync cancelled. [14:08:32] (03PS1) 10Ladsgroup: Revert "rdbms: Instead of reconfiguring all of LB, just remove depooled db" [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841874 [14:08:38] (03CR) 10Ladsgroup: [C: 03+2] Revert "rdbms: Instead of reconfiguring all of LB, just remove depooled db" [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841874 (owner: 10Ladsgroup) [14:09:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35436 and previous config saved to /var/cache/conftool/dbconfig/20221012-140903-ladsgroup.json [14:09:08] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:10:57] (03PS3) 10Ori: add profile::docker::gvisor [puppet] - 10https://gerrit.wikimedia.org/r/841575 (https://phabricator.wikimedia.org/T316706) [14:13:32] (03PS1) 10David Caro: wmcs.toolforge.grid: get also the job logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841930 [14:15:57] duesen: finally the live depool works ^^ [14:16:10] basically this is needed https://gerrit.wikimedia.org/r/c/mediawiki/core/+/828577/ [14:17:11] (03CR) 10CI reject: [V: 04-1] wmcs.toolforge.grid: get 
also the job logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841930 (owner: 10David Caro) [14:18:07] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4045.ulsfo.wmnet with OS buster [14:18:36] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Investigate issue with msw-b7-eqiad - https://phabricator.wikimedia.org/T320598 (10cmooney) 05Open→03Resolved a:03cmooney @Jclark-ctr has replaced the switch and devices are now back online: `lines=10 cmooney@msw1-eqiad> show ethernet-switch... [14:19:23] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS buster [14:24:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P35438 and previous config saved to /var/cache/conftool/dbconfig/20221012-142410-ladsgroup.json [14:29:07] (03Merged) 10jenkins-bot: Revert "rdbms: Instead of reconfiguring all of LB, just remove depooled db" [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841874 (owner: 10Ladsgroup) [14:30:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841874 (owner: 10Ladsgroup) [14:30:41] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:841874|Revert "rdbms: Instead of reconfiguring all of LB, just remove depooled db"]] [14:31:04] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:841874|Revert "rdbms: Instead of reconfiguring all of LB, just remove depooled db"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [14:34:15] (03CR) 10Vgutierrez: [C: 03+2] mtail::atsbackend: Fix TTFB regex [puppet] - 10https://gerrit.wikimedia.org/r/841917 (https://phabricator.wikimedia.org/T320615) (owner: 10Vgutierrez) [14:35:18] !log ladsgroup@deploy1002 Finished scap: 
Backport for [[gerrit:841874|Revert "rdbms: Instead of reconfiguring all of LB, just remove depooled db"]] (duration: 04m 37s) [14:39:04] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [14:39:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P35439 and previous config saved to /var/cache/conftool/dbconfig/20221012-143917-ladsgroup.json [14:39:29] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [14:41:17] (03PS2) 10DLynch: Register the editattempt_block schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833442 (https://phabricator.wikimedia.org/T310390) [14:42:50] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T320645 (10phaultfinder) [14:45:39] RECOVERY - DNS on lvs1018.mgmt is OK: DNS OK: 0.017 seconds response time. lvs1018.mgmt.eqiad.wmnet returns 10.65.1.209 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:49:55] RECOVERY - DNS on elastic1085.mgmt is OK: DNS OK: 0.010 seconds response time. elastic1085.mgmt.eqiad.wmnet returns 10.65.1.222 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:49:55] RECOVERY - DNS on elastic1086.mgmt is OK: DNS OK: 0.010 seconds response time. elastic1086.mgmt.eqiad.wmnet returns 10.65.1.223 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:50:38] (03PS1) 10Muehlenhoff: Install 5.10 in late_setup.sh for next Gen PowerEdges [puppet] - 10https://gerrit.wikimedia.org/r/841936 (https://phabricator.wikimedia.org/T319067) [14:50:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (10Manuel) Thank you all! 
[14:53:45] (03CR) 10BBlack: [C: 03+1] Install 5.10 in late_setup.sh for next Gen PowerEdges [puppet] - 10https://gerrit.wikimedia.org/r/841936 (https://phabricator.wikimedia.org/T319067) (owner: 10Muehlenhoff) [14:54:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T318955)', diff saved to https://phabricator.wikimedia.org/P35440 and previous config saved to /var/cache/conftool/dbconfig/20221012-145423-ladsgroup.json [14:54:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance [14:54:29] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:54:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance [14:54:43] (03CR) 10Ssingh: [C: 03+1] Install 5.10 in late_setup.sh for next Gen PowerEdges (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/841936 (https://phabricator.wikimedia.org/T319067) (owner: 10Muehlenhoff) [14:54:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T318955)', diff saved to https://phabricator.wikimedia.org/P35441 and previous config saved to /var/cache/conftool/dbconfig/20221012-145445-ladsgroup.json [14:56:11] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4045.ulsfo.wmnet with OS buster [14:56:19] (03CR) 10Muehlenhoff: [C: 03+2] Install 5.10 in late_setup.sh for next Gen PowerEdges [puppet] - 10https://gerrit.wikimedia.org/r/841936 (https://phabricator.wikimedia.org/T319067) (owner: 10Muehlenhoff) [14:57:06] !log depooling eventstreams-internal codfw - T310721 [14:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:11] T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 [14:57:11] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2178 (T318955)', diff saved to https://phabricator.wikimedia.org/P35442 and previous config saved to /var/cache/conftool/dbconfig/20221012-145711-ladsgroup.json [14:57:21] !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventstreams-internal,name=codfw [14:57:51] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:00:39] (03PS4) 10Elukey: istio: reduce Envoy logspam [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/841527 (https://phabricator.wikimedia.org/T320468) [15:00:50] (03CR) 10Muehlenhoff: [C: 03+2] Install 5.10 in late_setup.sh for next Gen PowerEdges (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/841936 (https://phabricator.wikimedia.org/T319067) (owner: 10Muehlenhoff) [15:03:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS buster [15:03:21] ^ third time's a charm [15:03:27] RECOVERY - DNS on an-worker1130.mgmt is OK: DNS OK: 0.016 seconds response time. an-worker1130.mgmt.eqiad.wmnet returns 10.65.0.156 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:07:03] !log redeploying eventstreams-internal codfw - T310721 [15:07:03] (03PS23) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [15:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:07] T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 [15:07:12] (03CR) 10Filippo Giunchedi: "Please let me know what you think! 
The background/context is reducing per-host IRC alert spam, while at the same time keep the alerts rele" [alerts] - 10https://gerrit.wikimedia.org/r/841905 (https://phabricator.wikimedia.org/T320627) (owner: 10Filippo Giunchedi) [15:07:23] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [15:07:45] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [15:08:01] RECOVERY - DNS on kafka-main1002.mgmt is OK: DNS OK: 0.010 seconds response time. kafka-main1002.mgmt.eqiad.wmnet returns 10.65.3.130 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:08:13] (03PS1) 10Volans: sre.hosts.provision: make errors more explicit [cookbooks] - 10https://gerrit.wikimedia.org/r/841938 [15:09:20] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventstreams-internal,name=codfw [15:09:34] !log repooled eventstreams-internal codfw - T310721 [15:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P35443 and previous config saved to /var/cache/conftool/dbconfig/20221012-151217-ladsgroup.json [15:13:09] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/841527 (https://phabricator.wikimedia.org/T320468) (owner: 10Elukey) [15:14:37] (03PS3) 10Matthias Mullie: Enable NS_MAIN thumbnails only on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841893 (https://phabricator.wikimedia.org/T320510) [15:15:26] !log hnowlan@deploy1002 Started deploy [restbase/deploy@2d002b3]: Add ig,bcl,bn,tl wikiquote, ig wiktionary T314641 [15:15:31] T314641: Add igwikiquote to RESTBase - https://phabricator.wikimedia.org/T314641 [15:15:51] RECOVERY - DNS on dbprov1002.mgmt is OK: DNS OK: 0.011 seconds response time. 
dbprov1002.mgmt.eqiad.wmnet returns 10.65.3.18 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:16:48] !log depooling eventstreams-internal eqiad - T310721 [15:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:53] T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 [15:16:55] !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventstreams-internal,name=eqiad [15:18:51] (03PS13) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) [15:19:42] (03CR) 10CI reject: [V: 04-1] thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [15:22:12] (03PS2) 10Cathal Mooney: Add section for PIC config of QFX5120-48Y port block speeds [homer/public] - 10https://gerrit.wikimedia.org/r/840105 (https://phabricator.wikimedia.org/T303529) [15:23:04] (03PS14) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) [15:23:29] !log redeploying eventstreams-internal eqiad - T310721 [15:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:34] T310721: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 [15:23:47] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [15:23:54] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [15:24:14] (03CR) 10Cathal Mooney: Add section for PIC config of QFX5120-48Y port block speeds (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/840105 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [15:24:36] !log oblivian@deploy1002 
helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [15:24:41] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [15:25:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (one typo inline)" [puppet] - 10https://gerrit.wikimedia.org/r/841577 (owner: 10Jbond) [15:25:07] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4045.ulsfo.wmnet with reason: host reimage [15:26:33] !log remove materialized .json files from schemas/event/primary - this should be a no-op as no clients should actually be using the json files. - T315674 [15:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:37] T315674: Remove materialized .json files from event schema repositories - https://phabricator.wikimedia.org/T315674 [15:26:37] (03PS1) 10Urbanecm: eswiki: Deploy mentorship to only 15% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841939 (https://phabricator.wikimedia.org/T285235) [15:26:40] jouncebot: nowandnext [15:26:41] No deployments scheduled for the next 2 hour(s) and 33 minute(s) [15:26:41] In 2 hour(s) and 33 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1800) [15:26:41] In 2 hour(s) and 33 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1800) [15:26:55] (03CR) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [15:27:01] ^^going to ship the above, it's time-sensitive for Growth^^ [15:27:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P35444 and previous config saved to /var/cache/conftool/dbconfig/20221012-152724-ladsgroup.json [15:27:30] (03CR) 
10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841939 (https://phabricator.wikimedia.org/T285235) (owner: 10Urbanecm) [15:27:54] (03CR) 10Hnowlan: thumbor: new service chart (0318 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [15:28:27] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams-internal_4992: Servers kubernetes1008.eqiad.wmnet, kubernetes1020.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/Py [15:28:35] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4045.ulsfo.wmnet with reason: host reimage [15:29:45] (JobUnavailable) firing: Reduced availability for job swagger_check_eventstreams_internal_cluster_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:30:26] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [15:30:41] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [15:31:15] (03PS1) 10Vgutierrez: trafficserver: Partition cache in one server per DC and cluster #6 [puppet] - 10https://gerrit.wikimedia.org/r/841941 (https://phabricator.wikimedia.org/T317748) [15:31:28] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@2d002b3]: Add ig,bcl,bn,tl wikiquote, ig wiktionary T314641 (duration: 16m 02s) [15:31:32] T314641: Add igwikiquote to RESTBase - https://phabricator.wikimedia.org/T314641 [15:32:46] 
(03Merged) 10jenkins-bot: eswiki: Deploy mentorship to only 15% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841939 (https://phabricator.wikimedia.org/T285235) (owner: 10Urbanecm) [15:32:49] finally [15:33:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Bump version in Chart.yaml too, otherwise the changes will not be deployable." [deployment-charts] - 10https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [15:33:09] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:841939|eswiki: Deploy mentorship to only 15% of users (T285235)]] [15:33:15] T285235: Activate Growth mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T285235 [15:33:27] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:33:32] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:841939|eswiki: Deploy mentorship to only 15% of users (T285235)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [15:33:33] (03PS1) 10Stang: Re-download and optimize wordmark/tagline svg file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841942 (https://phabricator.wikimedia.org/T307705) [15:33:50] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventstreams-internal,name=eqiad [15:34:13] Sorry for the eventstreams-internal alarms [15:34:45] (JobUnavailable) resolved: Reduced availability for job swagger_check_eventstreams_internal_cluster_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:33] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:841939|eswiki: Deploy mentorship to only 15% of users (T285235)]] (duration: 04m 23s) [15:39:11] 
10SRE, 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10Clement_Goubert) `eventstreams-internal` fully redeployed, this task can probably be closed now. [15:39:19] (CertAlmostExpired) resolved: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:39:48] (CertAlmostExpired) firing: Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:40:08] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37516/console" [puppet] - 10https://gerrit.wikimedia.org/r/841941 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez) [15:42:24] 10SRE, 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10Ottomata) Thank you so much! 
[15:42:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T318955)', diff saved to https://phabricator.wikimedia.org/P35445 and previous config saved to /var/cache/conftool/dbconfig/20221012-154230-ladsgroup.json [15:42:32] (03CR) 10Elukey: [V: 03+2 C: 03+2] istio: reduce Envoy logspam [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/841527 (https://phabricator.wikimedia.org/T320468) (owner: 10Elukey) [15:42:36] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:43:39] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:44:48] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:45:21] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Partition cache in one server per DC and cluster #6 [puppet] - 10https://gerrit.wikimedia.org/r/841941 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez) [15:45:45] !log partitioning the ATS cache in cp[2031-2032], cp[6002,6010], cp[1079-1080], cp[5003,5009], cp[3054-3055], cp[4023,4032] - T317748 [15:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:51] T317748: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 [15:46:16] (03Abandoned) 10Jdlrobson: EXPECTED VISUAL CHANGES IN WMF.4 [skins/Vector] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838818 (https://phabricator.wikimedia.org/T317573) (owner: 10Jdlrobson) [15:46:50] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host cp4045.ulsfo.wmnet 
with OS buster [15:47:08] eh [15:47:28] :) [15:47:33] stop breaking things sukhe ;P [15:47:43] ll [15:47:44] lol [15:48:02] vgutierrez: next time I will be more careful! [15:48:43] I think you missed the step where you're supposed to sing a lullaby to the newly installed server [15:49:07] vgutierrez: I am going to sing this https://www.youtube.com/watch?v=dQw4w9WgXcQ [15:49:19] ahhahaha [15:50:27] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) @Jclark-ctr @Cmjohnson I am planning on moving all the links on cr[1-2]-eqaid from fpc4 to fpc3 for the once in both cr1-eqiad from FPC4 to FPC3 and cr2... [15:55:10] (03PS1) 10Volans: sre.hosts.reimage: increase Netbox polling [cookbooks] - 10https://gerrit.wikimedia.org/r/841943 [15:55:46] (03CR) 10Ssingh: [C: 03+1] "Thank you for the quick patch!" [cookbooks] - 10https://gerrit.wikimedia.org/r/841943 (owner: 10Volans) [15:55:50] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/840583 [16:00:00] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10Andrew) Let's back off of this plan for OSDs. The two nics on hypervisors are control plane and data plane, whereas on the OSDs they're both dataplane (on... 
[16:01:13] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/840584 [16:01:45] (03CR) 10Volans: [C: 03+2] "Mostly UI changes, self-merging" [cookbooks] - 10https://gerrit.wikimedia.org/r/841938 (owner: 10Volans) [16:01:52] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: increase Netbox polling [cookbooks] - 10https://gerrit.wikimedia.org/r/841943 (owner: 10Volans) [16:03:31] (03PS2) 10David Caro: wmcs.toolforge.grid: get also the job logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841930 [16:05:03] (03Merged) 10jenkins-bot: sre.hosts.provision: make errors more explicit [cookbooks] - 10https://gerrit.wikimedia.org/r/841938 (owner: 10Volans) [16:05:26] (03Merged) 10jenkins-bot: sre.hosts.reimage: increase Netbox polling [cookbooks] - 10https://gerrit.wikimedia.org/r/841943 (owner: 10Volans) [16:05:28] (03PS1) 10Elukey: Deploy Istio 1.9.5-6 Docker images to the ML clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/841944 (https://phabricator.wikimedia.org/T320468) [16:10:06] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/840585 [16:12:00] (03CR) 10Elukey: [C: 03+2] Deploy Istio 1.9.5-6 Docker images to the ML clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/841944 (https://phabricator.wikimedia.org/T320468) (owner: 10Elukey) [16:19:04] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [16:22:24] (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/840584 (owner: 10PipelineBot) [16:22:37] (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/840583 (owner: 10PipelineBot) [16:28:28] (03PS1) 10Elukey: ml-services: update Docker images after code refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/841947 
(https://phabricator.wikimedia.org/T320374) [16:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [16:33:59] 10SRE, 10Traffic, 10observability: ATS Request Error Ratio SLI shows negative values - https://phabricator.wikimedia.org/T320615 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez {F35564906} [16:34:42] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841966 [16:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:49:44] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) [16:51:36] (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/840585 (owner: 10PipelineBot) [16:52:03] (03CR) 10Dduvall: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841966 (owner: 10PipelineBot) [16:55:05] !log dduvall@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [16:55:07] !log dduvall@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [16:55:22] (03PS1) 10JHathaway: otrs_aliases.py: add postfix support [puppet] - 10https://gerrit.wikimedia.org/r/841950 [16:55:37] !log dduvall@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [16:55:39] !log dduvall@deploy1002 helmfile [staging] DONE 
helmfile.d/services/blubberoid: apply [16:55:53] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/841950 (owner: 10JHathaway) [16:56:00] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/841950 (owner: 10JHathaway) [16:57:30] (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841966 (owner: 10PipelineBot) [17:00:22] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [17:00:34] !log dduvall@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [17:02:15] !log dduvall@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [17:03:06] (03PS4) 10Dzahn: vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) [17:03:30] (03PS1) 10Ssingh: cp4045: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/841952 (https://phabricator.wikimedia.org/T319067) [17:05:29] (03CR) 10Dzahn: [C: 04-1] "Class[Profile::Mariadb::Generic_server]: has no parameter named 'ensure'" [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) (owner: 10Dzahn) [17:05:34] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ssingh) Thanks to @MoritzMuehlenhoff and @Volans for their help in resolving the buster Linux 5.10 issue! ` sukhe@cp4045:~$ uname -r 5.10.0-0.deb10.17-a... 
[17:06:32] !log dduvall@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply [17:06:55] !log dduvall@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply [17:07:08] !log dduvall@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply [17:07:20] (03PS10) 10Btullis: Add a new production images for spark and spark-operator [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) [17:07:47] !log dduvall@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply [17:08:36] (03CR) 10Btullis: Add a new production images for spark and spark-operator (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [17:08:41] (03PS5) 10Dzahn: vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) [17:09:01] (03PS2) 10Ssingh: cp4045: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/841952 (https://phabricator.wikimedia.org/T319067) [17:09:15] (03CR) 10CI reject: [V: 04-1] vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) (owner: 10Dzahn) [17:14:08] 10SRE, 10Discovery-Search, 10serviceops, 10serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Clement_Goubert) a:05Dzahn→03Clement_Goubert Just for clarification, we are talking about the service named `apple-search` in service discovery... 
[17:14:33] (03PS6) 10Dzahn: vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) [17:16:59] (03PS7) 10Dzahn: vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) [17:17:13] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37520/otrs1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) (owner: 10Dzahn) [17:17:36] (03CR) 10Dzahn: [V: 03+1 C: 03+2] vrts: allow installing a local mariadb server in cloud [puppet] - 10https://gerrit.wikimedia.org/r/838912 (https://phabricator.wikimedia.org/T317059) (owner: 10Dzahn) [17:26:11] (03PS3) 10Ssingh: cp4045: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/841952 (https://phabricator.wikimedia.org/T319067) [17:30:17] (03CR) 10BBlack: [C: 03+1] cp4045: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/841952 (https://phabricator.wikimedia.org/T319067) (owner: 10Ssingh) [17:31:27] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37522/gitlab-runner1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/841910 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [17:32:39] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "ferm refreshed on gitlab-runner1003, no issues" [puppet] - 10https://gerrit.wikimedia.org/r/841910 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [17:35:31] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Neutron: Expose the Neutron public API [puppet] - 10https://gerrit.wikimedia.org/r/838907 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [17:35:37] (03PS3) 10Andrew Bogott: Openstack Neutron: Expose the Neutron public API [puppet] - 
10https://gerrit.wikimedia.org/r/838907 (https://phabricator.wikimedia.org/T319312) [17:35:42] (03PS3) 10Andrew Bogott: Openstack Designate: Expose the Designate public API [puppet] - 10https://gerrit.wikimedia.org/r/838908 (https://phabricator.wikimedia.org/T319312) [17:39:39] (03CR) 10Andrew Bogott: [C: 03+2] alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [17:39:55] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Designate: Expose the Designate public API [puppet] - 10https://gerrit.wikimedia.org/r/838908 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [17:46:24] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37523/gitlab-runner1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/841912 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [17:49:33] (03CR) 10Ssingh: [C: 03+2] cp4045: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/841952 (https://phabricator.wikimedia.org/T319067) (owner: 10Ssingh) [17:53:50] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "/etc/ferm/conf.d/18_docker-allow-webproxy-codw-http and others have been created, ferm was refreshed, saw no issues. 
on gitlab-runner1003" [puppet] - 10https://gerrit.wikimedia.org/r/841912 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [17:57:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:57:46] PROBLEM - Check systemd state on kubernetes2012 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:04] dduvall and ^demon: Dear deployers, time to do the Train log triage with CPT deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1800). [18:00:05] dduvall and ^demon: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T1800). 
[18:02:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:02:40] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2012 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:03:45] !log dduvall@deploy1002 deploy-promote aborted: (duration: 00m 07s) [18:03:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [18:03:55] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841957 (https://phabricator.wikimedia.org/T314194) [18:03:57] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841957 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot) [18:04:43] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841957 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot) [18:08:28] PROBLEM - Host mw1314.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:08:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [18:09:05] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.5 refs T314194 [18:09:10] T314194: 1.40.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T314194 [18:12:25] (03PS1) 10Cwhite: logstash: drop noisy envoy deprecation warning [puppet] - 10https://gerrit.wikimedia.org/r/841967 (https://phabricator.wikimedia.org/T320468) [18:12:43] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.5 refs T314194 (duration: 03m 38s) [18:16:33] (03CR) 10Cwhite: [C: 03+2] logstash: drop noisy envoy deprecation warning [puppet] - 10https://gerrit.wikimedia.org/r/841967 (https://phabricator.wikimedia.org/T320468) (owner: 10Cwhite) [18:22:40] RECOVERY - Check systemd state on kubernetes2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:24:55] 10SRE, 10ops-eqiad, 10Data-Engineering: Check analytics1086 mgmt's cable - https://phabricator.wikimedia.org/T320458 (10Jclark-ctr) a:03Jclark-ctr [18:25:07] (03PS1) 10Zabe: Set $wgSitename for bnwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841961 (https://phabricator.wikimedia.org/T319183) [18:25:45] 10SRE, 10ops-eqiad, 10Data-Engineering: Check analytics1086 mgmt's cable - https://phabricator.wikimedia.org/T320458 (10Jclark-ctr) @BTullis @elukey Management switch failed today and was replaced can you verify if it is still not working for you? 
[18:26:54] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS buster [18:27:03] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster [18:29:21] (03CR) 10Yahya: [C: 03+1] Set $wgSitename for bnwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841961 (https://phabricator.wikimedia.org/T319183) (owner: 10Zabe) [18:32:42] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2012 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:40:26] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4045.ulsfo.wmnet with OS buster [18:40:33] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster executed with errors: - cp... 
[18:40:48] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:41:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS buster [18:42:01] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster [18:47:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:49:25] (03CR) 10Andrew Bogott: [C: 04-1] "nits:" [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [18:52:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:54:20] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:55:10] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (10KFrancis) @ayounsi I am confirming Manuel Merz has an NDA on file. Please proceed with the access request. Thanks! 
[18:55:24] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:59:28] (03PS1) 10Cwhite: logstash: expand filter to drop more envoy deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/841968 (https://phabricator.wikimedia.org/T320468) [19:00:07] (03CR) 10Hashar: Send events to Wikimedia EventGate (036 comments) [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar) [19:02:32] (03CR) 10Cwhite: [C: 03+2] logstash: expand filter to drop more envoy deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/841968 (https://phabricator.wikimedia.org/T320468) (owner: 10Cwhite) [19:02:37] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:14] ^ looking [19:04:53] change in fundraising redirect behavior, I'll make sure it's intended and then update the tests [19:06:09] thanks! seemed like a simple test failure and hence I didn't bother to ping [19:06:21] (03PS2) 10Stang: Re-download and optimize wordmark/tagline svg file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841942 (https://phabricator.wikimedia.org/T307705) [19:07:31] yeah for sure, no urgency [19:12:41] (03CR) 10Andrew Bogott: [C: 04-1] "A few preliminary comments, nothing major!" 
[puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [19:15:46] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4045.ulsfo.wmnet with OS buster [19:15:54] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster executed with errors: - cp... [19:16:15] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS buster [19:16:23] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster [19:21:21] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841969 [19:21:22] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841970 [19:21:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:23:47] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:25:42] (03PS1) 10Stang: yiwiktionary: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841992 (https://phabricator.wikimedia.org/T310961) [19:29:01] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4045.ulsfo.wmnet with OS buster [19:29:09] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: 
add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster executed with errors: - cp... [19:31:07] (03CR) 10Hashar: Send events to Wikimedia EventGate (033 comments) [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar) [19:37:03] PROBLEM - SSH on mw1328.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:38:12] frtech confirms those httpbb failures are catching an expected change that just went out with 1.40.0-wmf.5, so I'll update the asserts [19:47:45] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:49:34] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:50:52] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-35), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10Eevans) >>! In T317417#8280934, @MusikAnimal wrote: >>>! In T317417#8280822, @Eevans wro... 
[19:51:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10serviceops-collab: Q2:rack/setup/install webperf1005.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Dzahn) [19:51:34] 10SRE, 10serviceops: service implementation tracking: webperf1005.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10Dzahn) 05Open→03Stalled [19:52:18] 10SRE, 10serviceops: service implementation tracking: webperf2005.codfw.wmnet - https://phabricator.wikimedia.org/T319429 (10Dzahn) 05Open→03Stalled [19:52:22] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install webperf2005.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Dzahn) [19:53:24] (03PS3) 10Samtar: Register the editattempt_block schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833442 (https://phabricator.wikimedia.org/T310390) (owner: 10DLynch) [19:58:29] (a little early but) I can deploy! :D [19:59:19] * bd808 looks at clock, looks at TheresNoTime, looks at timezone map, looks away ;) [19:59:42] TheresNoTime: It's not one I can test anything about, so if it merges and doesn't immediately cause errors it can go out. [20:00:05] RoanKattouw, Urbanecm, cjming, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221012T2000). [20:00:05] kemayo, zabe, duesen, and koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:07] Kemayo: awesome, I'll wait for the window to start proper but will do yours first [20:00:09] oh, there :D [20:00:16] o/ [20:00:21] o/ [20:00:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833442 (https://phabricator.wikimedia.org/T310390) (owner: 10DLynch) [20:00:47] o/ [20:01:21] o/ [20:02:01] Can someone clarify whether config deployments for beta need scap? [20:02:18] (03Merged) 10jenkins-bot: Register the editattempt_block schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833442 (https://phabricator.wikimedia.org/T310390) (owner: 10DLynch) [20:02:48] !log samtar@deploy1002 Started scap: Backport for [[gerrit:833442|Register the editattempt_block schema (T310390)]] [20:02:53] T310390: Instrument blocked edit attempts - https://phabricator.wikimedia.org/T310390 [20:03:11] !log samtar@deploy1002 samtar and kemayo: Backport for [[gerrit:833442|Register the editattempt_block schema (T310390)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:03:16] duesen: not really, they end up on the beta cluster automagically after they're +2'd [20:03:56] duesen: You should use `scap backport` on beta-only config changes to ensure that they get pulled down to the deploy server (to avoid an alert). They won't be synced. [20:03:57] Ok. But I guess it's still good to scap, since otherwise, files get out of whack on the prod servers, even if they aren't used there... [20:04:02] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1023 - https://phabricator.wikimedia.org/T319001 (10Jclark-ctr) 05Open→03Resolved [20:04:17] Kemayo: syncing 833442, nothing broken afaics [20:04:26] TheresNoTime: great, thanks! [20:04:36] dancy: oh, they won't be synced? are they excluded somehow? 
[20:05:07] well, they will eventually be synced during a subsequent sync that someone else might run [20:05:23] but `scap backport` will skip a needless sync if it detects a beta-only change. [20:05:33] magic... [20:05:54] zabe: your patch will be next FYI [20:06:16] (03PS2) 10Samtar: Set $wgSitename for bnwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841961 (https://phabricator.wikimedia.org/T319183) (owner: 10Zabe) [20:07:49] duesen: are you wanting to self-deploy? :) [20:08:20] TheresNoTime: yea, I want to try the new magic thingy :) [20:08:30] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:833442|Register the editattempt_block schema (T310390)]] (duration: 05m 42s) [20:08:35] T310390: Instrument blocked edit attempts - https://phabricator.wikimedia.org/T310390 [20:08:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841961 (https://phabricator.wikimedia.org/T319183) (owner: 10Zabe) [20:09:10] duesen: :D I'll just get the ones ahead of you done then it'll be all yours [20:09:28] (03Merged) 10jenkins-bot: Set $wgSitename for bnwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841961 (https://phabricator.wikimedia.org/T319183) (owner: 10Zabe) [20:09:31] TheresNoTime: ok, let me know. [20:09:36] will do [20:09:55] !log samtar@deploy1002 Started scap: Backport for [[gerrit:841961|Set $wgSitename for bnwikiquote (T319183)]] [20:09:59] T319183: Create Wikiquote Bengali - https://phabricator.wikimedia.org/T319183 [20:10:18] !log samtar@deploy1002 samtar and zabe: Backport for [[gerrit:841961|Set $wgSitename for bnwikiquote (T319183)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:10:20] zabe: live on mwdebug, can you test? 
[20:10:31] TheresNoTime, lgtm [20:10:34] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841971 [20:10:40] syncin' [20:11:03] TheresNoTime: I still need to be in the correct directory when doing the scap, right? [20:11:27] duesen: I don't think so, but I change to it out of habit anyway [20:11:51] i see [20:12:16] (a helpful answer, I know!) [20:14:36] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:841961|Set $wgSitename for bnwikiquote (T319183)]] (duration: 04m 40s) [20:14:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829764 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [20:14:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841942 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [20:15:11] koi: just doing yours now :) [20:15:49] (03Merged) 10jenkins-bot: Drop unused wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829764 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [20:16:09] (03Merged) 10jenkins-bot: Re-download and optimize wordmark/tagline svg file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841942 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [20:16:24] TheresNoTime: I don't think these two patches need to be tested, so you could sync directly [20:16:34] !log samtar@deploy1002 Started scap: Backport for [[gerrit:829764|Drop unused wordmark/tagline (T307705)]], [[gerrit:841942|Re-download and optimize wordmark/tagline svg file (T307705)]] [20:16:39] T307705: Extend mw-config's logos management system to also cover wordmarks (wmgSiteLogoWordmark) - https://phabricator.wikimedia.org/T307705 [20:16:39] Thanks sammy :) [20:16:57] !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:829764|Drop 
unused wordmark/tagline (T307705)]], [[gerrit:841942|Re-download and optimize wordmark/tagline svg file (T307705)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:16:58] koi: okay :) [20:19:16] (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841969 (owner: 10PipelineBot) [20:19:21] (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841970 (owner: 10PipelineBot) [20:19:26] (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/841971 (owner: 10PipelineBot) [20:20:36] (03PS1) 10JHathaway: add dummy mysql password for postfix [labs/private] - 10https://gerrit.wikimedia.org/r/842002 [20:21:27] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:829764|Drop unused wordmark/tagline (T307705)]], [[gerrit:841942|Re-download and optimize wordmark/tagline svg file (T307705)]] (duration: 04m 53s) [20:21:35] koi: all done :) [20:21:38] duesen: all yours! [20:22:05] ok, let me see... [20:23:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535) (owner: 10Daniel Kinzler) [20:23:27] (03CR) 10JHathaway: [C: 03+2] add dummy mysql password for postfix [labs/private] - 10https://gerrit.wikimedia.org/r/842002 (owner: 10JHathaway) [20:23:30] (03CR) 10JHathaway: [V: 03+2 C: 03+2] add dummy mysql password for postfix [labs/private] - 10https://gerrit.wikimedia.org/r/842002 (owner: 10JHathaway) [20:23:47] (03Merged) 10jenkins-bot: Beta: Enable parsoid cache warming. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841859 (https://phabricator.wikimedia.org/T320535) (owner: 10Daniel Kinzler) [20:26:28] checking that beta didn't explode... 
[20:27:31] T320535 looks pretty interesting.. [20:27:32] T320535: Put Parsoid output into the ParserCache on the beta cluster and testwiki - https://phabricator.wikimedia.org/T320535 [20:28:03] ok, looking good. [20:28:07] moving on to the next one [20:28:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841858 (https://phabricator.wikimedia.org/T320531) (owner: 10Daniel Kinzler) [20:28:43] hrrmm... Gerrit could not merge the change '841858' as is and could require a rebase [20:28:59] (03PS3) 10Daniel Kinzler: Beta: Switch VE on dewiki to direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841858 (https://phabricator.wikimedia.org/T320531) [20:29:08] (03CR) 10TrainBranchBot: "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841858 (https://phabricator.wikimedia.org/T320531) (owner: 10Daniel Kinzler) [20:29:48] (03Merged) 10jenkins-bot: Beta: Switch VE on dewiki to direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841858 (https://phabricator.wikimedia.org/T320531) (owner: 10Daniel Kinzler) [20:30:25] Testing VE on dewiki beta [20:31:10] TheresNoTime: I posted one more patch, could you please deploy that? thanks [20:31:15] duesen: ah, you'll need to wait for https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/413018/console (and the associated `beta-scap-sync-world` job) to finish [20:31:30] koi: sure, will do it after ^ :) [20:32:13] Looking good. [20:32:17] ok, all done! Thank you! [20:32:42] koi: which patch? 
:) [20:32:58] ah, 841992 [20:33:04] yep [20:33:04] (03PS3) 10Samtar: yiwiktionary: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841992 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [20:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [20:34:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841992 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [20:34:35] koi: assume you will be able to test this one? [20:34:44] yeah, I'll test this one [20:34:48] (03Merged) 10jenkins-bot: yiwiktionary: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841992 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [20:35:14] !log samtar@deploy1002 Started scap: Backport for [[gerrit:841992|yiwiktionary: Adjust width-height ratio of logo to fix display issue (T310961)]] [20:35:19] T310961: Site logo cropped/not fully displayed on some projects - https://phabricator.wikimedia.org/T310961 [20:35:38] !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:841992|yiwiktionary: Adjust width-height ratio of logo to fix display issue (T310961)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:35:39] koi: live on mwdebug :) [20:36:23] TheresNoTime: new logo LGTM [20:36:31] syncin' [20:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming 
alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:38:17] RECOVERY - SSH on mw1328.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:40:31] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:841992|yiwiktionary: Adjust width-height ratio of logo to fix display issue (T310961)]] (duration: 05m 17s) [20:40:36] T310961: Site logo cropped/not fully displayed on some projects - https://phabricator.wikimedia.org/T310961 [20:40:38] all done [20:41:09] !log closing UTC late backport window [20:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:10] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10KFrancis) Hi all, I just received this request. Arian Bozorg does not yet have an NDA on file. I will work on the agreement and let you know when it's complete. Thanks! [20:45:05] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10KFrancis) @Arian_Bozorg Please send me your WMDE email address to kfrancis@wikimedia.org as soon as possible. 
Thanks! [20:54:34] (CertAlmostExpired) resolved: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:59:18] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:02:26] (03PS2) 10Cwhite: logstash: heavily sample k8s proxy/httpd logs [puppet] - 10https://gerrit.wikimedia.org/r/831626 (https://phabricator.wikimedia.org/T313099) [21:06:10] !log clean up old db backups on grafana2001 [21:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:50] (03CR) 10Cwhite: [C: 03+2] logstash: heavily sample k8s proxy/httpd logs [puppet] - 10https://gerrit.wikimedia.org/r/831626 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite) [21:27:19] (03PS1) 10Stang: logos: Document how to update wordmark/tagline via manage.py [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842010 (https://phabricator.wikimedia.org/T307705) [21:28:16] (03PS2) 10Stang: logos: Document how to update wordmark/tagline via manage.py [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842010 (https://phabricator.wikimedia.org/T307705) [21:45:39] (03PS1) 10RLazarus: httpbb: Update Special:FundraiserRedirector tests for new behavior [puppet] - 10https://gerrit.wikimedia.org/r/842013 [21:48:05] (03CR) 10RLazarus: "Tested:" [puppet] - 10https://gerrit.wikimedia.org/r/842013 (owner: 10RLazarus) [21:52:51] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul 
https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [21:57:23] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [22:49:42] (03PS1) 10Tim Starling: Migrate to PHP 7.4 case mapping, but retain Georgian overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842019 (https://phabricator.wikimedia.org/T292552) [22:51:33] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:59:53] PROBLEM - SSH on ms-be1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:00:44] (03PS1) 10BryanDavis: buster: Fix image build failures found on 2022-10-12 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/842020 [23:07:48] (03CR) 10BryanDavis: [C: 03+2] buster: Fix image build failures found on 2022-10-12 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/842020 (owner: 10BryanDavis) [23:08:09] (03PS2) 10BryanDavis: mono68-sssd: New image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/840327 (https://phabricator.wikimedia.org/T311466) (owner: 10Majavah) [23:08:27] (03Merged) 10jenkins-bot: buster: Fix image build failures found on 2022-10-12 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/842020 (owner: 10BryanDavis) [23:11:16] (03CR) 10BryanDavis: [C: 03+2] mono68-sssd: New image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/840327 (https://phabricator.wikimedia.org/T311466) (owner: 10Majavah) [23:12:22] (03Merged) 10jenkins-bot: mono68-sssd: New image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/840327 
(https://phabricator.wikimedia.org/T311466) (owner: 10Majavah) [23:15:14] (03PS2) 10BryanDavis: toollabs-images: refresh toolforge repository URL [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/675823 (https://phabricator.wikimedia.org/T278436) (owner: 10Arturo Borrero Gonzalez) [23:17:42] (03CR) 10BryanDavis: [C: 03+2] toollabs-images: refresh toolforge repository URL [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/675823 (https://phabricator.wikimedia.org/T278436) (owner: 10Arturo Borrero Gonzalez) [23:18:18] (03Merged) 10jenkins-bot: toollabs-images: refresh toolforge repository URL [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/675823 (https://phabricator.wikimedia.org/T278436) (owner: 10Arturo Borrero Gonzalez) [23:28:31] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:30:13] (03Abandoned) 10BryanDavis: [WIP] Install yj in buster0 stack [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/637199 (https://phabricator.wikimedia.org/T266716) (owner: 10Legoktm) [23:52:43] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook