[00:03:53] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:04:17] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:10:47] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48682 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:11:11] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.303 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:15:54] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:32:49] <icinga-wm>	 RECOVERY - Disk space on dumpsdata1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1001&var-datasource=eqiad+prometheus/ops
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:39:01] <icinga-wm>	 RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:05] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:07] <icinga-wm>	 PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:56:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:00:11] <wikibugs>	 (PS5) Abijeet Patro: Add editcontentmodel right for metawiki translation administrators [mediawiki-config] - https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587)
[04:15:54] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[04:18:17] <icinga-wm>	 PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:41:15] <wikibugs>	 SRE, Traffic, Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (AndyRussG) Heyy just some quick thoughts and questions about testing with live user requests... - So, we could use CentralNotice to add the includer HTML comment string...
[05:01:09] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:18:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:19:29] <icinga-wm>	 RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:22:13] <icinga-wm>	 PROBLEM - SSH on mw1311.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:23:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:32:57] <wikibugs>	 (PS1) ArielGlenn: keep minimum older sql/xml dump files on generation hosts [puppet] - https://gerrit.wikimedia.org/r/833625 (https://phabricator.wikimedia.org/T318206)
[05:36:47] <wikibugs>	 (CR) ArielGlenn: [C: +2] keep minimum older sql/xml dump files on generation hosts [puppet] - https://gerrit.wikimedia.org/r/833625 (https://phabricator.wikimedia.org/T318206) (owner: ArielGlenn)
[05:53:55] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:57:41] <icinga-wm>	 PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:00:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:02:21] <icinga-wm>	 RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:02:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:05:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:07:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:34:59] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:35:37] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:39] <icinga-wm>	 RECOVERY - SSH on analytics1077.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:56:06] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering-Operations: Requesting Kerberos access for alinebruenger and siko - https://phabricator.wikimedia.org/T316766 (Siko_WMDE) Hi @Ottomata,  Got the E-Mail!  Thank you :-)
[06:58:11] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:58:57] <icinga-wm>	 RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T0700)
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:20] <urbanecm>	 indeed, nothing to do
[07:06:13] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:24:53] <icinga-wm>	 RECOVERY - SSH on mw1311.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:05] <jouncebot>	 jnuche and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T0800).
[08:02:09] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833707 (https://phabricator.wikimedia.org/T314191)
[08:02:11] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833707 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot)
[08:02:26] <wikibugs>	 (CR) CI reject: [V: -1] group1 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833707 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot)
[08:07:22] <wikibugs>	 (PS2) Awight: Enable QuickSurveys on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/832393 (https://phabricator.wikimedia.org/T317841)
[08:07:31] <wikibugs>	 (CR) CI reject: [V: -1] Enable QuickSurveys on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/832393 (https://phabricator.wikimedia.org/T317841) (owner: Awight)
[08:07:35] <hashar>	 !log Restarting Gerrit to clear stalled sockets in Zuul
[08:07:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:18] <wikibugs>	 (PS3) Awight: Enable QuickSurveys on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/832393 (https://phabricator.wikimedia.org/T317841)
[08:10:18] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833726 (https://phabricator.wikimedia.org/T314191)
[08:10:20] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833726 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot)
[08:11:07] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833726 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot)
[08:12:14] <wikibugs>	 (Abandoned) Jaime Nuche: group1 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833707 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot)
[08:14:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:15:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:15:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:15:30] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.2  refs T314191
[08:15:34] <stashbot>	 T314191: 1.40.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T314191
[08:15:54] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:15:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:19:33] <logmsgbot>	 !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.2  refs T314191 (duration: 04m 02s)
[08:21:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:22:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:22:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:24:29] <icinga-wm>	 PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:25:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:51:17] <wikibugs>	 (CR) Nikerabbit: [C: +1] Add editcontentmodel right for metawiki translation administrators [mediawiki-config] - https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587) (owner: Abijeet Patro)
[08:55:57] <icinga-wm>	 PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c4122.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:45] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:00:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:05:25] <icinga-wm>	 PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:05:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:21:58] <wikibugs>	 (PS7) Gmodena: charts:eventgate bump common_templates and standardize labels [deployment-charts] - https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[09:25:47] <icinga-wm>	 RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:33:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:38:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:39:12] <wikibugs>	 (PS1) ArielGlenn: start daily cleanup job of sql/xmldumps later in the day [puppet] - https://gerrit.wikimedia.org/r/833736 (https://phabricator.wikimedia.org/T318206)
[09:41:26] <wikibugs>	 (CR) ArielGlenn: [C: +2] start daily cleanup job of sql/xmldumps later in the day [puppet] - https://gerrit.wikimedia.org/r/833736 (https://phabricator.wikimedia.org/T318206) (owner: ArielGlenn)
[09:48:34] <wikibugs>	 SRE, Gerrit, Traffic, Patch-For-Review, Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (hashar) >>! In T191183#8192855, @kostajh wrote: > Coming back to this again... since the Gravatar issue (T263161) is unlikely to move...
[10:06:49] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:06:49] <wikibugs>	 SRE, Gerrit, Traffic, Patch-For-Review, Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (kostajh) >>! In T191183#8249977, @hashar wrote: >>>! In T191183#8192855, @kostajh wrote: >> Coming back to this again... since the Gra...
[10:10:16] <wikibugs>	 (PS8) Gmodena: charts:eventgate bump common_templates and standardize labels [deployment-charts] - https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[10:42:52] <wikibugs>	 SRE: Add annotations from ops vendor maintenance calendar to Grafana - https://phabricator.wikimedia.org/T223934 (colewhite)
[10:42:55] <wikibugs>	 SRE, Observability-Logging, Patch-For-Review: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (colewhite)
[10:48:27] <wikibugs>	 SRE, Observability-Metrics: Add annotations from ops vendor maintenance calendar to Grafana - https://phabricator.wikimedia.org/T223934 (colewhite) Tagging observability-metrics because while logging could handle it, but it may not be the most efficient way to get this information in.  We have SimpleJSON...
[10:50:31] <wikibugs>	 SRE, Observability-Logging, Patch-For-Review: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (colewhite) In progress→Resolved MVP achieved.  Further iterations and features should come in separately.
[11:09:07] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:16:09] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[11:18:25] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[11:29:03] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:45:15] <icinga-wm>	 PROBLEM - Host dns5002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:45:21] <icinga-wm>	 PROBLEM - Host dns5001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:45:25] <icinga-wm>	 PROBLEM - Host cp5013 is DOWN: PING CRITICAL - Packet loss = 100%
[11:45:37] <icinga-wm>	 RECOVERY - Host cp5013 is UP: PING WARNING - Packet loss = 60%, RTA = 306.24 ms
[11:45:37] <icinga-wm>	 RECOVERY - Host dns5001 is UP: PING WARNING - Packet loss = 33%, RTA = 306.06 ms
[11:45:39] <icinga-wm>	 RECOVERY - Host dns5002 is UP: PING OK - Packet loss = 0%, RTA = 305.89 ms
[11:45:54] <taavi>	 umm
[12:09:21] <icinga-wm>	 RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:10:21] <icinga-wm>	 RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:15:54] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[12:29:51] <icinga-wm>	 PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:31:23] <wikibugs>	 (PS2) Aqu: Deploy Spark 3 conf and debian pkg to test cluster [puppet] - https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882)
[12:33:50] <wikibugs>	 (PS3) Aqu: Deploy Spark 3 conf and debian pkg to test cluster [puppet] - https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T1300)
[13:00:05] <jouncebot>	 arlolra, abijeet, and zabe: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:13] <Lucas_WMDE>	 o/
[13:00:22] <zabe>	 o/
[13:00:33] <Lucas_WMDE>	 I can deploy!
[13:00:34] <urbanecm>	 o/
[13:00:38] <urbanecm>	 I'm here, but mobile only
[13:00:44] <urbanecm>	 Lucas_WMDE: go for it!
[13:01:04] <arlolra>	 here
[13:01:21] <Lucas_WMDE>	 let’s start with the brave enwikivoyage pioneers
[13:01:27] <abijeet>	 o/
[13:01:48] <wikibugs>	 (PS4) Lucas Werkmeister (WMDE): Disable wgParserEnableLegacyMediaDOM on enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/830707 (https://phabricator.wikimedia.org/T314318) (owner: Arlolra)
[13:01:52] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Disable wgParserEnableLegacyMediaDOM on enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/830707 (https://phabricator.wikimedia.org/T314318) (owner: Arlolra)
[13:02:33] <wikibugs>	 (Merged) jenkins-bot: Disable wgParserEnableLegacyMediaDOM on enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/830707 (https://phabricator.wikimedia.org/T314318) (owner: Arlolra)
[13:03:05] <Lucas_WMDE>	 arlolra: the enwikivoyage change is on mwdebug1001, can you test it?
[13:03:15] <wikibugs>	 (PS6) Abijeet Patro: Add editcontentmodel right for metawiki translation administrators [mediawiki-config] - https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587)
[13:03:16] <arlolra>	 Yes
[13:03:17] <wikibugs>	 (PS1) Jforrester: Move non-variant wgMFNearby to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/833770
[13:03:19] <wikibugs>	 (PS1) Jforrester: Move non-variant wgMFUseWikibase to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/833771
[13:03:31] <Lucas_WMDE>	 👀
[13:04:54] <arlolra>	 Lucas_WMDE: looks good
[13:05:12] <Lucas_WMDE>	 ok, thanks!
[13:05:22] <arlolra>	 thank you
[13:05:46] <Lucas_WMDE>	 syncing
[13:09:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:09:47] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:830707|Disable wgParserEnableLegacyMediaDOM on enwikivoyage (T314318)]] (turning on new-style media output) (duration: 04m 03s)
[13:09:50] <stashbot>	 T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318
[13:10:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:10:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:11:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:11:13] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] "Looks like there are some concerns that it should be possible to create message bundles without this right, but no objections to granting " [mediawiki-config] - https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587) (owner: Abijeet Patro)
[13:11:57] <wikibugs>	 (Merged) jenkins-bot: Add editcontentmodel right for metawiki translation administrators [mediawiki-config] - https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587) (owner: Abijeet Patro)
[13:12:28] <Lucas_WMDE>	 abijeet: the change is on mwdebug1001, please test
[13:12:35] <abijeet>	 ok, checking
[13:13:21] <Lucas_WMDE>	 (looks good on my end)
[13:14:28] <abijeet>	 Lucas_WMDE, looks good to me too
[13:14:33] <Lucas_WMDE>	 great, thanks!
[13:14:42] <Lucas_WMDE>	 syncing
[13:14:54] <abijeet>	 thank you!
[13:16:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:17:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:17:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:17:47] <Lucas_WMDE>	 zabe: do I understand it correctly that db08 is only a replica, and that’s why replacing it without readonly or anything is fine?
[13:18:31] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:830817|Add editcontentmodel right for metawiki translation administrators (T311587)]] (duration: 03m 50s)
[13:18:34] <stashbot>	 T311587: WikiLearn: Integration checklist for MetaWiki - https://phabricator.wikimedia.org/T311587
[13:19:16] <zabe>	 Lucas_WMDE, yes
[13:19:33] <Lucas_WMDE>	 ok
[13:20:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:23:50] <Lucas_WMDE>	 grmbl, I’m trying to check if db09 has the necessary data but can’t get the mysql access to work
[13:23:57] <Lucas_WMDE>	 I’m probably just doing things wrong and being clueless
[13:24:21] <Lucas_WMDE>	 zabe: did you check the replication status?
[13:25:24] <Lucas_WMDE>	 aha, `sudo mysql enwiki` works ^^
[13:26:06] <zabe>	 `show slave status` looks good to me
[13:26:16] <Lucas_WMDE>	 enwiki MAX(rev_id) is the same on db08 and db09, and that’s a revision from this morning, after replication started according to SAL
[13:26:27] <Lucas_WMDE>	 so I think that’s good enough to merge the change
[13:26:32] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Replace deployment-db08 with deployment-db09 [mediawiki-config] - https://gerrit.wikimedia.org/r/833461 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[13:26:37] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Replace deployment-db08 with deployment-db09 [mediawiki-config] - https://gerrit.wikimedia.org/r/833461 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[13:27:18] <wikibugs>	 (Merged) jenkins-bot: Replace deployment-db08 with deployment-db09 [mediawiki-config] - https://gerrit.wikimedia.org/r/833461 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[13:28:16] <Lucas_WMDE>	 syncing in production (not that it’ll have any effect)
[13:29:34] <zabe>	 kicked beta-code-update-eqiad, let's see
[13:30:17] <Lucas_WMDE>	 https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/409940/console has the right mediawiki-config commit, at least
[13:30:19] <Lucas_WMDE>	 good start
[13:30:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:31:34] <Lucas_WMDE>	 uhm, take a look at this though https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/61653/console
[13:31:40] <Lucas_WMDE>	 Cannot access the database: Host '172.16.4.233' is not allowed to connect to this MariaDB server (deployment-db09)
[13:31:50] <Lucas_WMDE>	 zabe: ^
[13:31:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:31:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:32:01] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/db-labs.php: Config: [[gerrit:833461|Replace deployment-db08 with deployment-db09 (T318126)]] (Beta-only, replace one replica with another) (duration: 03m 56s)
[13:32:05] <stashbot>	 T318126: Migrate deployment-prep db hosts to bullseye - https://phabricator.wikimedia.org/T318126
[13:32:34] <zabe>	 probably a missing grant
[13:32:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:34:37] <Lucas_WMDE>	 beta is down now (same error)
[13:34:46] <Lucas_WMDE>	 do you think you can fix the grant or should we roll back?
[13:34:59] <Lucas_WMDE>	 (I wouldn’t know how to fix it)
[13:35:28] <zabe>	 lemme try to fix it
[13:35:32] <Lucas_WMDE>	 ok, thanks
[13:37:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:42:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:42:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:42:22] <Lucas_WMDE>	 looks like db09 only has wikiadmin/wikiuser grants for localhost, whereas db08 has them for 172.16.% and 10.%
[13:42:31] <Lucas_WMDE>	 privilege_type is also different
[13:43:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:43:42] <zabe>	 yes
[13:43:47] <zabe>	 I am a bit confused
[13:45:20] <Lucas_WMDE>	 I’d roll back for now
[13:46:17] <zabe>	 sure
[13:46:28] <Lucas_WMDE>	 maybe we can add db09 with weight 0, and then it can be tested with `sql.php --replicadb deployment-db09`?
[13:46:31] <Lucas_WMDE>	 (not sure if that would work)
[13:46:46] <zabe>	 we can try
[13:46:59] <Lucas_WMDE>	 ok, do you want to upload the change or should I?
[13:47:04] <zabe>	 I can do it
[13:47:07] <Lucas_WMDE>	 ok
[13:48:48] <wikibugs>	 (CR) Jbond: sre.discovery: use CNAME records for swift dns lookup (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/730692 (owner: Giuseppe Lavagetto)
[13:49:51] <wikibugs>	 (PS1) Zabe: Add back deployment-db08 [mediawiki-config] - https://gerrit.wikimedia.org/r/833776 (https://phabricator.wikimedia.org/T318126)
[13:50:33] <wikibugs>	 (CR) CI reject: [V: -1] Add back deployment-db08 [mediawiki-config] - https://gerrit.wikimedia.org/r/833776 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[13:50:35] <wikibugs>	 (PS2) Zabe: Add back deployment-db08 [mediawiki-config] - https://gerrit.wikimedia.org/r/833776 (https://phabricator.wikimedia.org/T318126)
[13:51:05] <zabe>	 Lucas_WMDE, ^
[13:51:11] <Lucas_WMDE>	 ack, looking
[13:51:49] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] "let’s see if this works" [mediawiki-config] - https://gerrit.wikimedia.org/r/833776 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[13:52:01] <zabe>	 (I am a bit confused, because the mysql.user table somehow is empty and thus adding grants fails)
[13:52:18] <Lucas_WMDE>	 o_O
[13:52:31] <wikibugs>	 (Merged) jenkins-bot: Add back deployment-db08 [mediawiki-config] - https://gerrit.wikimedia.org/r/833776 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[13:53:25] <Lucas_WMDE>	 syncing in production
[13:53:55] <Lucas_WMDE>	 beta-code-update-eqiad also running
[13:55:16] <Lucas_WMDE>	 I’ll kick off another update-databases
[13:56:03] <zabe>	 beta is back up
[13:56:17] <Lucas_WMDE>	 indeed
[13:57:10] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/db-labs.php: Config: [[gerrit:833776|Add back deployment-db08 (T318126)]] (Beta-only, restore old replica) (duration: 03m 48s)
[13:57:11] <Lucas_WMDE>	 and yay, `sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php sql.php enwiki --replicadb deployment-db09` produces the DBConnectionError
[13:57:13] <stashbot>	 T318126: Migrate deployment-prep db hosts to bullseye - https://phabricator.wikimedia.org/T318126
[13:57:15] <Lucas_WMDE>	 (on mediawiki12)
[13:57:24] <Lucas_WMDE>	 so looks like it should be possible to test that way
[13:58:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:59:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:59:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:59:28] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:59:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:31] <zabe>	 ok
[13:59:35] <zabe>	 thanks for your help :)
[13:59:41] <Lucas_WMDE>	 np, good luck ^^
[13:59:54] <Lucas_WMDE>	 and thanks for working on improving Beta!
[14:00:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:08:11] <icinga-wm>	 PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:18:54] <zabe>	 btw. a restart of mariadb fixed the missing grants
[14:28:02] <Lucas_WMDE>	 huh
[14:28:04] <Lucas_WMDE>	 ok
[14:28:23] <Lucas_WMDE>	 jouncebot: nowandnext
[14:28:23] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 31 minute(s)
[14:28:23] <jouncebot>	 In 3 hour(s) and 31 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T1800)
[14:28:23] <jouncebot>	 In 3 hour(s) and 31 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T1800)
[14:28:42] <Lucas_WMDE>	 if you want to send another config change to use db09 I think we could deploy that now
[14:29:01] <Lucas_WMDE>	 (zabe ^, I just saw your message from 10 minutes ago)
[14:29:23] <zabe>	 yes
[14:29:26] <zabe>	 will upload a patch
[14:29:28] <Lucas_WMDE>	 ok
[14:31:07] <wikibugs>	 (PS1) Zabe: Pool deployment-db09, depool deployment-db08 [mediawiki-config] - https://gerrit.wikimedia.org/r/833783 (https://phabricator.wikimedia.org/T318126)
[14:32:11] <zabe>	 Lucas_WMDE, ^
[14:32:17] <Lucas_WMDE>	 ack, looking
[14:33:56] <Lucas_WMDE>	 replication seems to be working, both hosts have a new revid I just created
[14:35:05] <wikibugs>	 (PS4) Aqu: Deploy Spark 3 conf and debian pkg to test cluster [puppet] - https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882)
[14:35:12] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Pool deployment-db09, depool deployment-db08 [mediawiki-config] - https://gerrit.wikimedia.org/r/833783 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[14:37:04] <Lucas_WMDE>	 oof, Zuul is busy
[14:39:21] <wikibugs>	 (Merged) jenkins-bot: Pool deployment-db09, depool deployment-db08 [mediawiki-config] - https://gerrit.wikimedia.org/r/833783 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[14:39:25] <wikibugs>	 (CR) CI reject: [V: -1] Deploy Spark 3 conf and debian pkg to test cluster [puppet] - https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882) (owner: Aqu)
[14:40:24] <Lucas_WMDE>	 syncing in production
[14:41:06] <Lucas_WMDE>	 (and I see the code-update is also running)
[14:43:17] <zabe>	 beta still seems to be online
[14:43:47] <wikibugs>	 (PS2) Samtar: prometheus/alerts_beta.yml: Add HostDown alert [puppet] - https://gerrit.wikimedia.org/r/833782 (https://phabricator.wikimedia.org/T315695)
[14:44:08] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/db-labs.php: Config: [[gerrit:833783|Pool deployment-db09, depool deployment-db08 (T318126)]] (Beta-only, exchange one replica for another) (duration: 03m 48s)
[14:44:12] <stashbot>	 T318126: Migrate deployment-prep db hosts to bullseye - https://phabricator.wikimedia.org/T318126
[14:44:49] <Lucas_WMDE>	 https://en.wikipedia.beta.wmflabs.org/wiki/Special:Version claims MariaDB 10.6.8, which matches what I see in db09 (db08 seems to have 10.4)
[14:45:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:45:50] <Lucas_WMDE>	 (db07 is also still on 10.4 ofc)
[14:46:08] <zabe>	 looks good so far
[14:46:31] <Lucas_WMDE>	 syncing on production again because I’m a dummy
[14:46:39] <Lucas_WMDE>	 (harmless, SAL will explain)
[14:46:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:46:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:46:52] <Lucas_WMDE>	 after that I should be done, if anyone else is waiting to do things with the server
[14:47:17] <zabe>	 thanks again for your help
[14:47:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:49:14] <wikibugs>	 (PS1) Majavah: Add golang 1.18 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/833792
[14:50:05] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/db-labs.php: Config: [[gerrit:833783|Pool deployment-db09, depool deployment-db08 (T318126)]] (Beta-only, exchange one replica for another) [*actually* sync it this time since I forgot to git rebase before the last sync 🤦] (duration: 03m 41s)
[14:50:09] <stashbot>	 T318126: Migrate deployment-prep db hosts to bullseye - https://phabricator.wikimedia.org/T318126
[14:50:56] <Lucas_WMDE>	 ok, I’m done
[14:56:35] <Emperor>	 !log set thanos ring replicas to 3.75 T311690
[14:56:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:40] <stashbot>	 T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690
[15:05:59] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The following units failed: session-c340.scope,session-c38.scope,session-c386.scope,session-c42.scope,session-c430.scope,session-c432.scope,session-c435.scope,session-c441.scope,session-c443.scope,session-c471.scope,session-c476.scope,session-c60.scope,session-c67.scope,session-c68.scope,session-c69.scope,session-c71.scope,session-c77.scope MVernon nodes scheduled for decomm https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:59] <icinga-wm>	 ACKNOWLEDGEMENT - Disk space on ms-be2039 is CRITICAL: DISK CRITICAL - free space: / 1964 MB (3% inode=89%): /tmp 1964 MB (3% inode=89%): /var/tmp 1964 MB (3% inode=89%): MVernon nodes scheduled for decomm https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2039&var-datasource=codfw+prometheus/ops
[15:09:25] <icinga-wm>	 RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:13:32] <wikibugs>	 SRE, Traffic, Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (AndyRussG) Hmmm one more note... so, another option would be to add the includer inside banner content, rather than the base HTML. (Maybe that's what you meant by "havin...
[15:34:13] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:44:39] <icinga-wm>	 PROBLEM - SSH on db1098.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:51:02] <wikibugs>	 (CR) Ottomata: Deploy Spark 3 conf and debian pkg to test cluster (6 comments) [puppet] - https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882) (owner: Aqu)
[16:13:58] <wikibugs>	 (CR) Hashar: [C: -2] "https://gerrit-review.googlesource.com/c/gerrit/+/345017 got merged upstream in stable-3.4 branch but is obviously not released yet ;-]" [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: Hashar)
[16:15:54] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[16:35:25] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:45:51] <icinga-wm>	 RECOVERY - SSH on db1098.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:19:30] <wikibugs>	 (PS1) Zabe: Remove deployment-db08 [mediawiki-config] - https://gerrit.wikimedia.org/r/833830 (https://phabricator.wikimedia.org/T318126)
[17:25:28] <wikibugs>	 (CR) Urbanecm: "this will go out later today" [mediawiki-config] - https://gerrit.wikimedia.org/r/819054 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm)
[17:26:09] <wikibugs>	 (PS1) Esanders: Enable history page visual diffs on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/833831
[17:26:18] <wikibugs>	 (PS2) Urbanecm: Growth: Switch pilot wikis to structured mentor list [mediawiki-config] - https://gerrit.wikimedia.org/r/819054 (https://phabricator.wikimedia.org/T310905)
[17:26:53] <wikibugs>	 (CR) CI reject: [V: -1] Enable history page visual diffs on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/833831 (owner: Esanders)
[17:27:47] <wikibugs>	 (PS2) Esanders: Enable history page visual diffs on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/833831
[17:29:37] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:44:34] <wikibugs>	 (PS2) Ebernhardson: cirrus: Limit shard count to 1 in deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/833463 (https://phabricator.wikimedia.org/T316711)
[17:45:06] <wikibugs>	 (CR) CI reject: [V: -1] cirrus: Limit shard count to 1 in deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/833463 (https://phabricator.wikimedia.org/T316711) (owner: Ebernhardson)
[17:54:39] <wikibugs>	 (PS3) Ebernhardson: cirrus: Limit shard count to 1 in deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/833463 (https://phabricator.wikimedia.org/T316711)
[18:00:05] <jouncebot>	 jnuche and dancy: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T1800).
[18:00:05] <jouncebot>	 jnuche and dancy: Time to snap out of that daydream and deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T1800).
[18:00:13] <dancy>	 o/
[18:00:18] <dancy>	 Pressing the button
[18:00:53] <dancy>	 Not pressing the button.  group1 is already at 1.40.0-wmf.2
[18:01:00] <wikibugs>	 (CR) Sbailey: "Enabling dark launch Linter write of namespace and tag and template field code during recordLintJob on test2wiki. Please confirm this is t" [mediawiki-config] - https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) (owner: Sbailey)
[18:14:24] <wikibugs>	 (PS4) Samtar: [DNM] rewrite.py: changes for Phonos deployment [puppet] - https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: MusikAnimal)
[18:15:46] <wikibugs>	 (PS5) Samtar: rewrite.py: changes for Phonos deployment [puppet] - https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: MusikAnimal)
[18:18:56] <wikibugs>	 (CR) Samtar: "Just noting that I tried to cherry pick from `production` to `production` and ended up rebasing (?) this inadvertently — I'm not entirely " [puppet] - https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: MusikAnimal)
[18:21:05] <wikibugs>	 (PS1) DLynch: Enable DiscussionTools visual enhancements as beta on en/dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/833837 (https://phabricator.wikimedia.org/T315625)
[18:33:44] <wikibugs>	 (PS1) Aqu: WIP Puppet test [puppet] - https://gerrit.wikimedia.org/r/833842
[18:34:37] <wikibugs>	 (CR) CI reject: [V: -1] WIP Puppet test [puppet] - https://gerrit.wikimedia.org/r/833842 (owner: Aqu)
[18:37:13] <wikibugs>	 (PS2) Aqu: WIP Puppet test [puppet] - https://gerrit.wikimedia.org/r/833842 (https://phabricator.wikimedia.org/T312882)
[18:38:44] <logmsgbot>	 !log nokafor@deploy1002 Started deploy [analytics/refinery@91d0cf8]: Regular analytics weekly train [analytics/refinery@91d0cf8]
[18:44:24] <logmsgbot>	 !log nokafor@deploy1002 Finished deploy [analytics/refinery@91d0cf8]: Regular analytics weekly train [analytics/refinery@91d0cf8] (duration: 05m 40s)
[18:50:49] <urbanecm>	 jouncebot: nowandnext
[18:50:49] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T1800)
[18:50:50] <jouncebot>	 For the next 1 hour(s) and 9 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T1800)
[18:50:50] <jouncebot>	 In 1 hour(s) and 9 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T2000)
[18:51:20] <urbanecm>	 dancy: since the train seems to be done, may i ship something? or should i wait for later today?
[18:51:40] <dancy>	 The train is done.  The deploy server is all yours
[18:51:43] <urbanecm>	 thanks!
[18:51:50] <wikibugs>	 (CR) Urbanecm: [C: +2] Growth: Switch pilot wikis to structured mentor list [mediawiki-config] - https://gerrit.wikimedia.org/r/819054 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm)
[18:53:46] <wikibugs>	 (Merged) jenkins-bot: Growth: Switch pilot wikis to structured mentor list [mediawiki-config] - https://gerrit.wikimedia.org/r/819054 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm)
[18:55:28] <logmsgbot>	 !log nokafor@deploy1002 Started deploy [analytics/refinery@91d0cf8] (thin): Regular analytics weekly train THIN [analytics/refinery@91d0cf8]
[18:55:36] <logmsgbot>	 !log nokafor@deploy1002 Finished deploy [analytics/refinery@91d0cf8] (thin): Regular analytics weekly train THIN [analytics/refinery@91d0cf8] (duration: 00m 08s)
[18:56:45] <wikibugs>	 (PS1) Urbanecm: Growth: Do not switch eswiki to structured mentor list [mediawiki-config] - https://gerrit.wikimedia.org/r/833843 (https://phabricator.wikimedia.org/T310905)
[18:56:57] <wikibugs>	 (CR) Urbanecm: [C: +2] Growth: Do not switch eswiki to structured mentor list [mediawiki-config] - https://gerrit.wikimedia.org/r/833843 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm)
[18:56:59] <wikibugs>	 (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/833844
[18:57:55] <wikibugs>	 (Merged) jenkins-bot: Growth: Do not switch eswiki to structured mentor list [mediawiki-config] - https://gerrit.wikimedia.org/r/833843 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm)
[19:00:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:01:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:01:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:02:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:04:20] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b8b2ebd3933cb891b62bb6aea01b2342c017cec8: Growth: Switch pilot wikis to structured mentor list (T310905) (duration: 03m 59s)
[19:04:24] <stashbot>	 T310905: Deploy structured wikitext mentor list to Wikimedia wikis - https://phabricator.wikimedia.org/T310905
[19:05:19] <urbanecm>	 done
[19:07:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:08:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:08:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:09:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:27:40] <wikibugs>	 (PS1) BCornwall: Add latency measurement program [software/latency-measurement] - https://gerrit.wikimedia.org/r/833848 (https://phabricator.wikimedia.org/T315536)
[19:29:07] <wikibugs>	 (Abandoned) BCornwall: utils: Add latency measurement program [dns] - https://gerrit.wikimedia.org/r/824452 (https://phabricator.wikimedia.org/T315536) (owner: MMandere)
[19:30:02] <Tpt>	 Hi! There is a small regression on Wikisource (the proofreading progress indicator ends up having a 0 width). I have pushed a fix for review: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/832705
[19:30:02] <Tpt>	 A +2 would be very welcome to be able to push it as part of the next window.
[19:33:21] <logmsgbot>	 !log nokafor@deploy1002 Started deploy [airflow-dags/analytics@ce20ecd]: (no justification provided)
[19:33:31] <logmsgbot>	 !log nokafor@deploy1002 Finished deploy [airflow-dags/analytics@ce20ecd]: (no justification provided) (duration: 00m 10s)
[19:35:35] <wikibugs>	 SRE, Traffic, Patch-For-Review: Create program to interact with Atlas RIPE API - https://phabricator.wikimedia.org/T315536 (BCornwall) a:BCornwall
[19:36:19] <wikibugs>	 SRE-swift-storage, Beta-Cluster-Infrastructure: deployment-ms-be05 Swift object-replicator sync error: Connection refused - https://phabricator.wikimedia.org/T318268 (TheresNoTime)
[19:39:03] <icinga-wm>	 RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:39:56] <zabe>	 +2'ed
[19:41:39] <wikibugs>	 SRE-swift-storage, Beta-Cluster-Infrastructure: deployment-ms-be05 Swift object-replicator sync error: Connection refused - https://phabricator.wikimedia.org/T318268 (TheresNoTime) I added ` Ingress  IPv4  TCP  6000  172.16.0.0/21 ` to the `swift-be` [[ https://horizon.wikimedia.org/project/security_grou...
[19:44:17] <wikibugs>	 SRE-swift-storage, Beta-Cluster-Infrastructure: deployment-ms-be05 Swift object-replicator sync error: Connection refused - https://phabricator.wikimedia.org/T318268 (TheresNoTime) Open→Resolved a:TheresNoTime Crossing fingers and assuming that's all it was  (**//why??//**)
[19:50:49] <wikibugs>	 (CR) Zabe: [C: +1] scap.cfg.erb: Set initial value of beta_only_config_files [puppet] - https://gerrit.wikimedia.org/r/833455 (https://phabricator.wikimedia.org/T317242) (owner: Ahmon Dancy)
[19:55:58] <wikibugs>	 (PS1) BCornwall: Convert camel-case function names to snake case [software/latency-measurement] - https://gerrit.wikimedia.org/r/833851
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, and TheresNoTime: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T2000). Please do the needful.
[20:00:05] <jouncebot>	 zabe and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:20] <zabe>	 o/
[20:00:28] <Kemayo>	 👋🏻
[20:00:30] <TheresNoTime>	 hiya
[20:00:59] <TheresNoTime>	 I can deploy! Gimme a sec, just closing a million tabs
[20:01:09] <ebernhardson>	 silly me, i put my patch in the earlier deploy window, i have one too (it's deployment-prep only, no-op in prod)
[20:01:54] <TheresNoTime>	 no worries ebernhardson !
[20:02:40] <TheresNoTime>	 I'll do yours first zabe :) see how much it breaks /s
[20:03:24] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/833830 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[20:03:55] <zabe>	 ;)
[20:03:59] <wikibugs>	 (Merged) jenkins-bot: Remove deployment-db08 [mediawiki-config] - https://gerrit.wikimedia.org/r/833830 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[20:04:12] <zabe>	 there should actually be no real chance of stuff breaking since deployment-db08 is already depooled
[20:04:30] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:833830|Remove deployment-db08 (T318126)]]
[20:04:34] <stashbot>	 T318126: Migrate deployment-prep db hosts to bullseye - https://phabricator.wikimedia.org/T318126
[20:04:55] <logmsgbot>	 !log samtar@deploy1002 samtar and zabe: Backport for [[gerrit:833830|Remove deployment-db08 (T318126)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[20:05:24] <TheresNoTime>	 zabe: going to just go ahead and sync
[20:05:47] <zabe>	 yup
[20:06:14] <wikibugs>	 (PS2) Samtar: Enable DiscussionTools visual enhancements as beta on en/dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/833837 (https://phabricator.wikimedia.org/T315625) (owner: DLynch)
[20:09:46] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:833830|Remove deployment-db08 (T318126)]] (duration: 05m 16s)
[20:09:50] <stashbot>	 T318126: Migrate deployment-prep db hosts to bullseye - https://phabricator.wikimedia.org/T318126
[20:10:06] <TheresNoTime>	 Done :) Kemayo: you're up next :)
[20:10:07] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:10:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:10:22] <Kemayo>	 TheresNoTime: Exciting!
[20:10:35] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/833837 (https://phabricator.wikimedia.org/T315625) (owner: DLynch)
[20:11:15] <wikibugs>	 (Merged) jenkins-bot: Enable DiscussionTools visual enhancements as beta on en/dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/833837 (https://phabricator.wikimedia.org/T315625) (owner: DLynch)
[20:11:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:11:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:11:41] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:833837|Enable DiscussionTools visual enhancements as beta on en/dewiki (T315625)]]
[20:11:45] <stashbot>	 T315625: [Config Change] Enable Topic Containers as beta feature at Phase 2 wikis (desktop) - https://phabricator.wikimedia.org/T315625
[20:12:05] <logmsgbot>	 !log samtar@deploy1002 samtar and kemayo: Backport for [[gerrit:833837|Enable DiscussionTools visual enhancements as beta on en/dewiki (T315625)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[20:12:26] <TheresNoTime>	 Kemayo: live on mwdebug1001 if you could test please :)
[20:12:55] <Kemayo>	 TheresNoTime: Looks good
[20:13:04] <TheresNoTime>	 syncing 
[20:14:41] <TheresNoTime>	 zabe: FYI seeing `Wikimedia\Rdbms\LoadMonitor::computeServerStates: host deployment-db08 is not replicating?` errors
[20:15:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:15:11] <TheresNoTime>	 (from `deployment-jobrunner04`)
[20:15:31] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:15:38] <zabe>	 hmm
[20:15:54] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[20:16:21] <TheresNoTime>	 and `Server deployment-db08 has 25 seconds of lag (>= 6)` which is perhaps a little inaccurate :p
[20:16:31] <zabe>	 I just disabled replication there
[20:16:50] <zabe>	 but it should no longer be used
[20:17:13] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:833837|Enable DiscussionTools visual enhancements as beta on en/dewiki (T315625)]] (duration: 05m 31s)
[20:17:17] <stashbot>	 T315625: [Config Change] Enable Topic Containers as beta feature at Phase 2 wikis (desktop) - https://phabricator.wikimedia.org/T315625
[20:17:57] <wikibugs>	 (PS1) Gergő Tisza: Block metrics: Bump schema to un-require some fields [extensions/WikimediaEvents] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/833809 (https://phabricator.wikimedia.org/T317343)
[20:18:08] <TheresNoTime>	 ebernhardson: up next, beta only right?
[20:18:12] <wikibugs>	 (PS1) Gergő Tisza: Block metrics: Bump schema to un-require some fields [extensions/WikimediaEvents] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/833810 (https://phabricator.wikimedia.org/T317343)
[20:18:22] <wikibugs>	 (PS4) Samtar: cirrus: Limit shard count to 1 in deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/833463 (https://phabricator.wikimedia.org/T316711) (owner: Ebernhardson)
[20:18:24] <ebernhardson>	 TheresNoTime: yups, it's just the InitialiseSettings-labs.php file
[20:19:26] <wikibugs>	 SRE, Data Pipelines, Infrastructure-Foundations, puppet-compiler: Systematic PCC error - https://phabricator.wikimedia.org/T318281 (Antoine_Quhen)
[20:19:36] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/833463 (https://phabricator.wikimedia.org/T316711) (owner: Ebernhardson)
[20:20:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:20:23] <wikibugs>	 (Merged) jenkins-bot: cirrus: Limit shard count to 1 in deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/833463 (https://phabricator.wikimedia.org/T316711) (owner: Ebernhardson)
[20:20:49] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:833463|cirrus: Limit shard count to 1 in deployment-prep (T316711)]]
[20:20:52] <stashbot>	 T316711: Reduce shard count on all wikis in beta cluster to 1 - https://phabricator.wikimedia.org/T316711
[20:21:12] <logmsgbot>	 !log samtar@deploy1002 samtar and ebernhardson: Backport for [[gerrit:833463|cirrus: Limit shard count to 1 in deployment-prep (T316711)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[20:21:21] <TheresNoTime>	 (syncing)
[20:21:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:21:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:22:01] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:22:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:23:01] <tgr_>	 I have some last-minute additions to the backport window. Can deploy myself if preferred.
[20:24:09] <TheresNoTime>	 tgr_: I don't mind deploying :)
[20:24:42] <ebernhardson>	 TheresNoTime: thanks!
[20:24:51] <TheresNoTime>	 ebernhardson: no worries :)
[20:24:58] <tgr_>	 Thanks TheresNoTime ! Just EventGate schema version changes, don't need testing.
[20:25:09] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:833463|cirrus: Limit shard count to 1 in deployment-prep (T316711)]] (duration: 04m 19s)
[20:25:17] <TheresNoTime>	 Awesome, starting now :)
[20:26:04] <wikibugs>	 SRE, Data Pipelines, Infrastructure-Foundations, puppet-compiler: Systematic PCC error - https://phabricator.wikimedia.org/T318281 (RLazarus) I was wondering if https://gerrit.wikimedia.org/r/c/operations/puppet/+/831230 might be related but I couldn't work out exactly what's going on here.  As @...
[20:26:41] <TheresNoTime>	 tgr_: ah hm, this would be a first for me as there's dependencies on this deploy (I think?) — would you mind self-deploying?
[20:27:22] <tgr_>	 Sure.
[20:27:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:28:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:28:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:28:48] <wikibugs>	 (CR) Gergő Tisza: [C: +2] Block metrics: Bump schema to un-require some fields [extensions/WikimediaEvents] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/833809 (https://phabricator.wikimedia.org/T317343) (owner: Gergő Tisza)
[20:28:51] <wikibugs>	 (CR) Gergő Tisza: [C: +2] Block metrics: Bump schema to un-require some fields [extensions/WikimediaEvents] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/833810 (https://phabricator.wikimedia.org/T317343) (owner: Gergő Tisza)
[20:29:01] <TheresNoTime>	 (is the process exactly the same?)
[20:29:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:30:32] <tgr_>	 in this case the dependency doesn't really matter, it's to the schemas/event repo and that has its own deployment process (which runs automatically on commit)
[20:30:48] <wikibugs>	 (Merged) jenkins-bot: Block metrics: Bump schema to un-require some fields [extensions/WikimediaEvents] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/833809 (https://phabricator.wikimedia.org/T317343) (owner: Gergő Tisza)
[20:30:50] <wikibugs>	 (Merged) jenkins-bot: Block metrics: Bump schema to un-require some fields [extensions/WikimediaEvents] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/833810 (https://phabricator.wikimedia.org/T317343) (owner: Gergő Tisza)
[20:30:50] <tgr_>	 usually, the process is the same, you just need to make sure to deploy the dependency first
[20:31:01] <TheresNoTime>	 ah! thank you :)
[20:31:29] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:36:53] <logmsgbot>	 !log tgr@deploy1002 Synchronized php-1.40.0-wmf.1/extensions/WikimediaEvents/includes/BlockMetrics/BlockMetricsHooks.php: Backport: [[gerrit:833809|Block metrics: Bump schema to un-require some fields (T317343)]] (duration: 03m 55s)
[20:36:57] <stashbot>	 T317343: Eventgate error: '' should have required property 'database', '' should have required property 'performer' - https://phabricator.wikimedia.org/T317343
[20:39:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:43:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:44:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:44:58] <logmsgbot>	 !log tgr@deploy1002 Synchronized php-1.40.0-wmf.2/extensions/WikimediaEvents/includes/BlockMetrics/BlockMetricsHooks.php: Backport: [[gerrit:833810|Block metrics: Bump schema to un-require some fields (T317343)]] (duration: 03m 42s)
[20:45:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:45:01] <stashbot>	 T317343: Eventgate error: '' should have required property 'database', '' should have required property 'performer' - https://phabricator.wikimedia.org/T317343
[20:46:37] <tgr_>	 !log UTC late deploys done
[20:46:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:07] <wikibugs>	 SRE, InternetArchiveBot: Request for increase request limit for InternetArchiveBot - https://phabricator.wikimedia.org/T318284 (Harej)
[20:49:22] <wikibugs>	 SRE, InternetArchiveBot: Request for increase request limit for InternetArchiveBot - https://phabricator.wikimedia.org/T318284 (Harej)
[20:50:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:50:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:50:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:51:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:05:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET clusterinformations) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:10:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET clusterinformations) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:17:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET clusterinformations) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:22:28] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET clusterinformations) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:28:59] <wikibugs>	 (PS1) BCornwall: readme: Cleanups, clarifications, and typo fixes [software/latency-measurement] - https://gerrit.wikimedia.org/r/833855
[21:31:38] <wikibugs>	 (PS1) Jbond: puppet_compiler: fix wmcs environment name [puppet] - https://gerrit.wikimedia.org/r/833856 (https://phabricator.wikimedia.org/T318281)
[21:32:41] <wikibugs>	 (CR) Jbond: [C: +2] puppet_compiler: fix wmcs environment name [puppet] - https://gerrit.wikimedia.org/r/833856 (https://phabricator.wikimedia.org/T318281) (owner: Jbond)
[21:39:27] <wikibugs>	 SRE, Data Pipelines, Infrastructure-Foundations, puppet-compiler, Patch-For-Review: Systematic PCC error - https://phabricator.wikimedia.org/T318281 (jbond) >>! In T318281#8251715, @RLazarus wrote: > I was wondering if https://gerrit.wikimedia.org/r/c/operations/puppet/+/831230 might be relat...
[21:49:25] <wikibugs>	 (CR) Samtar: [C: +1] scap.cfg.erb: Set initial value of beta_only_config_files [puppet] - https://gerrit.wikimedia.org/r/833455 (https://phabricator.wikimedia.org/T317242) (owner: Ahmon Dancy)
[22:18:39] <wikibugs>	 (PS1) Legoktm: Revert "admin: Temporarily disable legoktm's access" [puppet] - https://gerrit.wikimedia.org/r/833812
[22:18:55] <wikibugs>	 (PS2) Legoktm: Revert "admin: Temporarily disable legoktm's access" [puppet] - https://gerrit.wikimedia.org/r/833812
[22:34:11] <wikibugs>	 (PS1) Ryan Kemper: elastic: rebalance enwiki_content shard counts [mediawiki-config] - https://gerrit.wikimedia.org/r/833860 (https://phabricator.wikimedia.org/T318270)
[22:34:13] <wikibugs>	 (PS1) Ryan Kemper: elastic: allow only 1 enwiki_content per host [mediawiki-config] - https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T318270)
[22:35:00] <wikibugs>	 (CR) CI reject: [V: -1] elastic: allow only 1 enwiki_content per host [mediawiki-config] - https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T318270) (owner: Ryan Kemper)
[22:47:51] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:48:17] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:58:19] <wikibugs>	 (PS2) Ryan Kemper: elastic: rebalance enwiki_content shard counts [mediawiki-config] - https://gerrit.wikimedia.org/r/833860 (https://phabricator.wikimedia.org/T318270)
[22:58:21] <wikibugs>	 (PS2) Ryan Kemper: elastic: allow only 1 enwiki_content per host [mediawiki-config] - https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T318270)
[22:59:05] <wikibugs>	 (PS3) Ryan Kemper: elastic: rebalance enwiki_content shard counts [mediawiki-config] - https://gerrit.wikimedia.org/r/833860 (https://phabricator.wikimedia.org/T318270)
[22:59:07] <wikibugs>	 (PS3) Ryan Kemper: elastic: allow only 1 enwiki_content per host [mediawiki-config] - https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T318270)
[23:20:21] <icinga-wm>	 PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:23:27] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:23:27] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:24:29] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:25:04] <wikibugs>	 (CR) Arlolra: Enable Linter write of namespace tag and template fields on test2wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) (owner: Sbailey)
[23:25:27] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:31:24] <wikibugs>	 (CR) Arlolra: Enable Linter write of namespace tag and template fields on test2wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) (owner: Sbailey)
[23:36:20] <wikibugs>	 (CR) RLazarus: [C: +2] "Verified this is the same key that was removed in the original commit, so I'm forgoing the usual video chat to confirm identity." [puppet] - https://gerrit.wikimedia.org/r/833812 (owner: Legoktm)
[23:41:51] <rzl>	 legoktm: wb :) merged, and ran puppet manually on the bastions, feel free to test if you're around -- I'll let the other hosts update normally
[23:44:39] <wikibugs>	 (PS1) Zabe: beta: Add deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/833865 (https://phabricator.wikimedia.org/T318126)
[23:44:41] <wikibugs>	 (PS1) Zabe: beta: Pool deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/833866 (https://phabricator.wikimedia.org/T318126)
[23:45:54] <wikibugs>	 (PS2) Zabe: beta: Pool deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/833866 (https://phabricator.wikimedia.org/T318126)