[00:03:53] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:04:17] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:10:47] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48682 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:11:11] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.303 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:15:54] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:32:49] RECOVERY - Disk space on dumpsdata1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1001&var-datasource=eqiad+prometheus/ops
[01:37:45] (JobUnavailable) firing: (7) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:39:01] RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:05] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:07] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:56:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:00:11] (PS5) Abijeet Patro: Add editcontentmodel right for metawiki translation administrators [mediawiki-config] - https://gerrit.wikimedia.org/r/830817
(https://phabricator.wikimedia.org/T311587)
[04:15:54] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[04:18:17] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:41:15] SRE, Traffic, Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (AndyRussG) Heyy just some quick thoughts and questions about testing with live user requests... - So, we could use CentralNotice to add the includer HTML comment string...
[05:01:09] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:18:59] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:19:29] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:22:13] PROBLEM - SSH on mw1311.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:23:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:32:57] (PS1) ArielGlenn: keep minimum older sql/xml dump files on generation hosts [puppet] - https://gerrit.wikimedia.org/r/833625 (https://phabricator.wikimedia.org/T318206)
[05:36:47] (CR) ArielGlenn: [C: +2] keep minimum older sql/xml dump files on generation hosts [puppet] - https://gerrit.wikimedia.org/r/833625 (https://phabricator.wikimedia.org/T318206) (owner: ArielGlenn)
[05:53:55] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:57:41] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:00:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:02:21] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:02:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:05:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:07:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging -
https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:34:59] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:35:37] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:39] RECOVERY - SSH on analytics1077.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:56:06] SRE, SRE-Access-Requests, Data-Engineering-Operations: Requesting Kerberos access for alinebruenger and siko - https://phabricator.wikimedia.org/T316766 (Siko_WMDE) Hi @Ottomata, Got the E-Mail! Thank you :-)
[06:58:11] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:58:57] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:05] Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T0700)
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:20] indeed, nothing to do
[07:06:13] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:24:53] RECOVERY - SSH on mw1311.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:05] jnuche and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T0800).
[08:02:09] (PS1) TrainBranchBot: group1 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833707 (https://phabricator.wikimedia.org/T314191)
[08:02:11] (CR) TrainBranchBot: [C: +2] group1 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833707 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot)
[08:02:26] (CR) CI reject: [V: -1] group1 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833707 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot)
[08:07:22] (PS2) Awight: Enable QuickSurveys on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/832393 (https://phabricator.wikimedia.org/T317841)
[08:07:31] (CR) CI reject: [V: -1] Enable QuickSurveys on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/832393 (https://phabricator.wikimedia.org/T317841) (owner: Awight)
[08:07:35] !log Restarting Gerrit to clear stalled sockets in Zuul
[08:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:18] (PS3) Awight: Enable QuickSurveys on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/832393 (https://phabricator.wikimedia.org/T317841)
[08:10:18] (PS1) TrainBranchBot: group1 wikis to 1.40.0-wmf.2 [mediawiki-config] -
https://gerrit.wikimedia.org/r/833726 (https://phabricator.wikimedia.org/T314191)
[08:10:20] (CR) TrainBranchBot: [C: +2] group1 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833726 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot)
[08:11:07] (Merged) jenkins-bot: group1 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833726 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot)
[08:12:14] (Abandoned) Jaime Nuche: group1 wikis to 1.40.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/833707 (https://phabricator.wikimedia.org/T314191) (owner: TrainBranchBot)
[08:14:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:15:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:15:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:15:30] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.2 refs T314191
[08:15:34] T314191: 1.40.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T314191
[08:15:54] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:15:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:19:33] !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.2 refs T314191 (duration: 04m 02s)
[08:21:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:22:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE
helmfile.d/services/mwdebug: apply
[08:22:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:24:29] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:25:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:51:17] (CR) Nikerabbit: [C: +1] Add editcontentmodel right for metawiki translation administrators [mediawiki-config] - https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587) (owner: Abijeet Patro)
[08:55:57] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c4122.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:45] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:00:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:05:25] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:05:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:21:58] (PS7) Gmodena: charts:eventgate bump common_templates and standardize labels [deployment-charts] -
https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[09:25:47] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:33:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:38:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:39:12] (PS1) ArielGlenn: start daily cleanup job of sql/xmldumps later in the day [puppet] - https://gerrit.wikimedia.org/r/833736 (https://phabricator.wikimedia.org/T318206)
[09:41:26] (CR) ArielGlenn: [C: +2] start daily cleanup job of sql/xmldumps later in the day [puppet] - https://gerrit.wikimedia.org/r/833736 (https://phabricator.wikimedia.org/T318206) (owner: ArielGlenn)
[09:48:34] SRE, Gerrit, Traffic, Patch-For-Review, Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (hashar) >>! In T191183#8192855, @kostajh wrote: > Coming back to this again... since the Gravatar issue (T263161) is unlikely to move...
[10:06:49] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:06:49] SRE, Gerrit, Traffic, Patch-For-Review, Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (kostajh) >>!
In T191183#8249977, @hashar wrote: >>>! In T191183#8192855, @kostajh wrote: >> Coming back to this again... since the Gra...
[10:10:16] (PS8) Gmodena: charts:eventgate bump common_templates and standardize labels [deployment-charts] - https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[10:42:52] SRE: Add annotations from ops vendor maintenance calendar to Grafana - https://phabricator.wikimedia.org/T223934 (colewhite)
[10:42:55] SRE, Observability-Logging, Patch-For-Review: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (colewhite)
[10:48:27] SRE, Observability-Metrics: Add annotations from ops vendor maintenance calendar to Grafana - https://phabricator.wikimedia.org/T223934 (colewhite) Tagging observability-metrics because while logging could handle it, but it may not be the most efficient way to get this information in. We have SimpleJSON...
[10:50:31] SRE, Observability-Logging, Patch-For-Review: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (colewhite) In progress→Resolved MVP achieved. Further iterations and features should come in separately.
[11:09:07] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:16:09] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[11:18:25] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[11:29:03] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:45:15] PROBLEM - Host dns5002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:45:21] PROBLEM - Host dns5001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:45:25] PROBLEM - Host cp5013 is DOWN: PING CRITICAL - Packet loss = 100%
[11:45:37] RECOVERY - Host cp5013 is UP: PING WARNING - Packet loss = 60%, RTA = 306.24 ms
[11:45:37] RECOVERY - Host dns5001 is UP: PING WARNING - Packet loss = 33%, RTA = 306.06 ms
[11:45:39] RECOVERY - Host dns5002 is UP: PING OK - Packet loss = 0%, RTA = 305.89 ms
[11:45:54] umm
[12:09:21] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:10:21] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:15:54] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[12:29:51] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL -
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:31:23] (PS2) Aqu: Deploy Spark 3 conf and debian pkg to test cluster [puppet] - https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882)
[12:33:50] (PS3) Aqu: Deploy Spark 3 conf and debian pkg to test cluster [puppet] - https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882)
[13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T1300)
[13:00:05] arlolra, abijeet, and zabe: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:13] o/
[13:00:22] o/
[13:00:33] I can deploy!
[13:00:34] o/
[13:00:38] I'm here, but mobile only
[13:00:44] Lucas_WMDE: go for it!
[13:01:04] here
[13:01:21] let’s start with the brave enwikivoyage pioneers
[13:01:27] o/
[13:01:48] (PS4) Lucas Werkmeister (WMDE): Disable wgParserEnableLegacyMediaDOM on enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/830707 (https://phabricator.wikimedia.org/T314318) (owner: Arlolra)
[13:01:52] (CR) Lucas Werkmeister (WMDE): [C: +2] Disable wgParserEnableLegacyMediaDOM on enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/830707 (https://phabricator.wikimedia.org/T314318) (owner: Arlolra)
[13:02:33] (Merged) jenkins-bot: Disable wgParserEnableLegacyMediaDOM on enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/830707 (https://phabricator.wikimedia.org/T314318) (owner: Arlolra)
[13:03:05] arlolra: the enwikivoyage change is on mwdebug1001, can you test it?
[13:03:15] (PS6) Abijeet Patro: Add editcontentmodel right for metawiki translation administrators [mediawiki-config] - https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587)
[13:03:16] Yes
[13:03:17] (PS1) Jforrester: Move non-variant wgMFNearby to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/833770
[13:03:19] (PS1) Jforrester: Move non-variant wgMFUseWikibase to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/833771
[13:03:31] 👀
[13:04:54] Lucas_WMDE: looks good
[13:05:12] ok, thanks!
[13:05:22] thank you
[13:05:46] syncing
[13:09:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:09:47] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:830707|Disable wgParserEnableLegacyMediaDOM on enwikivoyage (T314318)]] (turning on new-style media output) (duration: 04m 03s)
[13:09:50] T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318
[13:10:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:10:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:11:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:11:13] (CR) Lucas Werkmeister (WMDE): [C: +2] "Looks like there are some concerns that it should be possible to create message bundles without this right, but no objections to granting " [mediawiki-config] - https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587) (owner: Abijeet Patro)
[13:11:57] (Merged) jenkins-bot: Add editcontentmodel right for metawiki translation administrators [mediawiki-config] - https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587) (owner: Abijeet Patro)
[13:12:28] abijeet: the change is on mwdebug1001, please test
[13:12:35] ok, checking
[13:13:21] (looks good on my end)
[13:14:28] Lucas_WMDE, looks good to me too
[13:14:33] great, thanks!
[13:14:42] syncing
[13:14:54] thank you!
[13:16:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:17:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:17:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:17:47] zabe: do I understand it correctly that db08 is only a replica, and that’s why replacing it without readonly or anything is fine?
[13:18:31] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:830817|Add editcontentmodel right for metawiki translation administrators (T311587)]] (duration: 03m 50s)
[13:18:34] T311587: WikiLearn: Integration checklist for MetaWiki - https://phabricator.wikimedia.org/T311587
[13:19:16] Lucas_WMDE, yes
[13:19:33] ok
[13:20:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:23:50] grmbl, I’m trying to check if db09 has the necessary data but can’t get the mysql access to work
[13:23:57] I’m probably just doing things wrong and being clueless
[13:24:21] zabe: did you check the replication status?
[13:25:24] aha, `sudo mysql enwiki` works ^^
[13:26:06] `show slave status` looks good to me
[13:26:16] enwiki MAX(rev_id) is the same on db08 and db09, and that’s a revision from this morning, after replication started according to SAL
[13:26:27] so I think that’s good enough to merge the change
[13:26:32] (PS2) Lucas Werkmeister (WMDE): Replace deployment-db08 with deployment-db09 [mediawiki-config] - https://gerrit.wikimedia.org/r/833461 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[13:26:37] (CR) Lucas Werkmeister (WMDE): [C: +2] Replace deployment-db08 with deployment-db09 [mediawiki-config] - https://gerrit.wikimedia.org/r/833461 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[13:27:18] (Merged) jenkins-bot: Replace deployment-db08 with deployment-db09 [mediawiki-config] - https://gerrit.wikimedia.org/r/833461 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[13:28:16] syncing in production (not that it’ll have any effect)
[13:29:34] kicked beta-code-update-eqiad, let's see
[13:30:17] https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/409940/console has the right mediawiki-config commit, at least
[13:30:19] good start
[13:30:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:31:34] uhm, take a look at this though https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/61653/console
[13:31:40] Cannot access the database: Host '172.16.4.233' is not allowed to connect to this MariaDB server (deployment-db09)
[13:31:50] zabe: ^
[13:31:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:31:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:32:01] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/db-labs.php: Config: [[gerrit:833461|Replace deployment-db08 with deployment-db09 (T318126)]] (Beta-only, replace one
replica with another) (duration: 03m 56s)
[13:32:05] T318126: Migrate deployment-prep db hosts to bullseye - https://phabricator.wikimedia.org/T318126
[13:32:34] probably a missing grant
[13:32:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:34:37] beta is down now (same error)
[13:34:46] do you think you can fix the grant or should we roll back?
[13:34:59] (I wouldn’t know how to fix it)
[13:35:28] lemme try to fix it
[13:35:32] ok, thanks
[13:37:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:42:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:42:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:42:22] looks like db09 only has wikiadmin/wikiuser grants for localhost, whereas db08 has them for 172.16.% and 10.%
[13:42:31] privilege_type is also different
[13:43:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:43:42] yes
[13:43:47] I am a bit confused
[13:45:20] I’d roll back for now
[13:46:17] sure
[13:46:28] maybe we can add db09 with weight 0, and then it can be tested with `sql.php --replicadb deployment-db09`?
[13:46:31] (not sure if that would work)
[13:46:46] we can try
[13:46:59] ok, do you want to upload the change or should I?
[13:47:04] I can do it
[13:47:07] ok
[13:48:48] (CR) Jbond: sre.discovery: use CNAME records for swift dns lookup (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/730692 (owner: Giuseppe Lavagetto)
[13:49:51] (PS1) Zabe: Add back deployment-db08 [mediawiki-config] - https://gerrit.wikimedia.org/r/833776 (https://phabricator.wikimedia.org/T318126)
[13:50:33] (CR) CI reject: [V: -1] Add back deployment-db08 [mediawiki-config] - https://gerrit.wikimedia.org/r/833776 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[13:50:35] (PS2) Zabe: Add back deployment-db08 [mediawiki-config] - https://gerrit.wikimedia.org/r/833776 (https://phabricator.wikimedia.org/T318126)
[13:51:05] Lucas_WMDE, ^
[13:51:11] ack, looking
[13:51:49] (CR) Lucas Werkmeister (WMDE): [C: +2] "let’s see if this works" [mediawiki-config] - https://gerrit.wikimedia.org/r/833776 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[13:52:01] (I am a bit confused, because the mysql.user table somehow is empty and thus adding grants fails)
[13:52:18] o_O
[13:52:31] (Merged) jenkins-bot: Add back deployment-db08 [mediawiki-config] - https://gerrit.wikimedia.org/r/833776 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[13:53:25] syncing in production
[13:53:55] beta-code-update-eqiad also running
[13:55:16] I’ll kick off another update-databases
[13:56:03] beta is back up
[13:56:17] indeed
[13:57:10] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/db-labs.php: Config: [[gerrit:833776|Add back deployment-db08 (T318126)]] (Beta-only, restore old replica) (duration: 03m 48s)
[13:57:11] and yay, `sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php sql.php enwiki --replicadb deployment-db09` produces the DBConnectionError
[13:57:13] T318126: Migrate deployment-prep db hosts to bullseye - https://phabricator.wikimedia.org/T318126
[13:57:15] (on mediawiki12)
[13:57:24] so looks like it should be possible to
test that way
[13:58:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:59:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:59:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:59:28] !log UTC afternoon backport+config window done
[13:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:31] ok
[13:59:35] thanks for your help :)
[13:59:41] np, good luck ^^
[13:59:54] and thanks for working on improving Beta!
[14:00:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:08:11] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:18:54] btw. a restart of mariadb fixed the missing grants
[14:28:02] huh
[14:28:04] ok
[14:28:23] jouncebot: nowandnext
[14:28:23] No deployments scheduled for the next 3 hour(s) and 31 minute(s)
[14:28:23] In 3 hour(s) and 31 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T1800)
[14:28:23] In 3 hour(s) and 31 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T1800)
[14:28:42] if you want to send another config change to use db09 I think we could deploy that now
[14:29:01] (zabe ^, I just saw your message from 10 minutes ago)
[14:29:23] yes
[14:29:26] will upload a patch
[14:29:28] ok
[14:31:07] (PS1) Zabe: Pool deployment-db09, depool deployment-db08 [mediawiki-config] - https://gerrit.wikimedia.org/r/833783 (https://phabricator.wikimedia.org/T318126)
[14:32:11] Lucas_WMDE, ^
[14:32:17] ack, looking
[14:33:56] replication seems to be working, both hosts have a new revid I just created
[14:35:05] (PS4) Aqu: Deploy Spark 3 conf and debian pkg to test
cluster [puppet] - 10https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882) [14:35:12] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Pool deployment-db09, depool deployment-db08 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833783 (https://phabricator.wikimedia.org/T318126) (owner: 10Zabe) [14:37:04] oof, Zuul is busy [14:39:21] (03Merged) 10jenkins-bot: Pool deployment-db09, depool deployment-db08 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833783 (https://phabricator.wikimedia.org/T318126) (owner: 10Zabe) [14:39:25] (03CR) 10CI reject: [V: 04-1] Deploy Spark 3 conf and debian pkg to test cluster [puppet] - 10https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [14:40:24] syncing in production [14:41:06] (and I see the code-update is also running) [14:43:17] beta still seems to be online [14:43:47] (03PS2) 10Samtar: prometheus/alerts_beta.yml: Add HostDown alert [puppet] - 10https://gerrit.wikimedia.org/r/833782 (https://phabricator.wikimedia.org/T315695) [14:44:08] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/db-labs.php: Config: [[gerrit:833783|Pool deployment-db09, depool deployment-db08 (T318126)]] (Beta-only, exchange one replica for another) (duration: 03m 48s) [14:44:12] T318126: Migrate deployment-prep db hosts to bullseye - https://phabricator.wikimedia.org/T318126 [14:44:49] https://en.wikipedia.beta.wmflabs.org/wiki/Special:Version claims MariaDB 10.6.8, which matches what I see in db09 (db08 seems to have 10.4) [14:45:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:45:50] (db07 is also still on 10.4 ofc) [14:46:08] looks good so far [14:46:31] syncing on production again because I’m a dummy [14:46:39] (harmless, SAL will explain) [14:46:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:46:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply [14:46:52] after that I should be done, if anyone else is waiting to do things with the server [14:47:17] thanks again for your help [14:47:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:49:14] (03PS1) 10Majavah: Add golang 1.18 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/833792 [14:50:05] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/db-labs.php: Config: [[gerrit:833783|Pool deployment-db09, depool deployment-db08 (T318126)]] (Beta-only, exchange one replica for another) [*actually* sync it this time since I forgot to git rebase before the last sync 🤦] (duration: 03m 41s) [14:50:09] T318126: Migrate deployment-prep db hosts to bullseye - https://phabricator.wikimedia.org/T318126 [14:50:56] ok, I’m done [14:56:35] !log set thanos ring replicas to 3.75 T311690 [14:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:40] T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 [15:05:59] ACKNOWLEDGEMENT - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The following units failed: session-c340.scope,session-c38.scope,session-c386.scope,session-c42.scope,session-c430.scope,session-c432.scope,session-c435.scope,session-c441.scope,session-c443.scope,session-c471.scope,session-c476.scope,session-c60.scope,session-c67.scope,session-c68.scope,session-c69.scope,session-c71.scope,session-c77.scope MVernon nodes scheduled for decomm https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:59] ACKNOWLEDGEMENT - Disk space on ms-be2039 is CRITICAL: DISK CRITICAL - free space: / 1964 MB (3% inode=89%): /tmp 1964 MB (3% inode=89%): /var/tmp 1964 MB (3% inode=89%): MVernon nodes scheduled for decomm https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2039&var-datasource=codfw+prometheus/ops [15:09:25] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:13:32] 10SRE, 10Traffic, 10Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (10AndyRussG) Hmmm one more note... so, another option would be to add the includer inside banner content, rather than the base HTML. (Maybe that's what you meant by "havin... [15:34:13] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:44:39] PROBLEM - SSH on db1098.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:51:02] (03CR) 10Ottomata: Deploy Spark 3 conf and debian pkg to test cluster (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [16:13:58] (03CR) 10Hashar: [C: 04-2] "https://gerrit-review.googlesource.com/c/gerrit/+/345017 got merged upstream in stable-3.4 branch but is obviously not released yet ;-]" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [16:15:54] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:35:25] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:45:51] RECOVERY - SSH on db1098.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:19:30] (03PS1) 10Zabe: Remove deployment-db08 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833830 (https://phabricator.wikimedia.org/T318126) [17:25:28] (03CR) 10Urbanecm: "this will go out later today" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819054 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [17:26:09] (03PS1) 10Esanders: Enable history page visual diffs on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833831 [17:26:18] (03PS2) 10Urbanecm: Growth: Switch pilot wikis to structured mentor list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819054 (https://phabricator.wikimedia.org/T310905) [17:26:53] (03CR) 10CI reject: [V: 04-1] Enable history page visual diffs on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833831 (owner: 10Esanders) [17:27:47] (03PS2) 10Esanders: Enable history page visual diffs on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833831 [17:29:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:44:34] (03PS2) 10Ebernhardson: cirrus: Limit shard count to 1 in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833463 (https://phabricator.wikimedia.org/T316711) [17:45:06] (03CR) 10CI reject: [V: 04-1] cirrus: Limit shard count to 1 in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833463 (https://phabricator.wikimedia.org/T316711) (owner: 10Ebernhardson) [17:54:39] (03PS3) 10Ebernhardson: cirrus: Limit shard count to 1 in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833463 (https://phabricator.wikimedia.org/T316711) 
[18:00:05] jnuche and dancy: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T1800). [18:00:05] jnuche and dancy: Time to snap out of that daydream and deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T1800). [18:00:13] o/ [18:00:18] Pressing the button [18:00:53] Not pressing the button. group1 is already at 1.40.0-wmf.2 [18:01:00] (03CR) 10Sbailey: "Enabling dark launch Linter write of namespace and tag and template field code during recordLintJob on test2wiki. Please confirm this is t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) (owner: 10Sbailey) [18:14:24] (03PS4) 10Samtar: [DNM] rewrite.py: changes for Phonos deployment [puppet] - 10https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: 10MusikAnimal) [18:15:46] (03PS5) 10Samtar: rewrite.py: changes for Phonos deployment [puppet] - 10https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: 10MusikAnimal) [18:18:56] (03CR) 10Samtar: "Just noting that I tried to cherry pick from `production` to `production` and ended up rebasing (?) 
this inadvertently — I'm not entirely " [puppet] - 10https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: 10MusikAnimal) [18:21:05] (03PS1) 10DLynch: Enable DiscussionTools visual enhancements as beta on en/dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833837 (https://phabricator.wikimedia.org/T315625) [18:33:44] (03PS1) 10Aqu: WIP Puppet test [puppet] - 10https://gerrit.wikimedia.org/r/833842 [18:34:37] (03CR) 10CI reject: [V: 04-1] WIP Puppet test [puppet] - 10https://gerrit.wikimedia.org/r/833842 (owner: 10Aqu) [18:37:13] (03PS2) 10Aqu: WIP Puppet test [puppet] - 10https://gerrit.wikimedia.org/r/833842 (https://phabricator.wikimedia.org/T312882) [18:38:44] !log nokafor@deploy1002 Started deploy [analytics/refinery@91d0cf8]: Regular analytics weekly train [analytics/refinery@91d0cf8] [18:44:24] !log nokafor@deploy1002 Finished deploy [analytics/refinery@91d0cf8]: Regular analytics weekly train [analytics/refinery@91d0cf8] (duration: 05m 40s) [18:50:49] jouncebot: nowandnext [18:50:49] For the next 0 hour(s) and 9 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T1800) [18:50:50] For the next 1 hour(s) and 9 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T1800) [18:50:50] In 1 hour(s) and 9 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T2000) [18:51:20] dancy: since the train seems to be done, may i ship something? or should i wait for later today? [18:51:40] The train is done. The deploy server is all yours [18:51:43] thanks! 
[18:51:50] (03CR) 10Urbanecm: [C: 03+2] Growth: Switch pilot wikis to structured mentor list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819054 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [18:53:46] (03Merged) 10jenkins-bot: Growth: Switch pilot wikis to structured mentor list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819054 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [18:55:28] !log nokafor@deploy1002 Started deploy [analytics/refinery@91d0cf8] (thin): Regular analytics weekly train THIN [analytics/refinery@91d0cf8] [18:55:36] !log nokafor@deploy1002 Finished deploy [analytics/refinery@91d0cf8] (thin): Regular analytics weekly train THIN [analytics/refinery@91d0cf8] (duration: 00m 08s) [18:56:45] (03PS1) 10Urbanecm: Growth: Do not switch eswiki to structured mentor list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833843 (https://phabricator.wikimedia.org/T310905) [18:56:57] (03CR) 10Urbanecm: [C: 03+2] Growth: Do not switch eswiki to structured mentor list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833843 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [18:56:59] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/833844 [18:57:55] (03Merged) 10jenkins-bot: Growth: Do not switch eswiki to structured mentor list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833843 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [19:00:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:01:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:01:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:02:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:04:20] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 
b8b2ebd3933cb891b62bb6aea01b2342c017cec8: Growth: Switch pilot wikis to structured mentor list (T310905) (duration: 03m 59s) [19:04:24] T310905: Deploy structured wikitext mentor list to Wikimedia wikis - https://phabricator.wikimedia.org/T310905 [19:05:19] done [19:07:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:08:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:08:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:09:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:27:40] (03PS1) 10BCornwall: Add latency measurement program [software/latency-measurement] - 10https://gerrit.wikimedia.org/r/833848 (https://phabricator.wikimedia.org/T315536) [19:29:07] (03Abandoned) 10BCornwall: utils: Add latency measurement program [dns] - 10https://gerrit.wikimedia.org/r/824452 (https://phabricator.wikimedia.org/T315536) (owner: 10MMandere) [19:30:02] Hi! There is a small regression on Wikisource (the proofreading progress indicator ends up having a 0 width). I have pushed a fix for review: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/832705 [19:30:02] A +2 would be very welcomed to be able to push it as part of the next window. 
[19:33:21] !log nokafor@deploy1002 Started deploy [airflow-dags/analytics@ce20ecd]: (no justification provided) [19:33:31] !log nokafor@deploy1002 Finished deploy [airflow-dags/analytics@ce20ecd]: (no justification provided) (duration: 00m 10s) [19:35:35] 10SRE, 10Traffic, 10Patch-For-Review: Create program to interact with Atlas RIPE API - https://phabricator.wikimedia.org/T315536 (10BCornwall) a:03BCornwall [19:36:19] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure: deployment-ms-be05 Swift object-replicator sync error: Connection refused - https://phabricator.wikimedia.org/T318268 (10TheresNoTime) [19:39:03] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:39:56] +2'ed [19:41:39] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure: deployment-ms-be05 Swift object-replicator sync error: Connection refused - https://phabricator.wikimedia.org/T318268 (10TheresNoTime) I added ` Ingress IPv4 TCP 6000 172.16.0.0/21 ` to the `swift-be` [[ https://horizon.wikimedia.org/project/security_grou... [19:44:17] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure: deployment-ms-be05 Swift object-replicator sync error: Connection refused - https://phabricator.wikimedia.org/T318268 (10TheresNoTime) 05Open→03Resolved a:03TheresNoTime Crossing fingers and assuming that's all it was (**//why??//**) [19:50:49] (03CR) 10Zabe: [C: 03+1] scap.cfg.erb: Set initial value of beta_only_config_files [puppet] - 10https://gerrit.wikimedia.org/r/833455 (https://phabricator.wikimedia.org/T317242) (owner: 10Ahmon Dancy) [19:55:58] (03PS1) 10BCornwall: Convert camel-case function names to snake case [software/latency-measurement] - 10https://gerrit.wikimedia.org/r/833851 [20:00:05] RoanKattouw, Urbanecm, cjming, and TheresNoTime: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220921T2000). 
Please do the needful. [20:00:05] zabe and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:20] o/ [20:00:28] 👋🏻 [20:00:30] hiya [20:00:59] I can deploy! Gimme a sec, just closing a million tabs [20:01:09] silly me, i put my patch in the earlier deploy window, i have one too (it's deployment-prep only, no-op in prod) [20:01:54] no worries ebernhardson ! [20:02:40] I'll do yours first zabe :) see how much it breaks /s [20:03:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833830 (https://phabricator.wikimedia.org/T318126) (owner: 10Zabe) [20:03:55] ;) [20:03:59] (03Merged) 10jenkins-bot: Remove deployment-db08 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833830 (https://phabricator.wikimedia.org/T318126) (owner: 10Zabe) [20:04:12] there should actually be no real chance of stuff breaking since deployment-db08 is already depooled [20:04:30] !log samtar@deploy1002 Started scap: Backport for [[gerrit:833830|Remove deployment-db08 (T318126)]] [20:04:34] T318126: Migrate deployment-prep db hosts to bullseye - https://phabricator.wikimedia.org/T318126 [20:04:55] !log samtar@deploy1002 samtar and zabe: Backport for [[gerrit:833830|Remove deployment-db08 (T318126)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:05:24] zabe: going to just go ahead and sync [20:05:47] yup [20:06:14] (03PS2) 10Samtar: Enable DiscussionTools visual enhancements as beta on en/dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833837 (https://phabricator.wikimedia.org/T315625) (owner: 10DLynch) [20:09:46] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:833830|Remove deployment-db08 (T318126)]] (duration: 05m 16s) [20:09:50] T318126: 
Migrate deployment-prep db hosts to bullseye - https://phabricator.wikimedia.org/T318126 [20:10:06] Done :) Kemayo: you're up next :) [20:10:07] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:10:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:10:22] TheresNoTime: Exciting! [20:10:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833837 (https://phabricator.wikimedia.org/T315625) (owner: 10DLynch) [20:11:15] (03Merged) 10jenkins-bot: Enable DiscussionTools visual enhancements as beta on en/dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833837 (https://phabricator.wikimedia.org/T315625) (owner: 10DLynch) [20:11:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:11:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:11:41] !log samtar@deploy1002 Started scap: Backport for [[gerrit:833837|Enable DiscussionTools visual enhancements as beta on en/dewiki (T315625)]] [20:11:45] T315625: [Config Change] Enable Topic Containers as beta feature at Phase 2 wikis (desktop) - https://phabricator.wikimedia.org/T315625 [20:12:05] !log samtar@deploy1002 samtar and kemayo: Backport for [[gerrit:833837|Enable DiscussionTools visual enhancements as beta on en/dewiki (T315625)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:12:26] Kemayo: live on mwdebug1001 if you could test please :) [20:12:55] TheresNoTime: Looks good [20:13:04] syncing [20:14:41] zabe: FYI seeing `Wikimedia\Rdbms\LoadMonitor::computeServerStates: host deployment-db08 is not replicating?` errors [20:15:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mwdebug: apply [20:15:11] (from `deployment-jobrunner04`) [20:15:31] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:15:38] hmm [20:15:54] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [20:16:21] and `Server deployment-db08 has 25 seconds of lag (>= 6)` which is perhaps a little inaccurate :p [20:16:31] I just disabled replication there [20:16:50] but it should no longer be used [20:17:13] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:833837|Enable DiscussionTools visual enhancements as beta on en/dewiki (T315625)]] (duration: 05m 31s) [20:17:17] T315625: [Config Change] Enable Topic Containers as beta feature at Phase 2 wikis (desktop) - https://phabricator.wikimedia.org/T315625 [20:17:57] (03PS1) 10Gergő Tisza: Block metrics: Bump schema to un-require some fields [extensions/WikimediaEvents] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/833809 (https://phabricator.wikimedia.org/T317343) [20:18:08] ebernhardson: up next, beta only right? 
[20:18:12] (03PS1) 10Gergő Tisza: Block metrics: Bump schema to un-require some fields [extensions/WikimediaEvents] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/833810 (https://phabricator.wikimedia.org/T317343) [20:18:22] (03PS4) 10Samtar: cirrus: Limit shard count to 1 in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833463 (https://phabricator.wikimedia.org/T316711) (owner: 10Ebernhardson) [20:18:24] TheresNoTime: yups, it's just the InitialiseSettings-labs.php file [20:19:26] 10SRE, 10Data Pipelines, 10Infrastructure-Foundations, 10puppet-compiler: Systematic PCC error - https://phabricator.wikimedia.org/T318281 (10Antoine_Quhen) [20:19:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833463 (https://phabricator.wikimedia.org/T316711) (owner: 10Ebernhardson) [20:20:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:20:23] (03Merged) 10jenkins-bot: cirrus: Limit shard count to 1 in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833463 (https://phabricator.wikimedia.org/T316711) (owner: 10Ebernhardson) [20:20:49] !log samtar@deploy1002 Started scap: Backport for [[gerrit:833463|cirrus: Limit shard count to 1 in deployment-prep (T316711)]] [20:20:52] T316711: Reduce shard count on all wikis in beta cluster to 1 - https://phabricator.wikimedia.org/T316711 [20:21:12] !log samtar@deploy1002 samtar and ebernhardson: Backport for [[gerrit:833463|cirrus: Limit shard count to 1 in deployment-prep (T316711)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:21:21] (syncing) [20:21:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:21:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:22:01] PROBLEM - Check 
systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:23:01] I have some last-minute additions to the backport window. Can deploy myself if preferred. [20:24:09] tgr_: I don't mind deploying :) [20:24:42] TheresNoTime: thanks! [20:24:51] ebernhardson: no worries :) [20:24:58] Thanks TheresNoTime ! Just EventGate schema version changes, don't need testing. [20:25:09] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:833463|cirrus: Limit shard count to 1 in deployment-prep (T316711)]] (duration: 04m 19s) [20:25:17] Awesome, starting now :) [20:26:04] 10SRE, 10Data Pipelines, 10Infrastructure-Foundations, 10puppet-compiler: Systematic PCC error - https://phabricator.wikimedia.org/T318281 (10RLazarus) I was wondering if https://gerrit.wikimedia.org/r/c/operations/puppet/+/831230 might be related but I couldn't work out exactly what's going on here. As @... [20:26:41] tgr_: ah hm, this would be a first for me as there's dependencies on this deploy (I think?) — would you mind self-deploying? [20:27:22] Sure. 
[20:27:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:28:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:28:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:28:48] (03CR) 10Gergő Tisza: [C: 03+2] Block metrics: Bump schema to un-require some fields [extensions/WikimediaEvents] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/833809 (https://phabricator.wikimedia.org/T317343) (owner: 10Gergő Tisza) [20:28:51] (03CR) 10Gergő Tisza: [C: 03+2] Block metrics: Bump schema to un-require some fields [extensions/WikimediaEvents] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/833810 (https://phabricator.wikimedia.org/T317343) (owner: 10Gergő Tisza) [20:29:01] (is the process exactly the same?) [20:29:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:30:32] in this case the dependency doesn't really matter, it's to the schemas/event repo and that has its own deployment process (which runs automatically on commit) [20:30:48] (03Merged) 10jenkins-bot: Block metrics: Bump schema to un-require some fields [extensions/WikimediaEvents] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/833809 (https://phabricator.wikimedia.org/T317343) (owner: 10Gergő Tisza) [20:30:50] (03Merged) 10jenkins-bot: Block metrics: Bump schema to un-require some fields [extensions/WikimediaEvents] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/833810 (https://phabricator.wikimedia.org/T317343) (owner: 10Gergő Tisza) [20:30:50] usually, the process is the same, you just need to make sure to deploy the dependency first [20:31:01] ah! 
thank you :) [20:31:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:36:53] !log tgr@deploy1002 Synchronized php-1.40.0-wmf.1/extensions/WikimediaEvents/includes/BlockMetrics/BlockMetricsHooks.php: Backport: [[gerrit:833809|Block metrics: Bump schema to un-require some fields (T317343)]] (duration: 03m 55s) [20:36:57] T317343: Eventgate error: '' should have required property 'database', '' should have required property 'performer' - https://phabricator.wikimedia.org/T317343 [20:39:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:43:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:44:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:44:58] !log tgr@deploy1002 Synchronized php-1.40.0-wmf.2/extensions/WikimediaEvents/includes/BlockMetrics/BlockMetricsHooks.php: Backport: [[gerrit:833810|Block metrics: Bump schema to un-require some fields (T317343)]] (duration: 03m 42s) [20:45:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:45:01] T317343: Eventgate error: '' should have required property 'database', '' should have required property 'performer' - https://phabricator.wikimedia.org/T317343 [20:46:37] !log UTC late deploys done [20:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:07] 10SRE, 10InternetArchiveBot: Request for increase request limit for InternetArchiveBot - https://phabricator.wikimedia.org/T318284 (10Harej) [20:49:22] 10SRE, 10InternetArchiveBot: Request for increase request limit for InternetArchiveBot - https://phabricator.wikimedia.org/T318284 (10Harej) [20:50:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:50:44] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:50:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:51:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:05:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET clusterinformations) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:10:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET clusterinformations) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:17:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET clusterinformations) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:22:28] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET clusterinformations) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:28:59] (03PS1) 10BCornwall: readme: Cleanups, clarifications, and typo fixes [software/latency-measurement] - 10https://gerrit.wikimedia.org/r/833855 [21:31:38] (03PS1) 10Jbond: puppet_compiler: fix wmcs environment name [puppet] - 10https://gerrit.wikimedia.org/r/833856 (https://phabricator.wikimedia.org/T318281) [21:32:41] (03CR) 10Jbond: [C: 03+2] puppet_compiler: fix wmcs environment name [puppet] - 10https://gerrit.wikimedia.org/r/833856 (https://phabricator.wikimedia.org/T318281) (owner: 10Jbond) [21:39:27] 10SRE, 10Data Pipelines, 10Infrastructure-Foundations, 10puppet-compiler, 10Patch-For-Review: Systematic PCC error - https://phabricator.wikimedia.org/T318281 
(10jbond) >>! In T318281#8251715, @RLazarus wrote: > I was wondering if https://gerrit.wikimedia.org/r/c/operations/puppet/+/831230 might be relat... [21:49:25] (03CR) 10Samtar: [C: 03+1] scap.cfg.erb: Set initial value of beta_only_config_files [puppet] - 10https://gerrit.wikimedia.org/r/833455 (https://phabricator.wikimedia.org/T317242) (owner: 10Ahmon Dancy) [22:18:39] (03PS1) 10Legoktm: Revert "admin: Temporarily disable legoktm's access" [puppet] - 10https://gerrit.wikimedia.org/r/833812 [22:18:55] (03PS2) 10Legoktm: Revert "admin: Temporarily disable legoktm's access" [puppet] - 10https://gerrit.wikimedia.org/r/833812 [22:34:11] (03PS1) 10Ryan Kemper: elastic: rebalance enwiki_content shard counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833860 (https://phabricator.wikimedia.org/T318270) [22:34:13] (03PS1) 10Ryan Kemper: elastic: allow only 1 enwiki_content per host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T318270) [22:35:00] (03CR) 10CI reject: [V: 04-1] elastic: allow only 1 enwiki_content per host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T318270) (owner: 10Ryan Kemper) [22:47:51] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:48:17] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:58:19] (03PS2) 10Ryan Kemper: elastic: rebalance enwiki_content shard counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833860 (https://phabricator.wikimedia.org/T318270) [22:58:21] (03PS2) 10Ryan Kemper: elastic: allow only 1 enwiki_content per host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T318270) [22:59:05] (03PS3) 10Ryan 
Kemper: elastic: rebalance enwiki_content shard counts [mediawiki-config] - https://gerrit.wikimedia.org/r/833860 (https://phabricator.wikimedia.org/T318270) [22:59:07] (PS3) Ryan Kemper: elastic: allow only 1 enwiki_content per host [mediawiki-config] - https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T318270) [23:20:21] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:23:27] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:23:27] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:24:29] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:25:04] (CR) Arlolra: Enable Linter write of namespace tag and template fields on test2wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) (owner: Sbailey) [23:25:27] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:31:24] (CR) Arlolra: Enable Linter write of namespace tag and template fields on test2wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) (owner: Sbailey) [23:36:20] (CR) RLazarus: [C: +2] "Verified this is the same key that was removed in the original commit, so I'm forgoing the usual video chat to confirm identity."
[puppet] - https://gerrit.wikimedia.org/r/833812 (owner: Legoktm) [23:41:51] legoktm: wb :) merged, and ran puppet manually on the bastions, feel free to test if you're around -- I'll let the other hosts update normally [23:44:39] (PS1) Zabe: beta: Add deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/833865 (https://phabricator.wikimedia.org/T318126) [23:44:41] (PS1) Zabe: beta: Pool deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/833866 (https://phabricator.wikimedia.org/T318126) [23:45:54] (PS2) Zabe: beta: Pool deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/833866 (https://phabricator.wikimedia.org/T318126)