[00:38:10] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1102450 [00:38:10] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1102450 (owner: 10TrainBranchBot) [00:56:40] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1102450 (owner: 10TrainBranchBot) [01:08:12] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1102457 [01:08:12] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1102457 (owner: 10TrainBranchBot) [01:13:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [01:19:44] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [01:19:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [01:26:46] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[01:26:46] Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [01:27:26] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1102457 (owner: 10TrainBranchBot) [02:09:30] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:09:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10399535 (10phaultfinder) [05:13:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [05:19:44] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [05:19:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [05:26:46] FIRING: KubernetesDeploymentUnavailableReplicas: ... [05:26:46] Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [06:09:30] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:30:07] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 13497MiB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [06:31:07] (03CR) 10KartikMistry: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102283 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T0700) [07:00:05] marostegui, Amir1, and arnaudb: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T0700). 
[07:04:57] 06SRE: (blank/unknown) Error attempting to create accounts on checkuserwiki - https://phabricator.wikimedia.org/T382048 (10JJMC89) 03NEW [07:18:31] (03PS1) 10Slyngshede: C:ldap::management default mfa to webauthn [puppet] - 10https://gerrit.wikimedia.org/r/1102699 [07:21:17] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:22:07] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:37:39] 06SRE, 10Bitu, 06Infrastructure-Foundations: Bitu: Permission request state isn't refreshed if access has been revoked - https://phabricator.wikimedia.org/T382051 (10MoritzMuehlenhoff) 03NEW [07:44:17] (03PS1) 10Jelto: Rename kubernetes20[36-39] to wikikube-worker20(47|66|85|86) [puppet] - 10https://gerrit.wikimedia.org/r/1102710 (https://phabricator.wikimedia.org/T379788) [07:46:26] (03CR) 10Jelto: "The `add_k8s_node.py` script would re-use the wikikube-worker ids from the decommissioning in T379788. 
I guess we don't want to use the id" [puppet] - 10https://gerrit.wikimedia.org/r/1102710 (https://phabricator.wikimedia.org/T379788) (owner: 10Jelto) [07:46:39] (03PS3) 10Abijeet Patro: Translate: Enable message group subscription for 7 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102283 (https://phabricator.wikimedia.org/T372386) [07:51:18] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1102699 (owner: 10Slyngshede) [07:52:14] (03CR) 10Slyngshede: [C:03+2] C:ldap::management default mfa to webauthn [puppet] - 10https://gerrit.wikimedia.org/r/1102699 (owner: 10Slyngshede) [07:52:55] !log installing upx-ucl security updates [07:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:25] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [07:56:28] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [08:00:05] Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T0800). [08:00:05] abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:00:05] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[08:00:13] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[08:00:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on dbproxy1023.eqiad.wmnet with reason: maintenance
[08:00:51] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[08:00:53] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[08:00:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy1023.eqiad.wmnet with reason: maintenance
[08:02:16] (PS1) Marostegui: installserver: Do not reimage es2045 [puppet] - https://gerrit.wikimedia.org/r/1102723
[08:02:27] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[08:02:30] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[08:02:37] abijeet: here?
[08:03:17] kart_, yup
[08:03:41] I can deploy your patch. Going ahead. It had CI failure early, but recheck fixed it.
[08:04:55] kart_, thanks
[08:05:38] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1102283 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[08:06:26] (Merged) jenkins-bot: Translate: Enable message group subscription for 7 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1102283 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[08:07:13] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1102283|Translate: Enable message group subscription for 7 wikis (T372386)]]
[08:07:17] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[08:09:58] (CR) Muehlenhoff: Inform users that their permission request have been approved/rejected (4 comments) [software/bitu] - https://gerrit.wikimedia.org/r/1101894 (owner: Slyngshede)
[08:11:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on dbproxy1024.eqiad.wmnet with reason: maintenance
[08:11:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy1024.eqiad.wmnet with reason: maintenance
[08:12:15] !log kartik@deploy2002 kartik, abi: Backport for [[gerrit:1102283|Translate: Enable message group subscription for 7 wikis (T372386)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:12:18] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[08:12:35] (PS2) Slyngshede: Inform users that their permission request have been approved/rejected [software/bitu] - https://gerrit.wikimedia.org/r/1101894
[08:12:39] (CR) Marostegui: [C:+2] installserver: Do not reimage es2045 [puppet] - https://gerrit.wikimedia.org/r/1102723 (owner: Marostegui)
[08:14:16] abijeet: Please test!
[08:14:33] 06SRE: blank 429 error attempting to create accounts on checkuserwiki - https://phabricator.wikimedia.org/T382048#10399776 (10JJMC89) [08:14:42] kart_, ok, testing [08:16:26] 06SRE: blank 429 error attempting to create accounts on checkuserwiki - https://phabricator.wikimedia.org/T382048#10399777 (10JJMC89) We should not be getting a 429 error (with no content) after a one (or two) accounts. [08:18:30] not sure what the right tags for ^ are but assistance with getting the right people to look at it is welcome [08:19:13] kart_, tested on mwdebug2001 server. looks ok. [08:20:58] Nice! [08:21:09] !log kartik@deploy2002 kartik, abi: Continuing with sync [08:21:23] (03PS1) 10Brouberol: mw-content-history-reconcile-enrich: add missing s3 configuration keys [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102729 (https://phabricator.wikimedia.org/T375176) [08:22:04] (03CR) 10JMeybohm: "Hm...yeah. Maybe wise to wait with reusing those until dc-ops has finished the decom." [puppet] - 10https://gerrit.wikimedia.org/r/1102710 (https://phabricator.wikimedia.org/T379788) (owner: 10Jelto) [08:22:44] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, two nits inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/1101894 (owner: 10Slyngshede) [08:23:43] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [08:23:46] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [08:26:17] (03PS3) 10Slyngshede: Inform users that their permission request have been approved/rejected [software/bitu] - 10https://gerrit.wikimedia.org/r/1101894 [08:26:27] (03CR) 10Slyngshede: Inform users that their permission request have been approved/rejected (036 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1101894 (owner: 10Slyngshede) [08:27:19] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102283|Translate: 
Enable message group subscription for 7 wikis (T372386)]] (duration: 20m 05s) [08:27:22] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [08:32:07] (03CR) 10Slyngshede: [C:03+2] Inform users that their permission request have been approved/rejected [software/bitu] - 10https://gerrit.wikimedia.org/r/1101894 (owner: 10Slyngshede) [08:37:37] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1101894 (owner: 10Slyngshede) [08:39:23] (03Merged) 10jenkins-bot: Inform users that their permission request have been approved/rejected [software/bitu] - 10https://gerrit.wikimedia.org/r/1101894 (owner: 10Slyngshede) [08:43:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on dbproxy1026.eqiad.wmnet with reason: maintenance [08:43:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy1026.eqiad.wmnet with reason: maintenance [08:54:22] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [08:54:25] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [09:01:13] (03CR) 10Gmodena: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1102729 (https://phabricator.wikimedia.org/T375176) (owner: 10Brouberol) [09:01:26] (03CR) 10Brouberol: [C:03+2] mw-content-history-reconcile-enrich: add missing s3 configuration keys [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102729 (https://phabricator.wikimedia.org/T375176) (owner: 10Brouberol) [09:02:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on dbproxy1027.eqiad.wmnet with reason: maintenance [09:02:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy1027.eqiad.wmnet with reason: maintenance [09:04:56] (03PS1) 10KartikMistry: Enable the Contribute menu in 5th group of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102733 (https://phabricator.wikimedia.org/T380928) [09:07:47] (03CR) 10Gmodena: rdf-streaming-updater: add wdqs udpater streams in event stream config (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099727 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [09:13:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:17:33] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [09:17:38] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [09:18:48] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [09:18:53] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [09:19:44] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [09:19:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [09:22:37] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [09:22:42] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [09:26:46] FIRING: KubernetesDeploymentUnavailableReplicas: ... [09:26:46] Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [09:34:30] (03CR) 10Volans: "General approach LGTM, couple of nits inline. I'll leave the review of the specific k8s logic to your team." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [09:36:48] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1102213 (owner: 10Slyngshede) [09:39:07] (03CR) 10Slyngshede: [C:03+2] Release v0.1.4 [software/bitu] - 10https://gerrit.wikimedia.org/r/1102213 (owner: 10Slyngshede) [09:43:08] (03Merged) 10jenkins-bot: Release v0.1.4 [software/bitu] - 10https://gerrit.wikimedia.org/r/1102213 (owner: 10Slyngshede) [09:48:40] (03PS1) 10Brouberol: flink-operator: add mw-content-history-reconcile-enrich-next namespaces to the list of tenant ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102741 (https://phabricator.wikimedia.org/T381322) [09:59:08] (03CR) 10Gmodena: [C:03+1] flink-operator: add mw-content-history-reconcile-enrich-next namespaces to the list of tenant ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102741 (https://phabricator.wikimedia.org/T381322) (owner: 10Brouberol) [09:59:13] (03CR) 10Brouberol: [C:03+2] flink-operator: add mw-content-history-reconcile-enrich-next namespaces to the list of tenant ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102741 (https://phabricator.wikimedia.org/T381322) (owner: 10Brouberol) [09:59:27] 06SRE, 10SRE-swift-storage: ms backend hardware refresh for 24/25 - https://phabricator.wikimedia.org/T382056 (10MatthewVernon) 03NEW [10:00:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:00:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:06:46] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... 
[10:06:46] Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [10:09:30] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:11:59] (03PS1) 10Elukey: Add maps-master{eqiad,codfw} among the postgres dst nets [puppet] - 10https://gerrit.wikimedia.org/r/1102744 [10:15:00] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4671/co" [puppet] - 10https://gerrit.wikimedia.org/r/1102744 (owner: 10Elukey) [10:21:13] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers kubernetes2046.codfw.wmnet, wikikube-worker2170.codfw.wmnet, wikikube-worker2021.codfw.wmnet, wikikube-worker2141.codfw.wmnet, wikikube-worker2174.codfw.wmnet, wikikube-worker2120.codfw.wmnet, wikikube-worker2063.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2172.codfw.wmnet [10:21:13] .codfw.wmnet, wikikube-worker2026.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, wikikube-worker2155.codfw.wmnet, kubernetes2052.codfw.wmnet, wikikube-worker2136.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2077.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2132.codfw.wmnet, wikikube-worker2083.codfw.wmnet, 
wikikube-worker2165.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker20 [10:21:13] .wmnet, mw2351.codfw.wmnet, wikikube-worker2092.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2157.codfw.wmnet, wikikube-worker2030.codfw.wmnet, mw2352.codfw.wmnet, wikik https://wikitech.wikimedia.org/wiki/PyBal [10:21:35] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1102744 (owner: 10Elukey) [10:22:10] FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:22:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker2021.codfw.wmnet, wikikube-worker2079.codfw.wmnet, wikikube-worker2033.codfw.wmnet, wikikube-worker2120.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2063.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2172.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2026.codfw.wmnet, mw2370.codf [10:22:13] wikikube-worker2136.codfw.wmnet, wikikube-worker2185.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2076.codfw.wmnet, wikikube-worker2010.codfw.wmnet, wikikube-worker2092.codfw.wmnet, wikikube-worker2007.codfw.wmnet, wikikube-worker2130.codfw.wmnet, wikikube-worker2157.codfw.wmnet, wikikube-worker2030.codfw.wmnet, mw2419.codfw.wmnet, wikikube-worker2097.codfw.wmnet, mw2371.codfw.wmnet, wikikube-worker2002.codfw.wmnet, wikik [10:22:13] er2090.codfw.wmnet, kubernetes2039.codfw.wmnet, wikikube-worker2114.codfw.wmnet, wikikube-worker2062.codfw.wmnet, wikikube-worker2164.codfw.wmnet, wikikube-worker2123.codfw.wmnet, wikik https://wikitech.wikimedia.org/wiki/PyBal [10:22:37] (03PS1) 10Slyngshede: IDM update to Bitu 0.1.4 [dns] - 
10https://gerrit.wikimedia.org/r/1102762 [10:23:13] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:23:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:24:15] RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:26:08] (03PS1) 10MVernon: swift: add new storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/1102763 (https://phabricator.wikimedia.org/T382056) [10:26:09] (03PS1) 10MVernon: swift: add new nodes, drain old nodes from the rings [puppet] - 10https://gerrit.wikimedia.org/r/1102764 (https://phabricator.wikimedia.org/T382056) [10:26:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers kubernetes2046.codfw.wmnet, wikikube-worker2021.codfw.wmnet, wikikube-worker2141.codfw.wmnet, wikikube-worker2120.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2026.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmn [10:26:13] rnetes2052.codfw.wmnet, wikikube-worker2150.codfw.wmnet, wikikube-worker2136.codfw.wmnet, wikikube-worker2185.codfw.wmnet, kubernetes2048.codfw.wmnet, wikikube-worker2076.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2071.codfw.wmnet, wikikube-worker2165.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2022.codfw.wmnet, wikikube-worker2130.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2139.codfw.w [10:26:13] 2352.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2096.codfw.wmnet, 
kubernetes2037.codfw.wmnet, wikikube-worker2125.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2359. https://wikitech.wikimedia.org/wiki/PyBal [10:26:13] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker2050.codfw.wmnet, wikikube-worker2174.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2026.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, wikikube-worker2084.codfw.wmnet, wikikube-worker2171.codfw.wmnet, wikikube-worker2076.codfw.wmnet, wikikube-wo [10:26:13] .codfw.wmnet, wikikube-worker2132.codfw.wmnet, wikikube-worker2083.codfw.wmnet, wikikube-worker2165.codfw.wmnet, wikikube-worker2177.codfw.wmnet, mw2351.codfw.wmnet, wikikube-worker2092.codfw.wmnet, wikikube-worker2161.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2157.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-wor [10:26:13] codfw.wmnet, wikikube-worker2151.codfw.wmnet, wikikube-worker2041.codfw.wmnet, wikikube-worker2159.codfw.wmnet, wikikube-worker2124.codfw.wmnet, wikikube-worker2090.codfw.wmnet, wikikub https://wikitech.wikimedia.org/wiki/PyBal [10:27:22] (03CR) 10Btullis: [C:03+1] "Thanks for the context Jelto. 
We're happy that this is only a trigger mechanism for deploying artifacts, and we are planning to use authen" [puppet] - https://gerrit.wikimedia.org/r/1101925 (https://phabricator.wikimedia.org/T371994) (owner: Aleksandar Mastilovic)
[10:27:33] (CR) Btullis: [C:+1] Add Blunderbuss firewall rule to GitLab runner set (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1101925 (https://phabricator.wikimedia.org/T371994) (owner: Aleksandar Mastilovic)
[10:27:50] (CR) Btullis: [V:+1 C:+2] [dumps] Increase the lbzip2 thread count for large wikis [puppet] - https://gerrit.wikimedia.org/r/1100498 (https://phabricator.wikimedia.org/T380729) (owner: Btullis)
[10:28:57] FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:29:27] checking
[10:29:49] (CR) Btullis: [C:+1] airflow: enable the support of multiple executors [deployment-charts] - https://gerrit.wikimedia.org/r/1102312 (https://phabricator.wikimedia.org/T362788) (owner: Brouberol)
[10:30:25] fabfur: I've acked the page FYI
[10:30:56] (CR) Volans: [C:+1] "LGTM" [software/homer/deploy] - https://gerrit.wikimedia.org/r/1101903 (https://phabricator.wikimedia.org/T381873) (owner: Cathal Mooney)
[10:31:24] (CR) Btullis: [C:+1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/1087575 (https://phabricator.wikimedia.org/T276465) (owner: Dzahn)
[10:31:57] (CR) Btullis: [C:+1] yarn: Restrict access to Envoy port [puppet] - https://gerrit.wikimedia.org/r/1100476 (owner: Muehlenhoff)
[10:32:05] mmhh looks like overload? https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&refresh=1m&from=now-1h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox-video&var-release=main
[10:32:21] (CR) Cathal Mooney: [C:+2] Update JunOS templates to use VRRP priority exposed from Netbox [homer/public] - https://gerrit.wikimedia.org/r/1101861 (https://phabricator.wikimedia.org/T381873) (owner: Cathal Mooney)
[10:33:04] yeah, seems like a bunch of transcodes got scheduled maybe
[10:33:17] hnowlan: you around?
[10:33:36] we could scale it up maybe
[10:33:37] (Merged) jenkins-bot: Update JunOS templates to use VRRP priority exposed from Netbox [homer/public] - https://gerrit.wikimedia.org/r/1101861 (https://phabricator.wikimedia.org/T381873) (owner: Cathal Mooney)
[10:33:57] RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:34:09] claime: might be worth a try yeah, nudge it a little bit
[10:34:15] FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:34:30] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[10:34:35] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[10:34:37] shellbox-video ate too much in preparation for xmas
[10:35:04] claime: would you mind doing the honors of scaling up ?
[10:35:12] yeah on it [10:35:24] thank you [10:36:26] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [10:36:31] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [10:38:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:38:13] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:38:34] (03PS1) 10Clément Goubert: shellbox-video: scale up to 60 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102767 [10:39:15] RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:40:39] (03CR) 10Filippo Giunchedi: [C:03+1] shellbox-video: scale up to 60 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102767 (owner: 10Clément Goubert) [10:40:59] what's weird is I'm not seeing a particular uptick in webvideotranscode jobs in jobqueue or in mercurius [10:41:36] well there's an uptick but doesn't seem like it would warrant a complete saturation like that [10:41:43] (03CR) 10Clément Goubert: [C:03+2] shellbox-video: scale up to 60 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102767 (owner: 10Clément Goubert) [10:43:01] interesting, maybe a few video whales [10:43:04] maybe it's just got long-running transcodes [10:43:13] (03Merged) 10jenkins-bot: shellbox-video: scale up to 60 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102767 (owner: 10Clément Goubert) [10:43:38] !log cgoubert@deploy2002 helmfile [eqiad] START 
helmfile.d/services/shellbox-video: apply
[10:43:47] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[10:43:56] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[10:44:04] could be too yeah, there was an increase in rps for sure
[10:44:27] !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: update Homer wmf-plugin - cmooney@cumin1002
[10:48:39] !log cmooney@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: update Homer wmf-plugin - cmooney@cumin1002
[10:49:16] godog: what may happen is that helmfile will consider that the unavailable replicas make the deployment fail
[10:49:30] claime: erk, thanks for handling that
[10:49:35] that.. shouldn't happen
[10:49:36] so I may have to wait for it to roll back, scale up manually, then helmfile apply
[10:49:49] SRE, serviceops: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#10400061 (Krd) Request from :1227:f5ff:fec7:7ec1 via cp3070 cp3070, Varnish XID 926533531 Upstream caches: cp3070 int Error: 429, at Thu, 12 Dec 2024 10:46:40 GMT The same problem appears again and ag...
[10:50:15] (CR) Brouberol: [C:+2] airflow: enable the support of multiple executors [deployment-charts] - https://gerrit.wikimedia.org/r/1102312 (https://phabricator.wikimedia.org/T362788) (owner: Brouberol)
[10:50:51] hnowlan: scaling up basically just made the rps jump up to match the number of new replicas...
[10:51:35] claime: old instances that are tied up will stay tied up so that's not a huge surprise
[10:52:00] RPS doesn't have much of a direct relation to capacity for -video because one request can mean an instance is in use for hours
[10:52:09] there are free healthy instances, that's all we care about for now
[10:52:35] I'm not sure why since mercurius reports barely any jobs, and there's not that many jobs being enqueued, but that may still be enough to tie up all instances
[10:52:52] ack thank you, but yeah the probe recovered even before scaling up
[10:53:01] (still waiting on helmfile to fail btw...)
[10:53:03] I guess shellbox-video burped and then moved on
[10:53:14] I'm tuning concurrency in mercurius downwards
[10:53:45] some of this is also related to mercurius losing track of jobs, which I am merging patches for today hopefully
[10:53:45] hnowlan: ack
[10:54:08] (PS1) Btullis: s.m.a.r.t. - Exclude zram devices from data export [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835)
[10:54:31] (PS1) Hnowlan: mw-videoscaler: lower concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1102784
[10:55:03] (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4672/console" [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[10:55:35] (CR) Btullis: s.m.a.r.t.
- Exclude zram devices from data export [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[10:56:15] (CR) Hnowlan: [C:+2] mw-videoscaler: lower concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1102784 (owner: Hnowlan)
[10:56:17] (CR) Clément Goubert: [C:+1] mw-videoscaler: lower concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1102784 (owner: Hnowlan)
[10:56:29] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 196, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:57:54] (Merged) jenkins-bot: mw-videoscaler: lower concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1102784 (owner: Hnowlan)
[10:57:59] helmfile is rolling back
[10:58:04] (CR) Filippo Giunchedi: s.m.a.r.t. - Exclude zram devices from data export (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[10:58:31] 13m Normal ScalingReplicaSet deployment/shellbox-main Scaled up replica set shellbox-main-5db67dd6fc to 60
[10:58:32] 3m38s Normal ScalingReplicaSet deployment/shellbox-main Scaled down replica set shellbox-main-5db67dd6fc to 48
[10:58:47] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 277, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:59:29] ouch, yeah default helmfile behaviour is understandable, and suboptimal for sure in this case
[10:59:55] yeah, the "availability" for shellbox-video is a bit of a tricky concept
[11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1100)
[11:00:21] indeed
[11:00:22] we *want* the replicas to be unavailable while treating a request
[11:00:46] but that makes scaling up through helmfile a problem when overloaded
[11:01:03] tricky alright in this case
[11:01:10]
!log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[11:01:52] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[11:01:59] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[11:02:28] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[11:02:35] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[11:03:22] !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: update Homer wmf-plugin - cmooney@cumin1002
[11:03:33] (PS2) Btullis: s.m.a.r.t. - Exclude zram devices from data export [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835)
[11:03:44] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[11:03:49] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[11:04:02] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[11:04:13] (CR) CI reject: [V:-1] s.m.a.r.t.
- Exclude zram devices from data export [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[11:04:14] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[11:04:21] (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4673/console" [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[11:04:56] !log cmooney@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: update Homer wmf-plugin - cmooney@cumin1002
[11:06:17] (CR) Cathal Mooney: [C:+2] Expose VRRP group assignment priority to Homer templates [software/homer/deploy] - https://gerrit.wikimedia.org/r/1101903 (https://phabricator.wikimedia.org/T381873) (owner: Cathal Mooney)
[11:07:17] !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: update Homer wmf-plugin - cmooney@cumin1002
[11:07:26] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[11:07:30] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[11:07:50] ok I've scaled it up manually, I've helmfile apply so it's synced up
[11:08:09] thanks!
I've lowered concurrency so once those jobs fail it'll be a lot quieter
[11:08:22] thank you folks, appreciate it
[11:08:33] I *think* I know what caused this (alongside the concurrency being high) so I will have a fix in place in an hour or two
[11:08:51] !log cmooney@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: update Homer wmf-plugin - cmooney@cumin1002
[11:09:15] hnowlan: the brunt of it seems to have passed, looking at the network rx
[11:09:48] my theory is that mercurius lost track of a bunch of jobs at once, and then retried
[11:09:52] but shellbox was still processing them
[11:09:57] hmh
[11:10:24] (PS3) Hnowlan: base: fix pin on base.meta [deployment-charts] - https://gerrit.wikimedia.org/r/1102307
[11:12:28] (PS3) Btullis: s.m.a.r.t. - Exclude zram devices from data export [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835)
[11:12:37] (CR) Hnowlan: [C:+2] "I've updated this patch to do that - thanks for the review!" [deployment-charts] - https://gerrit.wikimedia.org/r/1102307 (owner: Hnowlan)
[11:13:09] (CR) CI reject: [V:-1] s.m.a.r.t.
- Exclude zram devices from data export [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[11:13:15] (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4674/console" [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[11:13:45] (Merged) jenkins-bot: base: fix pin on base.meta [deployment-charts] - https://gerrit.wikimedia.org/r/1102307 (owner: Hnowlan)
[11:14:09] going to lunch
[11:15:51] (CR) Hnowlan: [C:+2] mesh.configuration: add tcp_keepalive/idle_timeout to 1.11.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[11:16:51] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@0e18d4f]: Backfill webrequest actor label hourly 2024 12
[11:19:43] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@0e18d4f]: Backfill webrequest actor label hourly 2024 12 (duration: 02m 52s)
[11:22:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2186.codfw.wmnet with reason: maintenance
[11:22:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2186.codfw.wmnet with reason: maintenance
[11:23:36] (CR) Marostegui: [C:+1] swift: add new storage hosts [puppet] - https://gerrit.wikimedia.org/r/1102763 (https://phabricator.wikimedia.org/T382056) (owner: MVernon)
[11:24:27] (CR) Marostegui: [C:+1] swift: add new nodes, drain old nodes from the rings [puppet] - https://gerrit.wikimedia.org/r/1102764 (https://phabricator.wikimedia.org/T382056) (owner: MVernon)
[11:25:38] (CR) Elukey: [V:+1 C:+2] Add maps-master{eqiad,codfw} among the postgres dst nets [puppet] - https://gerrit.wikimedia.org/r/1102744 (owner: Elukey)
[11:26:49] (PS2) أنون:
[enwikinews] & [hewikinews] & [plwikinews]: Upgrade license to CC BY 4.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1101867 (https://phabricator.wikimedia.org/T381421)
[11:27:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2186.codfw.wmnet with reason: maintenance
[11:27:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2186.codfw.wmnet with reason: maintenance
[11:27:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2187.codfw.wmnet with reason: maintenance
[11:28:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2187.codfw.wmnet with reason: maintenance
[11:29:12] (Merged) jenkins-bot: mesh.configuration: add tcp_keepalive/idle_timeout to 1.11.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[11:29:53] (CR) MVernon: [C:+2] swift: add new storage hosts [puppet] - https://gerrit.wikimedia.org/r/1102763 (https://phabricator.wikimedia.org/T382056) (owner: MVernon)
[11:30:02] ops-codfw, SRE, DC-Ops, decommission-hardware, and 2 others: Decommission kubernetes20[07-14].codfw.wmnet - https://phabricator.wikimedia.org/T379788#10400168 (Clement_Goubert)
[11:32:22] (PS4) Btullis: s.m.a.r.t. - Exclude zram devices from data export [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835)
[11:32:59] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/admin 'sync'.
[11:33:06] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[11:33:14] (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4675/console" [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[11:33:21] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[11:33:37] (CR) Btullis: [V:+1] s.m.a.r.t. - Exclude zram devices from data export (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[11:33:47] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[11:34:42] FIRING: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:35:17] PROBLEM - Host ms-be1084 is DOWN: PING CRITICAL - Packet loss = 100%
[11:35:37] PROBLEM - Host ms-be1087 is DOWN: PING CRITICAL - Packet loss = 100%
[11:35:59] PROBLEM - Host ms-be1083 is DOWN: PING CRITICAL - Packet loss = 100%
[11:36:02] (PS6) Hnowlan: mediawiki: use mesh.configuration 1.11 [deployment-charts] - https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701)
[11:36:15] PROBLEM - Host ms-be1085 is DOWN: PING CRITICAL - Packet loss = 100%
[11:36:15] PROBLEM - Host ms-be1090 is DOWN: PING CRITICAL - Packet loss = 100%
[11:36:16] (PS4) Elukey: services: use external_services for maps read replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1101897
[11:36:16] (PS10) Elukey: charts: Add kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826)
[11:36:16] (PS6) Elukey: admin_ng: add the kartotherian namespace on Wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/1101487
(https://phabricator.wikimedia.org/T216826)
[11:36:16] (PS7) Elukey: services: add helmfile config for Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826)
[11:36:17] PROBLEM - Host ms-be1088 is DOWN: PING CRITICAL - Packet loss = 100%
[11:36:27] PROBLEM - Host ms-be1089 is DOWN: PING CRITICAL - Packet loss = 100%
[11:36:27] PROBLEM - Host ms-be1086 is DOWN: PING CRITICAL - Packet loss = 100%
[11:36:53] RECOVERY - Host ms-be1088 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[11:36:55] RECOVERY - Host ms-be1084 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[11:36:55] RECOVERY - Host ms-be1086 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[11:36:56] RECOVERY - Host ms-be1089 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[11:37:13] RECOVERY - Host ms-be1083 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[11:37:13] RECOVERY - Host ms-be1090 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[11:37:13] RECOVERY - Host ms-be1087 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[11:37:13] RECOVERY - Host ms-be1085 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[11:37:30] (PS1) Slyngshede: P:idm_test adjust idptest-users requirement [puppet] - https://gerrit.wikimedia.org/r/1102793
[11:38:13] (CR) Elukey: "I simplified the change and allowed also the master replica to be contacted, we can review the choice later."
[deployment-charts] - https://gerrit.wikimedia.org/r/1101897 (owner: Elukey)
[11:39:36] (CR) Elukey: services: add helmfile config for Kartotherian (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826) (owner: Elukey)
[11:40:43] PROBLEM - Host ms-be2086 is DOWN: PING CRITICAL - Packet loss = 100%
[11:40:43] PROBLEM - Host ms-be2084 is DOWN: PING CRITICAL - Packet loss = 100%
[11:40:43] PROBLEM - Host ms-be2083 is DOWN: PING CRITICAL - Packet loss = 100%
[11:40:43] (CR) Hnowlan: [C:+2] mediawiki: use mesh.configuration 1.11 [deployment-charts] - https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[11:40:45] PROBLEM - Host ms-be2087 is DOWN: PING CRITICAL - Packet loss = 100%
[11:40:53] PROBLEM - Host ms-be2085 is DOWN: PING CRITICAL - Packet loss = 100%
[11:40:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[11:40:59] PROBLEM - Host ms-be2081 is DOWN: PING CRITICAL - Packet loss = 100%
[11:40:59] (CR) Hnowlan: [C:+2] mediawiki: use mesh.configuration 1.11 (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[11:41:03] PROBLEM - Host ms-be2082 is DOWN: PING CRITICAL - Packet loss = 100%
[11:41:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1154.eqiad.wmnet with reason: maintenance
[11:41:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1154.eqiad.wmnet with reason: maintenance
[11:41:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[11:41:41] RECOVERY - Host ms-be2087 is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms
[11:41:41] RECOVERY - Host ms-be2085 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms
[11:41:41] RECOVERY - Host ms-be2084 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms
[11:41:41] RECOVERY - Host ms-be2082 is UP: PING OK - Packet loss = 0%, RTA = 30.30 ms
[11:41:41] RECOVERY - Host ms-be2086 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms
[11:41:43] RECOVERY - Host ms-be2081 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms
[11:41:47] RECOVERY - Host ms-be2083 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms
[11:42:32] (Merged) jenkins-bot: mediawiki: use mesh.configuration 1.11 [deployment-charts] - https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[11:43:56] (CR) Brouberol: [C:+1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1101897 (owner: Elukey)
[11:44:22] jouncebot: nowandnext
[11:44:22] For the next 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1100)
[11:44:22] In 1 hour(s) and 15 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1300)
[11:44:42] RESOLVED: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:45:26] I'm going to do a sync-world to roll out changes to the mediawiki chart's dependencies in a few minutes, speak now or etc etc
[11:48:46] (CR) MVernon: [C:+2] swift: add new nodes, drain old nodes from the rings [puppet] - https://gerrit.wikimedia.org/r/1102764 (https://phabricator.wikimedia.org/T382056) (owner: MVernon)
[12:01:57] Emperor: just checking, are all those up/down notifications
expected?
[12:03:29] !log hnowlan@deploy2002 Started scap sync-world: syncing changes to mediawiki chart vendor dependencies
[12:04:06] (PS1) Giuseppe Lavagetto: Bugfix [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1102794 (https://phabricator.wikimedia.org/T382062)
[12:04:22] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Bugfix [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1102794 (https://phabricator.wikimedia.org/T382062) (owner: Giuseppe Lavagetto)
[12:06:22] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Bugfixes - oblivian@cumin1002 - T382062"
[12:06:25] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes - oblivian@cumin1002 - T382062
[12:06:59] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes - oblivian@cumin1002 - T382062
[12:07:00] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Bugfixes - oblivian@cumin1002 - T382062"
[12:09:44] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[12:09:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[12:10:48] !log hnowlan@deploy2002 Finished scap sync-world: syncing changes to mediawiki chart vendor dependencies (duration: 09m 30s)
[12:15:27] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[12:15:32] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[12:15:39] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[12:15:52] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[12:18:35] SRE, serviceops: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#10400297 (Clement_Goubert) Hi, Sorry for the delay in responding. Could you describe how you are creating the accounts? Are you using your browser and the `Special:CreateAccount` page, a script, a gadget? This...
[12:18:52] (PS1) Hnowlan: mediawiki: generate names for mercurius jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1102811 (https://phabricator.wikimedia.org/T371700)
[12:20:05] (CR) CI reject: [V:-1] mediawiki: generate names for mercurius jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1102811 (https://phabricator.wikimedia.org/T371700) (owner: Hnowlan)
[12:23:02] (PS1) Máté Szabó: Enable IRS in the Project namespace on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1102813 (https://phabricator.wikimedia.org/T382061)
[12:24:44] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[12:24:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[12:26:54] sre-alert-triage, Data-Platform-SRE (2024.11.30 - 2024.12.20),
Patch-For-Review: Exclude zram devices from disk health checks - https://phabricator.wikimedia.org/T380835#10400317 (BTullis)
[12:29:57] !log btullis@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: Restarting to pick up new JRE for T377938 - btullis@cumin1002 - T377938
[12:30:17] !log btullis@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: Restarting to pick up new JRE for T377938 - btullis@cumin1002 - T377938
[12:31:12] !log btullis@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: Restarting to pick up new JRE for T377938 - btullis@cumin1002 - T377938
[12:35:37] (CR) Harroyo-wmf: [C:+1] Enable IRS in the Project namespace on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1102813 (https://phabricator.wikimedia.org/T382061) (owner: Máté Szabó)
[12:36:19] !log btullis@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: Restarting to pick up new JRE for T377938 - btullis@cumin1002 - T377938
[12:41:06] SRE, serviceops: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#10400370 (Krd) Special:CreateAccount in a normal browser.
[12:46:14] (CR) Ammarpad: Enable IRS in the Project namespace on ptwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1102813 (https://phabricator.wikimedia.org/T382061) (owner: Máté Szabó)
[12:50:11] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:50:19] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:53:32] (PS1) Hnowlan: mesh.configuration: dummy commit [deployment-charts] - https://gerrit.wikimedia.org/r/1102838
[12:54:07] (CR) Kosta Harlan: [C:+1] Enable IRS in the Project namespace on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1102813 (https://phabricator.wikimedia.org/T382061) (owner: Máté Szabó)
[12:55:02] jouncebot: nowandnext
[12:55:02] No deployments scheduled for the next 0 hour(s) and 4 minute(s)
[12:55:02] In 0 hour(s) and 4 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1300)
[12:56:30] (PS2) Máté Szabó: Enable IRS in the Project namespace on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1102813 (https://phabricator.wikimedia.org/T382061)
[12:56:34] (CR) Máté Szabó: Enable IRS in the Project namespace on ptwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1102813 (https://phabricator.wikimedia.org/T382061) (owner: Máté Szabó)
[12:59:01] (PS1) Hnowlan: mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1102839 (https://phabricator.wikimedia.org/T371701)
[12:59:11] (CR) CI reject: [V:-1] mediawiki: use mesh.configuration 1.11.1 [deployment-charts] -
https://gerrit.wikimedia.org/r/1102839 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1300)
[13:00:47] (CR) Muehlenhoff: [C:+1] P:idm_test adjust idptest-users requirement [puppet] - https://gerrit.wikimedia.org/r/1102793 (owner: Slyngshede)
[13:01:10] (CR) TrainBranchBot: [C:+2] "Approved by mszabo@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1102813 (https://phabricator.wikimedia.org/T382061) (owner: Máté Szabó)
[13:02:02] (Merged) jenkins-bot: Enable IRS in the Project namespace on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1102813 (https://phabricator.wikimedia.org/T382061) (owner: Máté Szabó)
[13:02:17] !log mszabo@deploy2002 Started scap sync-world: Backport for [[gerrit:1102813|Enable IRS in the Project namespace on ptwiki (T382061)]]
[13:02:21] T382061: Enable incident reporting on project namespace for ptwiki - https://phabricator.wikimedia.org/T382061
[13:03:10] (PS1) Hnowlan: mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1102841 (https://phabricator.wikimedia.org/T371701)
[13:03:13] (Abandoned) Hnowlan: mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1102839 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[13:05:23] !log mszabo@deploy2002 mszabo: Backport for [[gerrit:1102813|Enable IRS in the Project namespace on ptwiki (T382061)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:06:09] (PS1) Slyngshede: Notifications: Improve wording in topic [software/bitu] - https://gerrit.wikimedia.org/r/1102843
[13:06:17] (CR) Slyngshede: [C:+2] P:idm_test adjust idptest-users requirement [puppet] - https://gerrit.wikimedia.org/r/1102793 (owner: Slyngshede)
[13:06:32] !log
mszabo@deploy2002 mszabo: Continuing with sync
[13:11:55] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1102843 (owner: Slyngshede)
[13:11:58] !log mszabo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102813|Enable IRS in the Project namespace on ptwiki (T382061)]] (duration: 09m 41s)
[13:12:02] T382061: Enable incident reporting on project namespace for ptwiki - https://phabricator.wikimedia.org/T382061
[13:13:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[13:14:34] (CR) Slyngshede: [C:+2] Notifications: Improve wording in topic [software/bitu] - https://gerrit.wikimedia.org/r/1102843 (owner: Slyngshede)
[13:15:22] (CR) Filippo Giunchedi: [C:+1] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[13:15:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc[1013,1017].eqiad.wmnet with reason: maintenance
[13:15:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc[1013,1017].eqiad.wmnet with reason: maintenance
[13:15:57] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2127.codfw.wmnet
[13:16:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2127.codfw.wmnet
[13:16:57] (CR) Filippo Giunchedi: [C:+1] "IIRC alerts should be undeployed automatically, easy to check post puppet-run on prometheus hosts whether /srv/alerts/ops/team-o11y_sli.ya" [alerts] - https://gerrit.wikimedia.org/r/1102366 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[13:17:45] (CR) DCausse: rdf-streaming-updater: add wdqs udpater streams in event stream config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1099727 (https://phabricator.wikimedia.org/T374919) (owner: DCausse)
[13:18:10] (PS1) KartikMistry: Update recommendation-api to 2024-12-12-085930-production [deployment-charts] - https://gerrit.wikimedia.org/r/1102845 (https://phabricator.wikimedia.org/T381889)
[13:18:24] (Merged) jenkins-bot: Notifications: Improve wording in topic [software/bitu] - https://gerrit.wikimedia.org/r/1102843 (owner: Slyngshede)
[13:18:27] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2127.codfw.wmnet with OS bookworm
[13:18:46] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2127
[13:18:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2127
[13:19:01] !log marostegui@cumin1002 START - Cookbook
sre.hosts.downtime for 1 day, 0:00:00 on pc[2014,2016].codfw.wmnet with reason: maintenance [13:19:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc[2014,2016].codfw.wmnet with reason: maintenance [13:20:23] 06SRE, 06serviceops: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#10400469 (10Clement_Goubert) I've tweaked a rate limiting rule, could you please try again? If possible, could you enable the developer tools in your browser, and get the request headers sent in the `POST` reque... [13:20:39] (03PS3) 10Volans: cookbook: add owner_team property [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100773 (https://phabricator.wikimedia.org/T379258) [13:20:39] (03CR) 10Volans: "ready" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100773 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [13:21:08] (03CR) 10Elukey: [C:03+2] services: use external_services for maps read replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101897 (owner: 10Elukey) [13:21:56] (03CR) 10Clément Goubert: [C:03+1] mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102841 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [13:22:31] PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:22:56] (03CR) 10Clément Goubert: mediawiki: generate names for mercurius jobs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102811 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [13:29:49] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [13:30:12] (03CR) 10JMeybohm: [WIP, DNM] create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [13:32:53] !log rebalance Ganeti cluster in codfw/D following server refresh T376594 [13:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:57] T376594: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594 [13:33:05] 06SRE, 06serviceops: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#10400510 (10Krd) The account was created in the meantime. I suggest I test this at the next opportunity and report here. Thank you! [13:36:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [13:36:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [13:36:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T381532)', diff saved to https://phabricator.wikimedia.org/P71703 and previous config saved to /var/cache/conftool/dbconfig/20241212-133633-marostegui.json [13:36:40] T381532: Fix AntiSpoof database schema drifts in production - https://phabricator.wikimedia.org/T381532 [13:37:10] FIRING: ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:38:18] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2127.codfw.wmnet with reason: host reimage [13:38:55] (03CR) 10Elukey: [C:03+1] cookbook: add owner_team property (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100773 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) 
[13:39:15] RESOLVED: ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:56] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [13:40:29] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10400526 (10MoritzMuehlenhoff) [13:40:40] (03PS1) 10Milimetric: analytics/html: update readme for MW history dump [puppet] - 10https://gerrit.wikimedia.org/r/1102848 (https://phabricator.wikimedia.org/T381390) [13:41:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2127.codfw.wmnet with reason: host reimage [13:41:14] !log installing Python 3.11 security updates [13:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:47] (03CR) 10Btullis: [V:03+1 C:03+2] s.m.a.r.t. - Exclude zram devices from data export [puppet] - 10https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: 10Btullis) [13:44:05] (03PS1) 10Muehlenhoff: (Re)assign builder role to build2002 [puppet] - 10https://gerrit.wikimedia.org/r/1102849 (https://phabricator.wikimedia.org/T379343) [13:46:18] !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons. [13:46:29] !log elukey@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. 
[13:46:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1169.eqiad.wmnet with reason: maintenance [13:46:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1169.eqiad.wmnet with reason: maintenance [13:47:02] !log elukey@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [13:47:47] !log elukey@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [13:48:01] !log elukey@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [13:48:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71704 and previous config saved to /var/cache/conftool/dbconfig/20241212-134824-root.json [13:48:35] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [13:48:43] (03PS1) 10Muehlenhoff: miscweb: Update Envoy firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1102850 [13:48:45] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [13:52:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1102850 (owner: 10Muehlenhoff) [13:52:39] !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons. 
[13:52:55] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [13:53:25] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [13:53:33] (03CR) 10Muehlenhoff: [C:03+2] (Re)assign builder role to build2002 [puppet] - 10https://gerrit.wikimedia.org/r/1102849 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [13:54:01] (03PS4) 10Volans: cookbook: add owner_team property [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100773 (https://phabricator.wikimedia.org/T379258) [13:54:04] (03CR) 10Volans: cookbook: add owner_team property (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100773 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [13:55:47] !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons. [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. [14:00:48] nothing to deploy :) [14:01:34] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:01:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2127.codfw.wmnet with OS bookworm [14:01:57] !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons. 
[14:02:40] (03PS1) 10Marostegui: site.pp: Update notes for codfw proxies [puppet] - 10https://gerrit.wikimedia.org/r/1102852 (https://phabricator.wikimedia.org/T381962) [14:03:00] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [14:03:11] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2127.codfw.wmnet [14:03:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2127.codfw.wmnet [14:03:17] elukey: sorry, IRC client hid this, but yes - I rebooted all the new nodes (SOP part of bringing them online) [14:03:28] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [14:03:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71705 and previous config saved to /var/cache/conftool/dbconfig/20241212-140329-root.json [14:03:54] (03CR) 10Marostegui: [C:03+2] site.pp: Update notes for codfw proxies [puppet] - 10https://gerrit.wikimedia.org/r/1102852 (https://phabricator.wikimedia.org/T381962) (owner: 10Marostegui) [14:04:48] 06SRE, 06serviceops: HTTP 429 error on VRT wiki trying to create account via Special:CreateAccount - https://phabricator.wikimedia.org/T359901#10400598 (10Aklapper) [14:04:50] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host db1208.eqiad.wmnet [14:09:04] 07sre-alert-triage, 10Data-Platform-SRE (2024.11.30 - 2024.12.20), 13Patch-For-Review: Exclude zram devices from disk health checks - https://phabricator.wikimedia.org/T380835#10400604 (10BTullis) 05Open→03Resolved [14:09:30] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:15:13] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host db1208.eqiad.wmnet [14:17:50] (03PS1) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [14:18:29] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [14:18:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71706 and previous config saved to /var/cache/conftool/dbconfig/20241212-141835-root.json [14:19:56] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4676/console" [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [14:21:16] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4677/console" [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [14:30:08] !log ladsgroup@mwmaint2002:~$ foreachwikiindblist all userOptions.php --delete VectorSkinVersion (T54777) [14:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:13] T54777: user_properties table bloat - https://phabricator.wikimedia.org/T54777 [14:30:22] (03CR) 10Gmodena: [C:03+1] rdf-streaming-updater: add wdqs udpater streams in event stream config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099727 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [14:31:18] (03CR) 10AOkoth: [C:03+1] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1097556 (https://phabricator.wikimedia.org/T380009) (owner: 10EoghanGaffney) [14:33:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71707 and previous config saved to /var/cache/conftool/dbconfig/20241212-143340-root.json [14:39:53] (03CR) 10Hnowlan: [C:03+2] mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102841 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [14:40:00] (03CR) 10Hnowlan: mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102841 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [14:40:20] (03CR) 10Hnowlan: [C:03+2] mesh.configuration: dummy commit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102838 (owner: 10Hnowlan) [14:41:34] (03Merged) 10jenkins-bot: mesh.configuration: dummy commit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102838 (owner: 10Hnowlan) [14:41:55] (03PS2) 10Hnowlan: mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102841 (https://phabricator.wikimedia.org/T371701) [14:45:19] (03PS2) 10Hnowlan: mediawiki: generate names for mercurius jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102811 (https://phabricator.wikimedia.org/T371700) [14:45:27] (03CR) 10Hnowlan: [C:03+2] mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102841 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [14:47:03] (03PS11) 10Elukey: charts: Add kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) [14:47:09] (03PS7) 10Elukey: admin_ng: add the kartotherian namespace on Wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101487 (https://phabricator.wikimedia.org/T216826) 
[14:47:13] (03PS8) 10Elukey: services: add helmfile config for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826) [14:47:24] (03Merged) 10jenkins-bot: mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102841 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [14:48:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71708 and previous config saved to /var/cache/conftool/dbconfig/20241212-144846-root.json [14:53:08] (03CR) 10Eevans: [C:03+2] aqs: Upgrade Cassandra to 4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1102377 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [14:53:55] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [14:54:00] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [14:54:14] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply [14:54:27] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply [14:54:29] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [14:54:35] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [14:55:37] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [14:55:41] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [14:57:52] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [14:57:55] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [14:58:04] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [14:58:07] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [14:58:26] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [14:58:31] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [15:02:11] (03PS1) 10Ladsgroup: Add tigwiki to pre-install [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102872 (https://phabricator.wikimedia.org/T381377) [15:02:31] (03CR) 10Sbisson: [C:03+1] Update recommendation-api to 2024-12-12-085930-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102845 (https://phabricator.wikimedia.org/T381889) (owner: 10KartikMistry) [15:02:35] jouncebot: nowandnext [15:02:35] No deployments scheduled for the next 1 hour(s) and 57 minute(s) [15:02:35] In 1 hour(s) and 57 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1700) [15:02:36] (03CR) 10JMeybohm: [C:03+1] services: add helmfile config for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [15:02:41] cool [15:03:15] (03CR) 10JMeybohm: [C:03+1] charts: Add kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [15:03:28] !log eevans@cumin1002 END (ERROR) - Cookbook sre.cassandra.roll-restart (exit_code=97) for nodes matching A:aqs-codfw: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [15:03:32] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [15:04:44] RESOLVED: [2x] 
IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [15:04:44] RESOLVED: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [15:05:13] FIRING: IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [15:05:14] FIRING: IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [15:06:18] (03CR) 10Elukey: [C:03+2] charts: Add kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [15:06:33] (03CR) 10Elukey: [C:03+2] charts: Add kartotherian (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [15:07:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102872 (https://phabricator.wikimedia.org/T381377) (owner: 10Ladsgroup) [15:07:35] (03Merged) 10jenkins-bot: charts: Add kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) 
(owner: 10Elukey) [15:07:50] (03Merged) 10jenkins-bot: Add tigwiki to pre-install [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102872 (https://phabricator.wikimedia.org/T381377) (owner: 10Ladsgroup) [15:08:05] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1102872|Add tigwiki to pre-install (T381377)]] [15:08:10] T381377: Create Wikipedia Tigre - https://phabricator.wikimedia.org/T381377 [15:09:11] !log bking@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: 0.3.150 [15:09:17] !log bking@deploy2002 Finished deploy [wdqs/wdqs@9927a5a]: 0.3.150 (duration: 00m 05s) [15:09:17] (03CR) 10Elukey: [C:03+2] admin_ng: add the kartotherian namespace on Wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101487 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [15:09:25] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [15:09:27] (03CR) 10Elukey: [C:03+2] services: add helmfile config for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [15:09:29] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [15:10:03] (03PS1) 10Bking: wdqs1025: add as dsh target [puppet] - 10https://gerrit.wikimedia.org/r/1102874 (https://phabricator.wikimedia.org/T376150) [15:11:38] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1102872|Add tigwiki to pre-install (T381377)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:12:06] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [15:15:59] (03CR) 10KartikMistry: [C:03+2] Update recommendation-api to 2024-12-12-085930-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102845 (https://phabricator.wikimedia.org/T381889) (owner: 10KartikMistry) [15:16:04] FIRING: PuppetConstantChange: Puppet performing a 
change on every puppet run on wdqs1025:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [15:16:26] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1025.eqiad.wmnet with reason: T376150 [15:16:29] (03CR) 10JHathaway: [C:03+2] vrts: Update mail alias generation script to bail on too many changes [puppet] - 10https://gerrit.wikimedia.org/r/1097556 (https://phabricator.wikimedia.org/T380009) (owner: 10EoghanGaffney) [15:16:30] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [15:16:42] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1025.eqiad.wmnet with reason: T376150 [15:16:52] !log elukey@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [15:17:07] (03Merged) 10jenkins-bot: Update recommendation-api to 2024-12-12-085930-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102845 (https://phabricator.wikimedia.org/T381889) (owner: 10KartikMistry) [15:17:41] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102872|Add tigwiki to pre-install (T381377)]] (duration: 09m 35s) [15:17:44] T381377: Create Wikipedia Tigre - https://phabricator.wikimedia.org/T381377 [15:18:04] !log elukey@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [15:18:17] (03CR) 10CDanis: [C:03+1] wdqs1025: add as dsh target [puppet] - 10https://gerrit.wikimedia.org/r/1102874 (https://phabricator.wikimedia.org/T376150) (owner: 10Bking) [15:18:24] (03CR) 10JHathaway: [C:03+1] wdqs1025: add as dsh target [puppet] - 10https://gerrit.wikimedia.org/r/1102874 (https://phabricator.wikimedia.org/T376150) (owner: 10Bking) [15:19:02] !log elukey@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. 
[15:19:31] !log elukey@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [15:19:38] (03CR) 10Bking: [C:03+2] wdqs1025: add as dsh target [puppet] - 10https://gerrit.wikimedia.org/r/1102874 (https://phabricator.wikimedia.org/T376150) (owner: 10Bking) [15:20:13] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [15:22:35] (03PS1) 10Ladsgroup: Activate tigwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102879 (https://phabricator.wikimedia.org/T381377) [15:22:50] (03PS1) 10Hnowlan: Revert "shellbox-video: scale up to 60 replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102881 [15:22:51] (03PS1) 10Volans: admin: add myself to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1102880 [15:23:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102879 (https://phabricator.wikimedia.org/T381377) (owner: 10Ladsgroup) [15:24:01] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [15:24:29] (03Merged) 10jenkins-bot: Activate tigwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102879 (https://phabricator.wikimedia.org/T381377) (owner: 10Ladsgroup) [15:24:45] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1102879|Activate tigwiki (T381377)]] [15:24:49] T381377: Create Wikipedia Tigre - https://phabricator.wikimedia.org/T381377 [15:24:59] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [15:25:36] (03PS1) 10Muehlenhoff: Set profile::docker::builder::docker_pkg: false for build2002 [puppet] - 10https://gerrit.wikimedia.org/r/1102883 (https://phabricator.wikimedia.org/T379343) [15:27:32] (03PS2) 10Muehlenhoff: Set profile::docker::builder::docker_pkg: false for build2002 [puppet] - 10https://gerrit.wikimedia.org/r/1102883 (https://phabricator.wikimedia.org/T379343) [15:28:00] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1102879|Activate tigwiki (T381377)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:28:38] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [15:29:18] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [15:30:10] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, this needs no further approval and you can self-merge" [puppet] - 10https://gerrit.wikimedia.org/r/1102880 (owner: 10Volans) [15:30:34] (03CR) 10Volans: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1102880 (owner: 10Volans) [15:32:44] (03PS1) 10Klausman: sre/ores: remove obsolete ORES cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1102886 (https://phabricator.wikimedia.org/T379259) [15:34:05] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [15:34:10] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102879|Activate tigwiki (T381377)]] (duration: 09m 25s) [15:34:14] T381377: Create Wikipedia Tigre - https://phabricator.wikimedia.org/T381377 [15:39:40] (03PS1) 10Volans: sre.hosts.upgrade-and-reboot: remove cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1102887 (https://phabricator.wikimedia.org/T379259) [15:39:45] (03PS1) 10Muehlenhoff: sre.misc-clusters.thumbor: Remove obsolete cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1102888 [15:40:20] (03CR) 10Hnowlan: [C:03+1] sre.misc-clusters.thumbor: Remove obsolete cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1102888 (owner: 10Muehlenhoff) [15:40:30] (03CR) 10Volans: "Proposing the removal of this old and half-baked cookbook. Adding the last known users to check if there is any objection." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1102887 (https://phabricator.wikimedia.org/T379259) (owner: 10Volans) [15:40:46] (03PS1) 10Elukey: services: fix helmfile config for kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102889 [15:40:59] (03CR) 10Xcollazo: [C:03+1] analytics/html: update readme for MW history dump [puppet] - 10https://gerrit.wikimedia.org/r/1102848 (https://phabricator.wikimedia.org/T381390) (owner: 10Milimetric) [15:43:06] !log bking@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: 0.3.150 [15:43:19] !log bking@deploy2002 Finished deploy [wdqs/wdqs@9927a5a]: 0.3.150 (duration: 00m 13s) [15:43:31] (03CR) 10Volans: [C:03+1] "LGTM, thx" [cookbooks] - 10https://gerrit.wikimedia.org/r/1102886 (https://phabricator.wikimedia.org/T379259) (owner: 10Klausman) [15:43:36] (03PS9) 10Kamila Součková: [WIP, DNM] create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) [15:44:14] (03CR) 10Kamila Součková: [WIP, DNM] create sre.k8s.roll-reimage-nodes (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [15:46:41] (03CR) 10Muehlenhoff: [C:03+1] "Good to go IMO" [cookbooks] - 10https://gerrit.wikimedia.org/r/1102887 (https://phabricator.wikimedia.org/T379259) (owner: 10Volans) [15:46:49] (03CR) 10Muehlenhoff: [C:03+2] sre.misc-clusters.thumbor: Remove obsolete cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1102888 (owner: 10Muehlenhoff) [15:47:40] (03CR) 10Volans: [C:03+1] "LGTM, you can test it with the test-cookbook both in dry-run and real runs (I suggest to leave it log to SAL for awareness)." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [15:48:48] (03CR) 10Muehlenhoff: [C:03+2] "Ran the steps from https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Renaming/Deleting_a_cookbook post merge" [cookbooks] - 10https://gerrit.wikimedia.org/r/1102888 (owner: 10Muehlenhoff) [15:49:13] (03CR) 10Muehlenhoff: [C:03+2] Set profile::docker::builder::docker_pkg: false for build2002 [puppet] - 10https://gerrit.wikimedia.org/r/1102883 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [15:51:21] (03CR) 10Elukey: [C:03+2] services: fix helmfile config for kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102889 (owner: 10Elukey) [15:56:31] (03PS2) 10DLynch: EditCheck: move checks to a sidebar [extensions/VisualEditor] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102885 (https://phabricator.wikimedia.org/T341308) [15:56:36] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [15:59:18] (03CR) 10Clément Goubert: [C:03+1] Revert "shellbox-video: scale up to 60 replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102881 (owner: 10Hnowlan) [15:59:34] 06SRE, 06serviceops: blank 429 error attempting to create accounts on checkuserwiki - https://phabricator.wikimedia.org/T382048#10400945 (10JJMC89) Similar to {T359901}, which appears to be caused by the mitigation for {T341908}. 
[15:59:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10400950 (10phaultfinder) [16:00:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102885 (https://phabricator.wikimedia.org/T341308) (owner: 10DLynch) [16:04:03] (03CR) 10Filippo Giunchedi: [C:03+1] Revert "shellbox-video: scale up to 60 replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102881 (owner: 10Hnowlan) [16:04:25] (03CR) 10Hnowlan: [C:03+2] Revert "shellbox-video: scale up to 60 replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102881 (owner: 10Hnowlan) [16:05:36] (03Merged) 10jenkins-bot: Revert "shellbox-video: scale up to 60 replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102881 (owner: 10Hnowlan) [16:06:19] (03PS10) 10Kamila Součková: [WIP, DNM] create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) [16:06:40] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [16:06:43] !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1270-1275].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [16:08:26] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1270.eqiad.wmnet with OS bookworm [16:08:35] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1270 [16:08:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1270 [16:09:52] (03CR) 10Herron: [C:03+2] "perfect thanks!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1102366 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [16:13:20] (03CR) 10Dzahn: [C:03+2] admin: add group approvers for druid-admins and htmldumps-admin [puppet] - 10https://gerrit.wikimedia.org/r/1087575 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [16:13:40] (03PS1) 10Elukey: charts: update kartotherian's entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102894 [16:14:03] (03CR) 10Dzahn: [C:03+2] miscweb: Update Envoy firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1102850 (owner: 10Muehlenhoff) [16:16:39] (03CR) 10Elukey: [C:03+2] charts: update kartotherian's entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102894 (owner: 10Elukey) [16:17:11] (03PS1) 10Volans: tests: add test for the ownership field [cookbooks] - 10https://gerrit.wikimedia.org/r/1102896 (https://phabricator.wikimedia.org/T379258) [16:17:17] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:18:19] (03CR) 10CI reject: [V:04-1] EditCheck: move checks to a sidebar [extensions/VisualEditor] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102885 (https://phabricator.wikimedia.org/T341308) (owner: 10DLynch) [16:19:44] (03CR) 10Volans: [C:03+2] cookbook: add owner_team property [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100773 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [16:19:55] (03PS3) 10DLynch: EditCheck: move checks to a sidebar [extensions/VisualEditor] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102885 (https://phabricator.wikimedia.org/T341308) [16:19:58] (03CR) 10Dzahn: [C:03+2] "yep, https://commons-query.wikimedia.org/ still working" [puppet] - 10https://gerrit.wikimedia.org/r/1102850 (owner: 10Muehlenhoff) [16:20:37] (03CR) 10Elukey: [C:03+1] tests: add test for the ownership field [cookbooks] - 10https://gerrit.wikimedia.org/r/1102896 (https://phabricator.wikimedia.org/T379258) 
(owner: 10Volans) [16:23:00] (03PS1) 10Hnowlan: Revert "mw-videoscaler: lower concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102898 [16:27:21] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [16:27:41] PROBLEM - MariaDB Replica SQL: s3 #page on db2149 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table templatelinks is corrupt: try to repair it on query. Default database: aswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:27:42] PROBLEM - MariaDB Replica SQL: s3 #page on db2194 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table templatelinks is corrupt: try to repair it on query. Default database: aswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:27:51] PROBLEM - MariaDB Replica SQL: s3 #page on db1175 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table templatelinks is corrupt: try to repair it on query. Default database: aswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:27:55] PROBLEM - MariaDB Replica SQL: s3 #page on db2205 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table templatelinks is corrupt: try to repair it on query. Default database: aswiki. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:28:05] o_O
[16:28:08] !incidents
[16:28:09] 5537 (UNACKED) db2149 (paged)/MariaDB Replica SQL: s3 (paged)
[16:28:09] 5538 (UNACKED) db2194 (paged)/MariaDB Replica SQL: s3 (paged)
[16:28:09] 5539 (UNACKED) db1175 (paged)/MariaDB Replica SQL: s3 (paged)
[16:28:10] 5540 (UNACKED) db2205 (paged)/MariaDB Replica SQL: s3 (paged)
[16:28:10] 5536 (RESOLVED) ProbeDown sre (10.2.1.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 codfw)
[16:28:18] sigh
[16:28:22] this is fun
[16:28:30] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1270.eqiad.wmnet with reason: host reimage
[16:28:33] !ack 5537 5538
[16:28:33] Could not ack the alert. Please check the parameters.
[16:28:35] PROBLEM - MariaDB Replica SQL: s3 #page on db2177 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table templatelinks is corrupt: try to repair it on query. Default database: aswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:28:39] PROBLEM - MariaDB Replica SQL: s3 #page on db2190 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table templatelinks is corrupt: try to repair it on query. Default database: aswiki.
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:28:52] !ack 5537
[16:28:52] 5537 (ACKED) db2149 (paged)/MariaDB Replica SQL: s3 (paged)
[16:28:54] !ack 5538
[16:28:54] 5538 (ACKED) db2194 (paged)/MariaDB Replica SQL: s3 (paged)
[16:28:56] !ack 5539
[16:28:57] 5539 (ACKED) db1175 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:00] !ack 5540
[16:29:00] 5540 (ACKED) db2205 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:05] give me a bit
[16:29:06] !incidents
[16:29:07] I fix this
[16:29:07] 5537 (ACKED) db2149 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:07] 5538 (ACKED) db2194 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:07] 5539 (ACKED) db1175 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:07] 5540 (ACKED) db2205 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:08] 5541 (UNACKED) db2177 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:08] 5542 (UNACKED) db2190 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:08] 5536 (RESOLVED) ProbeDown sre (10.2.1.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 codfw)
[16:29:24] Amir1: thanks!
[16:29:34] !ack 5541
[16:29:35] 5541 (ACKED) db2177 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:35] !ack 5542
[16:29:35] 5542 (ACKED) db2190 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:47] Thanks Amir.
[16:30:45] RECOVERY - MariaDB Replica SQL: s3 #page on db2149 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:31:16] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1270.eqiad.wmnet with reason: host reimage
[16:31:36] RECOVERY - MariaDB Replica SQL: s3 #page on db2177 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:31:40] RECOVERY - MariaDB Replica SQL: s3 #page on db2190 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:31:45] (03Merged) 10jenkins-bot: cookbook: add owner_team property [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100773 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans)
[16:32:45] !incidents
[16:32:45] all fixed
[16:32:45] 5538 (ACKED) db2194 (paged)/MariaDB Replica SQL: s3 (paged)
[16:32:45] 5539 (ACKED) db1175 (paged)/MariaDB Replica SQL: s3 (paged)
[16:32:46] 5540 (ACKED) db2205 (paged)/MariaDB Replica SQL: s3 (paged)
[16:32:46] 5542 (RESOLVED) db2190 (paged)/MariaDB Replica SQL: s3 (paged)
[16:32:46] 5541 (RESOLVED) db2177 (paged)/MariaDB Replica SQL: s3 (paged)
[16:32:46] RECOVERY - MariaDB Replica SQL: s3 #page on db2194 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:32:46] 5537 (RESOLVED) db2149 (paged)/MariaDB Replica SQL: s3 (paged)
[16:32:46] 5536 (RESOLVED) ProbeDown sre (10.2.1.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 codfw)
[16:32:56] thanks much Amir1!
[16:34:33] (03CR) 10Hnowlan: [C:03+2] Revert "mw-videoscaler: lower concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102898 (owner: 10Hnowlan) [16:35:33] (03Merged) 10jenkins-bot: Revert "mw-videoscaler: lower concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102898 (owner: 10Hnowlan) [16:36:05] (03CR) 10Volans: [C:03+2] tests: add test for the ownership field [cookbooks] - 10https://gerrit.wikimedia.org/r/1102896 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [16:37:38] RECOVERY - MariaDB Replica SQL: s3 #page on db1175 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:37:44] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [16:37:48] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [16:39:13] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [16:39:13] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [16:39:26] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [16:41:41] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:42:59] (03PS3) 10Hnowlan: mediawiki: generate names for mercurius jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102811 (https://phabricator.wikimedia.org/T371700) [16:43:09] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [16:44:31] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [16:45:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:45:51] (03Merged) 10jenkins-bot: tests: add test for 
the ownership field [cookbooks] - 10https://gerrit.wikimedia.org/r/1102896 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [16:46:33] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10401062 (10LSobanski) @Platonides could you help us verify this? [16:47:11] (03CR) 10Hnowlan: [C:03+2] mediawiki: generate names for mercurius jobs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102811 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [16:47:40] (03PS1) 10Elukey: charts: add volume mount for /etc to Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102901 [16:49:01] (03Merged) 10jenkins-bot: mediawiki: generate names for mercurius jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102811 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [16:49:15] 06SRE, 06serviceops: HTTP 429 error on private wikis trying to create account via Special:CreateAccount - https://phabricator.wikimedia.org/T359901#10401070 (10Clement_Goubert) p:05Triage→03High [16:49:44] 06SRE, 06serviceops: HTTP 429 error on private wikis trying to create account via Special:CreateAccount - https://phabricator.wikimedia.org/T359901#10401073 (10Clement_Goubert) [16:49:45] 06SRE, 06serviceops: blank 429 error attempting to create accounts on checkuserwiki - https://phabricator.wikimedia.org/T382048#10401076 (10Clement_Goubert) →14Duplicate dup:03T359901 [16:50:11] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1270.eqiad.wmnet with OS bookworm [16:50:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:51:45] !log 
elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [16:51:54] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1271.eqiad.wmnet with OS bookworm [16:52:02] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1271 [16:52:02] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1271 [16:52:10] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [16:52:14] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [16:52:47] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [16:52:51] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [16:53:00] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: sync [16:53:03] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: sync [16:53:26] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: sync [16:53:29] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: sync [16:55:57] (03PS1) 10Hnowlan: mediawiki: revert generateName behaviour in mercurius job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102902 (https://phabricator.wikimedia.org/T371700) [16:56:19] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: Decommission kubernetes20[07-14].codfw.wmnet - https://phabricator.wikimedia.org/T379788#10401093 (10Jhancock.wm) [16:57:10] 06SRE, 06serviceops: HTTP 429 error on private wikis trying to create account via Special:CreateAccount - https://phabricator.wikimedia.org/T359901#10401106 (10Clement_Goubert) >>! In T359901#10400510, @Krd wrote: > The account was created in the meantime. I suggest I test this at the next opportunity and repo... 
[16:57:21] 06SRE, 06serviceops: HTTP 429 error on private wikis trying to create account via Special:CreateAccount - https://phabricator.wikimedia.org/T359901#10401108 (10Clement_Goubert) 05Open→03In progress [16:57:42] RECOVERY - MariaDB Replica SQL: s3 #page on db2205 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:58:38] (03PS2) 10Elukey: charts: add volume mount for /etc to Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102901 [16:58:55] (03CR) 10Clément Goubert: [C:03+1] mediawiki: revert generateName behaviour in mercurius job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102902 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [16:59:51] (03PS2) 10Hnowlan: mediawiki: use release name to create unique jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102902 (https://phabricator.wikimedia.org/T371700) [17:00:05] jhathaway and rzl: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:02:39] (03CR) 10Clément Goubert: [C:03+1] mediawiki: use release name to create unique jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102902 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [17:03:21] (03CR) 10Hnowlan: [C:03+2] mediawiki: use release name to create unique jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102902 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [17:05:03] (03Merged) 10jenkins-bot: mediawiki: use release name to create unique jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102902 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [17:06:43] (03CR) 10Elukey: [C:03+2] charts: add volume mount for /etc to Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102901 (owner: 10Elukey) [17:08:41] (03PS1) 10Bvibber: Update chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102903 (https://phabricator.wikimedia.org/T382039) [17:11:07] anybody doing any infrastructure/service deploys during this hour before the mw backport window? [17:11:16] if not i'll go ahead and deploy that service update for chart-renderer [17:11:23] (03PS1) 10Hnowlan: mediawiki: truncate mercurius job names at 63 chars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102904 (https://phabricator.wikimedia.org/T371700) [17:11:56] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1271.eqiad.wmnet with reason: host reimage [17:12:07] bvibber: at least the part that is on the deployment calendar, no, nothing. 
puppet window is empty [17:12:10] FIRING: ProbeDown: Service aqs1012-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1012-b:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:12:26] spiffy :D [17:13:29] (03PS1) 10DDesouza: Reader Survey: Deploy on eswiki, dewiki and frwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102905 (https://phabricator.wikimedia.org/T378660) [17:13:38] jouncebot: nowandnext [17:13:38] For the next 0 hour(s) and 46 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1700) [17:13:38] In 0 hour(s) and 46 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1800) [17:13:39] In 0 hour(s) and 46 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1800) [17:13:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:13:45] 06SRE, 06serviceops: HTTP 429 error on private wikis trying to create account via Special:CreateAccount - https://phabricator.wikimedia.org/T359901#10401161 (10Clement_Goubert) 05In progress→03Resolved a:03Clement_Goubert The problem should be resolved for all private wikis now. 
[17:13:49] !log doing service deploy for chart-renderer (T382039) [17:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:53] T382039: chart-renderer services update - https://phabricator.wikimedia.org/T382039 [17:14:15] RESOLVED: ProbeDown: Service aqs1012-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1012-b:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:14:36] (03CR) 10Clément Goubert: [C:03+1] mediawiki: truncate mercurius job names at 63 chars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102904 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [17:14:47] (03CR) 10Bvibber: [C:03+2] "merging for deploy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102903 (https://phabricator.wikimedia.org/T382039) (owner: 10Bvibber) [17:14:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102905 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [17:15:04] (03CR) 10Hnowlan: [C:03+2] mediawiki: truncate mercurius job names at 63 chars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102904 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [17:15:05] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1271.eqiad.wmnet with reason: host reimage [17:15:36] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [17:15:48] (03CR) 10Btullis: [C:03+2] Add Blunderbuss firewall rule to GitLab runner set [puppet] - 10https://gerrit.wikimedia.org/r/1101925 (https://phabricator.wikimedia.org/T371994) (owner: 10Aleksandar Mastilovic) [17:16:03] 
(03Merged) 10jenkins-bot: Update chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102903 (https://phabricator.wikimedia.org/T382039) (owner: 10Bvibber) [17:16:59] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [17:17:25] (03Merged) 10jenkins-bot: mediawiki: truncate mercurius job names at 63 chars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102904 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [17:18:16] (03PS2) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [17:18:25] !log bvibber@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [17:18:54] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [17:19:08] !log bvibber@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [17:19:47] !log bvibber@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [17:20:12] (03PS3) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [17:20:18] !log bvibber@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [17:20:26] !log bvibber@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [17:20:30] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: Decommission kubernetes20[07-14].codfw.wmnet - https://phabricator.wikimedia.org/T379788#10401197 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:21:02] !log bvibber@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [17:21:26] (03PS1) 10Hnowlan: mediawiki: truncate job names properly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102906 (https://phabricator.wikimedia.org/T371700) [17:21:33] (03CR) 10CI reject: [V:04-1] mediawiki: truncate job 
names properly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102906 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [17:22:05] (03PS2) 10Hnowlan: mediawiki: truncate job names properly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102906 (https://phabricator.wikimedia.org/T371700) [17:22:37] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [17:23:28] !log charts-renderer deployment T382039 complete [17:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:32] T382039: chart-renderer services update - https://phabricator.wikimedia.org/T382039 [17:25:40] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [17:27:09] (03PS4) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [17:31:24] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4678/console" [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [17:31:37] (03CR) 10Scott French: [C:03+1] mediawiki: truncate job names properly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102906 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [17:32:45] (03PS5) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [17:32:45] (03CR) 10Hnowlan: [C:03+2] mediawiki: truncate job names properly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102906 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [17:33:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1271.eqiad.wmnet with OS bookworm [17:34:42] (03Merged) 10jenkins-bot: mediawiki: truncate job names properly 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1102906 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [17:35:33] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1272.eqiad.wmnet with OS bookworm [17:35:41] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1272 [17:35:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1272 [17:37:04] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [17:37:10] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [17:37:25] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4679/console" [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [17:38:53] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@c2d7e08]: Backfill pageview actor hourly 2024 12 [17:41:01] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [17:41:06] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [17:41:29] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply [17:41:42] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply [17:41:56] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@c2d7e08]: Backfill pageview actor hourly 2024 12 (duration: 03m 03s) [17:51:11] (03PS1) 10Michael Große: beta: enable updating link-suggestions from read-mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102909 (https://phabricator.wikimedia.org/T378536) [17:53:57] !log killing wikidatawiki xml dump process to try to unstick it - T382084 [17:54:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:01] T382084: 20241201 wikidatawiki xml dump not progressing - https://phabricator.wikimedia.org/T382084 [17:54:25] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [17:54:32] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [17:54:39] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.01e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [17:55:46] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1272.eqiad.wmnet with reason: host reimage [17:58:13] (03PS1) 10Gmodena: dse-k8s: content-history: add kafka cluster domain [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102913 (https://phabricator.wikimedia.org/T381322) [17:58:28] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1272.eqiad.wmnet with reason: host reimage [17:58:31] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [18:00:05] bd808: Time to snap out of that daydream and deploy Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1800). 
[18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1800) [18:00:33] (03PS1) 10Hnowlan: mediawiki: try to preserve superseded mercurius instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102915 (https://phabricator.wikimedia.org/T371700) [18:08:47] !log Running `mwscript-k8s -f -- extensions/WikiLambda/maintenance/updateSecondaryTables.php --wiki=wikifunctionswiki --zType Z4 --report --verbose` [18:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:30] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:17:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1272.eqiad.wmnet with OS bookworm [18:19:12] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1273.eqiad.wmnet with OS bookworm [18:19:20] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1273 [18:19:20] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1273 [18:22:27] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1102887 (https://phabricator.wikimedia.org/T379259) (owner: 10Volans) [18:22:27] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [18:22:32] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [18:24:01] (03CR) 10BCornwall: [C:03+1] IDM update to Bitu 0.1.4 [dns] - 10https://gerrit.wikimedia.org/r/1102762 (owner: 10Slyngshede) [18:28:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:29:12] (03CR) 10Scott French: [C:03+1] mediawiki: try to preserve superseded mercurius instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102915 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [18:34:06] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Ammarpad - https://phabricator.wikimedia.org/T381851#10401485 (10Scott_French) @Ammarpad - FYI, @thcipriani is out this week, so the next update here will likely be next week. Thanks for your patience. 
[18:38:31] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[18:39:26] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1273.eqiad.wmnet with reason: host reimage
[18:42:58] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1273.eqiad.wmnet with reason: host reimage
[18:45:58] (CR) SBassett: [C:+1] Fix protocol for .well-known/change-password Apache rule [puppet] - https://gerrit.wikimedia.org/r/1101462 (https://phabricator.wikimedia.org/T381625) (owner: Gergő Tisza)
[19:02:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1273.eqiad.wmnet with OS bookworm
[19:02:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:03:59] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1274.eqiad.wmnet with OS bookworm
[19:04:06] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1274
[19:04:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1274
[19:12:00] ops-esams, ops-magru, SRE, DC-Ops, Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10401554 (BCornwall) Stalled→In progress p:Triage→High
[19:12:38] ops-codfw, SRE, DC-Ops: Remove defunct lvs cross-dc links in Netbox (lvs2011 & lvs2013) - https://phabricator.wikimedia.org/T381533#10401561 (Papaul) a:Papaul
[19:15:15] (PS6) Scott French: mw-api-int: add migration release [deployment-charts] - https://gerrit.wikimedia.org/r/1081450 (https://phabricator.wikimedia.org/T377040)
[19:15:16] (PS6) Scott French: mw-api-int: remove "migration" release values overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1081452 (https://phabricator.wikimedia.org/T377040)
[19:15:16] (PS6) Scott French: mediawiki: add remaining migration releases [deployment-charts] - https://gerrit.wikimedia.org/r/1082863 (https://phabricator.wikimedia.org/T377040)
[19:15:16] (PS6) Scott French: mediawiki: remove migration release overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1082864 (https://phabricator.wikimedia.org/T377040)
[19:15:17] (PS2) Scott French: mw-(apt-ext|api-int|jobrunner|parsoid|web): set php.version to 8.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1101121 (https://phabricator.wikimedia.org/T377040)
[19:15:40] (PS5) Scott French: hieradata: add "migration" release of mw-api-int [puppet] - https://gerrit.wikimedia.org/r/1081451 (https://phabricator.wikimedia.org/T377040)
[19:15:40] (PS4) Scott French: hieradata: add remaining "migration" releases [puppet] - https://gerrit.wikimedia.org/r/1082865 (https://phabricator.wikimedia.org/T377040)
[19:15:40] (PS2) Scott French: hieradata: switch all "migration" releases to 8.1 [puppet] - https://gerrit.wikimedia.org/r/1101122 (https://phabricator.wikimedia.org/T377040)
[19:16:30] jouncebot: nowandnext
[19:16:30] No deployments scheduled for the next 1 hour(s) and 43 minute(s)
[19:16:30] In 1 hour(s) and 43 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T2100)
[19:18:02] unless there are any objections, I might merge some changes shortly that will require a no-diff scap deployment to actuate
[19:20:13] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[19:20:51] (CR) Dzahn: "is it expected that https://query-scholarly.wikidata.org/querybuilder is a 404?" [puppet] - https://gerrit.wikimedia.org/r/1102320 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[19:21:38] (CR) Dzahn: [C:+1] "looks like a noop, yea" [puppet] - https://gerrit.wikimedia.org/r/1102320 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[19:22:31] moving head
[19:22:39] (CR) Scott French: [C:+2] mw-api-int: add migration release [deployment-charts] - https://gerrit.wikimedia.org/r/1081450 (https://phabricator.wikimedia.org/T377040) (owner: Scott French)
[19:23:47] (Merged) jenkins-bot: mw-api-int: add migration release [deployment-charts] - https://gerrit.wikimedia.org/r/1081450 (https://phabricator.wikimedia.org/T377040) (owner: Scott French)
[19:24:08] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1274.eqiad.wmnet with reason: host reimage
[19:25:14] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[19:27:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1274.eqiad.wmnet with reason: host reimage
[19:27:48] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[19:27:54] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[19:30:30] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[19:30:34] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[19:31:09] (CR) Scott French: [C:+2] hieradata: add "migration" release of mw-api-int [puppet] - https://gerrit.wikimedia.org/r/1081451 (https://phabricator.wikimedia.org/T377040) (owner: Scott French)
[19:33:55] SRE, Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10401633 (MoritzMuehlenhoff)
[19:35:05] (CR) Muehlenhoff: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1102762 (owner: Slyngshede)
[19:40:18] !log swfrench@deploy2002 Started scap sync-world: Deployment to populate mw-api-int migration release files - T377040
[19:40:23] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040
[19:42:32] !log swfrench@deploy2002 Finished scap sync-world: Deployment to populate mw-api-int migration release files - T377040 (duration: 02m 13s)
[19:43:36] (CR) Scott French: [C:+2] mw-api-int: remove "migration" release values overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1081452 (https://phabricator.wikimedia.org/T377040) (owner: Scott French)
[19:44:40] (Merged) jenkins-bot: mw-api-int: remove "migration" release values overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1081452 (https://phabricator.wikimedia.org/T377040) (owner: Scott French)
[19:46:43] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1274.eqiad.wmnet with OS bookworm
[19:48:28] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1275.eqiad.wmnet with OS bookworm
[19:48:36] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1275
[19:48:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1275
[19:48:42] all done for now on my end
[19:53:48] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[20:21:36] (PS1) CDanis: chart-renderer: enable probedown paging [puppet] - https://gerrit.wikimedia.org/r/1102955 (https://phabricator.wikimedia.org/T372081)
[20:29:17] (PS2) CDanis: chart-renderer: enable probedown paging [puppet] - https://gerrit.wikimedia.org/r/1102955 (https://phabricator.wikimedia.org/T372081)
[20:29:17] (PS1) CDanis: chart-renderer: probe the service [puppet] - https://gerrit.wikimedia.org/r/1102958 (https://phabricator.wikimedia.org/T372081)
[20:29:26] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1102958 (https://phabricator.wikimedia.org/T372081) (owner: CDanis)
[20:32:56] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1275.eqiad.wmnet with OS bookworm
[20:32:56] !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=1) rolling reimage on P{wikikube-worker[1270-1275].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[20:37:19] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1275.eqiad.wmnet with OS bookworm
[20:37:22] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1275
[20:37:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1275
[20:38:29] !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker127[6-7].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[20:40:41] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1276.eqiad.wmnet with OS bookworm
[20:40:50] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1276
[20:40:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1276
[20:41:26] (PS1) Cwhite: change metric types back to counters [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1102959 (https://phabricator.wikimedia.org/T374050)
[20:42:45] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1102959 (https://phabricator.wikimedia.org/T374050) (owner: Cwhite)
[20:43:45] (PS2) CDanis: chart-renderer: probe the service [puppet] - https://gerrit.wikimedia.org/r/1102958 (https://phabricator.wikimedia.org/T372081)
[20:43:45] (PS3) CDanis: chart-renderer: enable probedown paging [puppet] - https://gerrit.wikimedia.org/r/1102955 (https://phabricator.wikimedia.org/T372081)
[20:43:52] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1102958 (https://phabricator.wikimedia.org/T372081) (owner: CDanis)
[20:44:03] PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:47:15] (CR) CDanis: [C:+2] chart-renderer: probe the service [puppet] - https://gerrit.wikimedia.org/r/1102958 (https://phabricator.wikimedia.org/T372081) (owner: CDanis)
[20:48:55] (PS1) Jasmine: wikikube: decommission 1 host [puppet] - https://gerrit.wikimedia.org/r/1102961
[20:56:40] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1275.eqiad.wmnet with reason: host reimage
[20:59:35] (CR) RLazarus: [C:+1] chart-renderer: enable probedown paging [puppet] - https://gerrit.wikimedia.org/r/1102955 (https://phabricator.wikimedia.org/T372081) (owner: CDanis)
[21:00:03] (CR) CDanis: [C:+2] chart-renderer: enable probedown paging [puppet] - https://gerrit.wikimedia.org/r/1102955 (https://phabricator.wikimedia.org/T372081) (owner: CDanis)
[21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T2100).
[21:00:05] kemayo, danisztls, and cwhite: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:06] o/
[21:00:13] o7
[21:00:15] o/
[21:00:53] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1276.eqiad.wmnet with reason: host reimage
[21:02:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1275.eqiad.wmnet with reason: host reimage
[21:02:19] * MichaelG_WMF is also around
[21:05:31] (PS2) Jasmine: wikikube: decommission 1 host [puppet] - https://gerrit.wikimedia.org/r/1102961 (https://phabricator.wikimedia.org/T375842)
[21:05:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1276.eqiad.wmnet with reason: host reimage
[21:10:39] I can go start pinging some slack channels to see if we can rustle up a deployer, I guess...
[21:13:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:15:28] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: anyone actually here to run the window?
[21:19:28] I can do it
[21:19:58] tgr|away: Awesome, thanks!
[21:20:14] (CR) Gergő Tisza: [C:+2] EditCheck: move checks to a sidebar [extensions/VisualEditor] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1102885 (https://phabricator.wikimedia.org/T341308) (owner: DLynch)
[21:20:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1275.eqiad.wmnet with OS bookworm
[21:21:30] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1102905 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza)
[21:22:28] (CR) CI reject: [V:-1] Reader Survey: Deploy on eswiki, dewiki and frwiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/1102905 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza)
[21:22:32] Sorry about that, it's my first day back from leave and I missed the ping
[21:24:07] (CR) Scott French: [C:+1] "@aotto@wikimedia.org - This LGTM in the "does what it says on the tin" sense" [puppet] - https://gerrit.wikimedia.org/r/1063224 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata)
[21:24:16] RoanKattouw: no worries
[21:24:31] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1102905 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza)
[21:25:04] SRE-Access-Requests, Data-Engineering: Add otto to analytics-admins posix user group - https://phabricator.wikimedia.org/T382099 (Ottomata) NEW
[21:25:05] RECOVERY - BGP status on lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:25:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1276.eqiad.wmnet with OS bookworm
[21:25:18] (Merged) jenkins-bot: Reader Survey: Deploy on eswiki, dewiki and frwiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/1102905 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza)
[21:25:19] (PS1) Ottomata: Add otto to analytics-admins posix user group [puppet] - https://gerrit.wikimedia.org/r/1102970 (https://phabricator.wikimedia.org/T382099)
[21:25:22] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1277.eqiad.wmnet with OS bookworm
[21:25:37] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1102905|Reader Survey: Deploy on eswiki, dewiki and frwiki. (T378660)]]
[21:25:41] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660
[21:25:42] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1277
[21:25:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1277
[21:25:42] SRE-Access-Requests, Data-Engineering, Patch-For-Review: Add otto to analytics-admins posix user group - https://phabricator.wikimedia.org/T382099#10401795 (Ottomata) I should already be in this group. Proceeding to add myself.
[21:29:05] PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:29:47] tgr|away: thanks, looks good
[21:30:11] (CR) CI reject: [V:-1] Add otto to analytics-admins posix user group [puppet] - https://gerrit.wikimedia.org/r/1102970 (https://phabricator.wikimedia.org/T382099) (owner: Ottomata)
[21:32:19] (CR) Ottomata: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1102970 (https://phabricator.wikimedia.org/T382099) (owner: Ottomata)
[21:32:34] !log tgr@deploy2002 dani, tgr: Backport for [[gerrit:1102905|Reader Survey: Deploy on eswiki, dewiki and frwiki. (T378660)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:32:39] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660
[21:32:52] !log bking@gitlab-runner2004 restart docker to troubleshoot missing iptables rules T371994
[21:32:55] !log tgr@deploy2002 dani, tgr: Continuing with sync
[21:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:55] T371994: Deploy the HDFS synchronizer (blunderbuss) service to the dse-k8s cluster - https://phabricator.wikimedia.org/T371994
[21:35:24] !log bking@gitlab-runner2004 restart ferm to troubleshoot missing iptables rules T371994
[21:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:19] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102905|Reader Survey: Deploy on eswiki, dewiki and frwiki. (T378660)]] (duration: 12m 42s)
[21:38:23] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660
[21:39:40] (Merged) jenkins-bot: EditCheck: move checks to a sidebar [extensions/VisualEditor] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1102885 (https://phabricator.wikimedia.org/T341308) (owner: DLynch)
[21:44:49] (CR) Gergő Tisza: [C:+2] change metric types back to counters [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1102959 (https://phabricator.wikimedia.org/T374050) (owner: Cwhite)
[21:45:43] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1102885|EditCheck: move checks to a sidebar (T341308 T379443)]]
[21:45:49] T341308: Present people with multiple reference checks when warranted - https://phabricator.wikimedia.org/T341308
[21:45:49] T379443: Hide Vector 2022 Tools and Appearance menus when Edit Check has the potential to activate - https://phabricator.wikimedia.org/T379443
[21:46:12] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1277.eqiad.wmnet with reason: host reimage
[21:46:18] (CR) Ottomata: [C:+2] Add otto to analytics-admins posix user group [puppet] - https://gerrit.wikimedia.org/r/1102970 (https://phabricator.wikimedia.org/T382099) (owner: Ottomata)
[21:47:09] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Add otto to analytics-admins posix user group - https://phabricator.wikimedia.org/T382099#10401827 (Ottomata) Open→Resolved
[21:49:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1277.eqiad.wmnet with reason: host reimage
[21:51:48] (PS1) Jforrester: Provide a base iamge for Rust 1.63, based on Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1102983 (https://phabricator.wikimedia.org/T380807)
[22:02:16] !log tgr@deploy2002 tgr, kemayo: Backport for [[gerrit:1102885|EditCheck: move checks to a sidebar (T341308 T379443)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:02:25] T341308: Present people with multiple reference checks when warranted - https://phabricator.wikimedia.org/T341308
[22:02:25] T379443: Hide Vector 2022 Tools and Appearance menus when Edit Check has the potential to activate - https://phabricator.wikimedia.org/T379443
[22:03:00] tgr|away: I've checked on deploy2002 and it all looks good.
[22:03:51] !log tgr@deploy2002 tgr, kemayo: Continuing with sync
[22:04:35] (Merged) jenkins-bot: change metric types back to counters [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1102959 (https://phabricator.wikimedia.org/T374050) (owner: Cwhite)
[22:06:38] (CR) Kamila Součková: [C:+1] wikikube: decommission 1 host [puppet] - https://gerrit.wikimedia.org/r/1102961 (https://phabricator.wikimedia.org/T375842) (owner: Jasmine)
[22:08:02] !log bking@cumin2002 sudo cumin A:gitlab-runner 'systemctl restart ferm.service' T371994
[22:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:06] T371994: Deploy the HDFS synchronizer (blunderbuss) service to the dse-k8s cluster - https://phabricator.wikimedia.org/T371994
[22:09:05] RECOVERY - BGP status on lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:09:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1277.eqiad.wmnet with OS bookworm
[22:09:18] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker127[6-7].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[22:09:30] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:14:56] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102885|EditCheck: move checks to a sidebar (T341308 T379443)]] (duration: 29m 12s)
[22:15:01] T341308: Present people with multiple reference checks when warranted - https://phabricator.wikimedia.org/T341308
[22:15:01] T379443: Hide Vector 2022 Tools and Appearance menus when Edit Check has the potential to activate - https://phabricator.wikimedia.org/T379443
[22:16:27] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1102959|change metric types back to counters (T374050)]]
[22:16:30] T374050: Migrate GrowthExperiments.NewcomerTask Module to statslib - https://phabricator.wikimedia.org/T374050
[22:20:26] !log tgr@deploy2002 tgr, cwhite: Backport for [[gerrit:1102959|change metric types back to counters (T374050)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:27:47] cwhite: do you want to check on mwdebug?
[22:28:02] tgr|away: mwscript finished, looks good!
[22:30:20] !log tgr@deploy2002 tgr, cwhite: Continuing with sync
[22:35:37] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102959|change metric types back to counters (T374050)]] (duration: 19m 10s)
[22:35:41] T374050: Migrate GrowthExperiments.NewcomerTask Module to statslib - https://phabricator.wikimedia.org/T374050
[22:35:55] Thank you!
[22:36:21] !log UTC late deploys done
[22:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:31] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[22:41:02] SRE-tools, Infrastructure-Foundations: Add an ownership field to cookbooks. - https://phabricator.wikimedia.org/T379258#10401925 (Volans)
[22:41:19] (PS1) Mstyles: security-landing-page: deploying updates [deployment-charts] - https://gerrit.wikimedia.org/r/1102989 (https://phabricator.wikimedia.org/T381430)
[23:20:13] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[23:25:14] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[23:46:49] RECOVERY - MD RAID on aqs1014 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[23:58:31] RESOLVED: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards