[00:38:10] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1102450
[00:38:10] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1102450 (owner: TrainBranchBot)
[00:56:40] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1102450 (owner: TrainBranchBot)
[01:08:12] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1102457
[01:08:12] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1102457 (owner: TrainBranchBot)
[01:13:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:19:44] <jinxer-wm>	 FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[01:19:44] <jinxer-wm>	 FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[01:26:46] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[01:26:46] <jinxer-wm>	 Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[01:27:26] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1102457 (owner: TrainBranchBot)
[02:09:30] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:09:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10399535 (phaultfinder)
[05:13:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:19:44] <jinxer-wm>	 FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[05:19:44] <jinxer-wm>	 FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[05:26:46] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[05:26:46] <jinxer-wm>	 Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[06:09:30] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:30:07] <icinga-wm>	 PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 13497MiB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[06:31:07] <wikibugs>	 (CR) KartikMistry: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1102283 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T0700)
[07:00:05] <jouncebot>	 marostegui, Amir1, and arnaudb: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T0700).
[07:04:57] <wikibugs>	 SRE: (blank/unknown) Error attempting to create accounts on checkuserwiki - https://phabricator.wikimedia.org/T382048 (JJMC89) NEW
[07:18:31] <wikibugs>	 (PS1) Slyngshede: C:ldap::management default mfa to webauthn [puppet] - https://gerrit.wikimedia.org/r/1102699
[07:21:17] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:22:07] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:37:39] <wikibugs>	 SRE, Bitu, Infrastructure-Foundations: Bitu: Permission request state isn't refreshed if access has been revoked - https://phabricator.wikimedia.org/T382051 (MoritzMuehlenhoff) NEW
[07:44:17] <wikibugs>	 (PS1) Jelto: Rename kubernetes20[36-39] to wikikube-worker20(47|66|85|86) [puppet] - https://gerrit.wikimedia.org/r/1102710 (https://phabricator.wikimedia.org/T379788)
[07:46:26] <wikibugs>	 (CR) Jelto: "The `add_k8s_node.py` script would re-use the wikikube-worker ids from the decommissioning in T379788. I guess we don't want to use the id" [puppet] - https://gerrit.wikimedia.org/r/1102710 (https://phabricator.wikimedia.org/T379788) (owner: Jelto)
[07:46:39] <wikibugs>	 (PS3) Abijeet Patro: Translate: Enable message group subscription for 7 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1102283 (https://phabricator.wikimedia.org/T372386)
[07:51:18] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1102699 (owner: Slyngshede)
[07:52:14] <wikibugs>	 (CR) Slyngshede: [C:+2] C:ldap::management default mfa to webauthn [puppet] - https://gerrit.wikimedia.org/r/1102699 (owner: Slyngshede)
[07:52:55] <moritzm>	 !log installing upx-ucl security updates
[07:52:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:25] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[07:56:28] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T0800).
[08:00:05] <jouncebot>	 abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:05] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[08:00:13] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[08:00:45] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on dbproxy1023.eqiad.wmnet with reason: maintenance
[08:00:51] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[08:00:53] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[08:00:59] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy1023.eqiad.wmnet with reason: maintenance
[08:02:16] <wikibugs>	 (PS1) Marostegui: installserver: Do not reimage es2045 [puppet] - https://gerrit.wikimedia.org/r/1102723
[08:02:27] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[08:02:30] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[08:02:37] <kart_>	 abijeet: here?
[08:03:17] <abijeet>	 kart_, yup
[08:03:41] <kart_>	 I can deploy your patch. Going ahead. It had CI failure early, but recheck fixed it.
[08:04:55] <abijeet>	 kart_, thanks
[08:05:38] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1102283 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[08:06:26] <wikibugs>	 (Merged) jenkins-bot: Translate: Enable message group subscription for 7 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1102283 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[08:07:13] <logmsgbot>	 !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1102283|Translate: Enable message group subscription for 7 wikis (T372386)]]
[08:07:17] <stashbot>	 T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[08:09:58] <wikibugs>	 (CR) Muehlenhoff: Inform users that their permission request have been approved/rejected (4 comments) [software/bitu] - https://gerrit.wikimedia.org/r/1101894 (owner: Slyngshede)
[08:11:19] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on dbproxy1024.eqiad.wmnet with reason: maintenance
[08:11:21] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy1024.eqiad.wmnet with reason: maintenance
[08:12:15] <logmsgbot>	 !log kartik@deploy2002 kartik, abi: Backport for [[gerrit:1102283|Translate: Enable message group subscription for 7 wikis (T372386)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:12:18] <stashbot>	 T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[08:12:35] <wikibugs>	 (PS2) Slyngshede: Inform users that their permission request have been approved/rejected [software/bitu] - https://gerrit.wikimedia.org/r/1101894
[08:12:39] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Do not reimage es2045 [puppet] - https://gerrit.wikimedia.org/r/1102723 (owner: Marostegui)
[08:14:16] <kart_>	 abijeet: Please test!
[08:14:33] <wikibugs>	 SRE: blank 429 error attempting to create accounts on checkuserwiki - https://phabricator.wikimedia.org/T382048#10399776 (JJMC89)
[08:14:42] <abijeet>	 kart_, ok, testing
[08:16:26] <wikibugs>	 SRE: blank 429 error attempting to create accounts on checkuserwiki - https://phabricator.wikimedia.org/T382048#10399777 (JJMC89) We should not be getting a 429 error (with no content) after one (or two) accounts.
[08:18:30] <JJMC89>	 not sure what the right tags for ^ are but assistance with getting the right people to look at it is welcome
[08:19:13] <abijeet>	 kart_, tested on mwdebug2001 server. looks ok.
[08:20:58] <kart_>	 Nice!
[08:21:09] <logmsgbot>	 !log kartik@deploy2002 kartik, abi: Continuing with sync
[08:21:23] <wikibugs>	 (PS1) Brouberol: mw-content-history-reconcile-enrich: add missing s3 configuration keys [deployment-charts] - https://gerrit.wikimedia.org/r/1102729 (https://phabricator.wikimedia.org/T375176)
[08:22:04] <wikibugs>	 (CR) JMeybohm: "Hm...yeah. Maybe wise to wait with reusing those until dc-ops has finished the decom." [puppet] - https://gerrit.wikimedia.org/r/1102710 (https://phabricator.wikimedia.org/T379788) (owner: Jelto)
[08:22:44] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good, two nits inline" [software/bitu] - https://gerrit.wikimedia.org/r/1101894 (owner: Slyngshede)
[08:23:43] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[08:23:46] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[08:26:17] <wikibugs>	 (PS3) Slyngshede: Inform users that their permission request have been approved/rejected [software/bitu] - https://gerrit.wikimedia.org/r/1101894
[08:26:27] <wikibugs>	 (CR) Slyngshede: Inform users that their permission request have been approved/rejected (6 comments) [software/bitu] - https://gerrit.wikimedia.org/r/1101894 (owner: Slyngshede)
[08:27:19] <logmsgbot>	 !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102283|Translate: Enable message group subscription for 7 wikis (T372386)]] (duration: 20m 05s)
[08:27:22] <stashbot>	 T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[08:32:07] <wikibugs>	 (CR) Slyngshede: [C:+2] Inform users that their permission request have been approved/rejected [software/bitu] - https://gerrit.wikimedia.org/r/1101894 (owner: Slyngshede)
[08:37:37] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1101894 (owner: Slyngshede)
[08:39:23] <wikibugs>	 (Merged) jenkins-bot: Inform users that their permission request have been approved/rejected [software/bitu] - https://gerrit.wikimedia.org/r/1101894 (owner: Slyngshede)
[08:43:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on dbproxy1026.eqiad.wmnet with reason: maintenance
[08:43:29] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy1026.eqiad.wmnet with reason: maintenance
[08:54:22] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[08:54:25] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[09:01:13] <wikibugs>	 (CR) Gmodena: [C:+1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1102729 (https://phabricator.wikimedia.org/T375176) (owner: Brouberol)
[09:01:26] <wikibugs>	 (CR) Brouberol: [C:+2] mw-content-history-reconcile-enrich: add missing s3 configuration keys [deployment-charts] - https://gerrit.wikimedia.org/r/1102729 (https://phabricator.wikimedia.org/T375176) (owner: Brouberol)
[09:02:22] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on dbproxy1027.eqiad.wmnet with reason: maintenance
[09:02:35] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy1027.eqiad.wmnet with reason: maintenance
[09:04:56] <wikibugs>	 (PS1) KartikMistry: Enable the Contribute menu in 5th group of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1102733 (https://phabricator.wikimedia.org/T380928)
[09:07:47] <wikibugs>	 (CR) Gmodena: rdf-streaming-updater: add wdqs udpater streams in event stream config (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1099727 (https://phabricator.wikimedia.org/T374919) (owner: DCausse)
[09:13:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[09:17:33] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[09:17:38] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[09:18:48] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[09:18:53] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[09:19:44] <jinxer-wm>	 FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[09:19:44] <jinxer-wm>	 FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[09:22:37] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[09:22:42] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[09:26:46] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[09:26:46] <jinxer-wm>	 Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[09:34:30] <wikibugs>	 (CR) Volans: "General approach LGTM, couple of nits inline. I'll leave the review of the specific k8s logic to your team." [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[09:36:48] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1102213 (owner: Slyngshede)
[09:39:07] <wikibugs>	 (CR) Slyngshede: [C:+2] Release v0.1.4 [software/bitu] - https://gerrit.wikimedia.org/r/1102213 (owner: Slyngshede)
[09:43:08] <wikibugs>	 (Merged) jenkins-bot: Release v0.1.4 [software/bitu] - https://gerrit.wikimedia.org/r/1102213 (owner: Slyngshede)
[09:48:40] <wikibugs>	 (PS1) Brouberol: flink-operator: add mw-content-history-reconcile-enrich-next namespaces to the list of tenant ns [deployment-charts] - https://gerrit.wikimedia.org/r/1102741 (https://phabricator.wikimedia.org/T381322)
[09:59:08] <wikibugs>	 (CR) Gmodena: [C:+1] flink-operator: add mw-content-history-reconcile-enrich-next namespaces to the list of tenant ns [deployment-charts] - https://gerrit.wikimedia.org/r/1102741 (https://phabricator.wikimedia.org/T381322) (owner: Brouberol)
[09:59:13] <wikibugs>	 (CR) Brouberol: [C:+2] flink-operator: add mw-content-history-reconcile-enrich-next namespaces to the list of tenant ns [deployment-charts] - https://gerrit.wikimedia.org/r/1102741 (https://phabricator.wikimedia.org/T381322) (owner: Brouberol)
[09:59:27] <wikibugs>	 SRE, SRE-swift-storage: ms backend hardware refresh for 24/25 - https://phabricator.wikimedia.org/T382056 (MatthewVernon) NEW
[10:00:37] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[10:00:52] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[10:06:46] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[10:06:46] <jinxer-wm>	 Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[10:09:30] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:11:59] <wikibugs>	 (PS1) Elukey: Add maps-master{eqiad,codfw} among the postgres dst nets [puppet] - https://gerrit.wikimedia.org/r/1102744
[10:15:00] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4671/co" [puppet] - https://gerrit.wikimedia.org/r/1102744 (owner: Elukey)
[10:21:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers kubernetes2046.codfw.wmnet, wikikube-worker2170.codfw.wmnet, wikikube-worker2021.codfw.wmnet, wikikube-worker2141.codfw.wmnet, wikikube-worker2174.codfw.wmnet, wikikube-worker2120.codfw.wmnet, wikikube-worker2063.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2172.codfw.wmnet
[10:21:13] <icinga-wm>	 .codfw.wmnet, wikikube-worker2026.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, wikikube-worker2155.codfw.wmnet, kubernetes2052.codfw.wmnet, wikikube-worker2136.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2077.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2132.codfw.wmnet, wikikube-worker2083.codfw.wmnet, wikikube-worker2165.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker20
[10:21:13] <icinga-wm>	 .wmnet, mw2351.codfw.wmnet, wikikube-worker2092.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2157.codfw.wmnet, wikikube-worker2030.codfw.wmnet, mw2352.codfw.wmnet, wikik https://wikitech.wikimedia.org/wiki/PyBal
[10:21:35] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1102744 (owner: Elukey)
[10:22:10] <jinxer-wm>	 FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:22:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker2021.codfw.wmnet, wikikube-worker2079.codfw.wmnet, wikikube-worker2033.codfw.wmnet, wikikube-worker2120.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2063.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2172.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2026.codfw.wmnet, mw2370.codf
[10:22:13] <icinga-wm>	  wikikube-worker2136.codfw.wmnet, wikikube-worker2185.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2076.codfw.wmnet, wikikube-worker2010.codfw.wmnet, wikikube-worker2092.codfw.wmnet, wikikube-worker2007.codfw.wmnet, wikikube-worker2130.codfw.wmnet, wikikube-worker2157.codfw.wmnet, wikikube-worker2030.codfw.wmnet, mw2419.codfw.wmnet, wikikube-worker2097.codfw.wmnet, mw2371.codfw.wmnet, wikikube-worker2002.codfw.wmnet, wikik
[10:22:13] <icinga-wm>	 er2090.codfw.wmnet, kubernetes2039.codfw.wmnet, wikikube-worker2114.codfw.wmnet, wikikube-worker2062.codfw.wmnet, wikikube-worker2164.codfw.wmnet, wikikube-worker2123.codfw.wmnet, wikik https://wikitech.wikimedia.org/wiki/PyBal
[10:22:37] <wikibugs>	 (PS1) Slyngshede: IDM update to Bitu 0.1.4 [dns] - https://gerrit.wikimedia.org/r/1102762
[10:23:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:23:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:24:15] <jinxer-wm>	 RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:26:08] <wikibugs>	 (PS1) MVernon: swift: add new storage hosts [puppet] - https://gerrit.wikimedia.org/r/1102763 (https://phabricator.wikimedia.org/T382056)
[10:26:09] <wikibugs>	 (PS1) MVernon: swift: add new nodes, drain old nodes from the rings [puppet] - https://gerrit.wikimedia.org/r/1102764 (https://phabricator.wikimedia.org/T382056)
[10:26:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers kubernetes2046.codfw.wmnet, wikikube-worker2021.codfw.wmnet, wikikube-worker2141.codfw.wmnet, wikikube-worker2120.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2026.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmn
[10:26:13] <icinga-wm>	 rnetes2052.codfw.wmnet, wikikube-worker2150.codfw.wmnet, wikikube-worker2136.codfw.wmnet, wikikube-worker2185.codfw.wmnet, kubernetes2048.codfw.wmnet, wikikube-worker2076.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2071.codfw.wmnet, wikikube-worker2165.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2022.codfw.wmnet, wikikube-worker2130.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2139.codfw.w
[10:26:13] <icinga-wm>	 2352.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2096.codfw.wmnet, kubernetes2037.codfw.wmnet, wikikube-worker2125.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2359. https://wikitech.wikimedia.org/wiki/PyBal
[10:26:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers wikikube-worker2050.codfw.wmnet, wikikube-worker2174.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2026.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, wikikube-worker2084.codfw.wmnet, wikikube-worker2171.codfw.wmnet, wikikube-worker2076.codfw.wmnet, wikikube-wo
[10:26:13] <icinga-wm>	 .codfw.wmnet, wikikube-worker2132.codfw.wmnet, wikikube-worker2083.codfw.wmnet, wikikube-worker2165.codfw.wmnet, wikikube-worker2177.codfw.wmnet, mw2351.codfw.wmnet, wikikube-worker2092.codfw.wmnet, wikikube-worker2161.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2157.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-wor
[10:26:13] <icinga-wm>	 codfw.wmnet, wikikube-worker2151.codfw.wmnet, wikikube-worker2041.codfw.wmnet, wikikube-worker2159.codfw.wmnet, wikikube-worker2124.codfw.wmnet, wikikube-worker2090.codfw.wmnet, wikikub https://wikitech.wikimedia.org/wiki/PyBal
[10:27:22] <wikibugs>	 (CR) Btullis: [C:+1] "Thanks for the context Jelto. We're happy that this is only a trigger mechanism for deploying artifacts, and we are planning to use authen" [puppet] - https://gerrit.wikimedia.org/r/1101925 (https://phabricator.wikimedia.org/T371994) (owner: Aleksandar Mastilovic)
[10:27:33] <wikibugs>	 (CR) Btullis: [C:+1] Add Blunderbuss firewall rule to GitLab runner set (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1101925 (https://phabricator.wikimedia.org/T371994) (owner: Aleksandar Mastilovic)
[10:27:50] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] [dumps] Increase the lbzip2 thread count for large wikis [puppet] - https://gerrit.wikimedia.org/r/1100498 (https://phabricator.wikimedia.org/T380729) (owner: Btullis)
[10:28:57] <jinxer-wm>	 FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:29:27] <godog>	 checking
[10:29:49] <wikibugs>	 (CR) Btullis: [C:+1] airflow: enable the support of multiple executors [deployment-charts] - https://gerrit.wikimedia.org/r/1102312 (https://phabricator.wikimedia.org/T362788) (owner: Brouberol)
[10:30:25] <godog>	 fabfur: I've acked the page FYI
[10:30:56] <wikibugs>	 (CR) Volans: [C:+1] "LGTM" [software/homer/deploy] - https://gerrit.wikimedia.org/r/1101903 (https://phabricator.wikimedia.org/T381873) (owner: Cathal Mooney)
[10:31:24] <wikibugs>	 (CR) Btullis: [C:+1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/1087575 (https://phabricator.wikimedia.org/T276465) (owner: Dzahn)
[10:31:57] <wikibugs>	 (CR) Btullis: [C:+1] yarn: Restrict access to Envoy port [puppet] - https://gerrit.wikimedia.org/r/1100476 (owner: Muehlenhoff)
[10:32:05] <godog>	 mmhh looks like overload? https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&refresh=1m&from=now-1h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox-video&var-release=main
[10:32:21] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Update JunOS templates to use VRRP priority exposed from Netbox [homer/public] - https://gerrit.wikimedia.org/r/1101861 (https://phabricator.wikimedia.org/T381873) (owner: Cathal Mooney)
[10:33:04] <claime>	 yeah, seems like a bunch of transcodes got scheduled maybe
[10:33:17] <claime>	 hnowlan: you around?
[10:33:36] <claime>	 we could scale it up maybe
[10:33:37] <wikibugs>	 (Merged) jenkins-bot: Update JunOS templates to use VRRP priority exposed from Netbox [homer/public] - https://gerrit.wikimedia.org/r/1101861 (https://phabricator.wikimedia.org/T381873) (owner: Cathal Mooney)
[10:33:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:34:09] <godog>	 claime: might be worth a try yeah, nudge it a little bit
[10:34:15] <jinxer-wm>	 FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:34:30] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[10:34:35] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[10:34:37] <godog>	 shellbox-video ate too much in preparation for xmas
[10:35:04] <godog>	 claime: would you mind doing the honors of scaling up ?
[10:35:12] <claime>	 yeah on it
[10:35:24] <godog>	 thank you
[10:36:26] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[10:36:31] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[10:38:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:38:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:38:34] <wikibugs>	 (PS1) Clément Goubert: shellbox-video: scale up to 60 replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1102767
[10:39:15] <jinxer-wm>	 RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:40:39] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] shellbox-video: scale up to 60 replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1102767 (owner: Clément Goubert)
[10:40:59] <claime>	 what's weird is I'm not seeing a particular uptick in webvideotranscode jobs in jobqueue or in mercurius
[10:41:36] <claime>	 well there's an uptick but doesn't seem like it would warrant a complete saturation like that
[10:41:43] <wikibugs>	 (CR) Clément Goubert: [C:+2] shellbox-video: scale up to 60 replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1102767 (owner: Clément Goubert)
[10:43:01] <godog>	 interesting, maybe a few video whales
[10:43:04] <claime>	 maybe it's just got long-running transcodes
[10:43:13] <wikibugs>	 (Merged) jenkins-bot: shellbox-video: scale up to 60 replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1102767 (owner: Clément Goubert)
[10:43:38] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[10:43:47] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[10:43:56] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[10:44:04] <godog>	 could be too yeah, there was an increase in rps for sure
[10:44:27] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: update Homer wmf-plugin - cmooney@cumin1002
[10:48:39] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: update Homer wmf-plugin - cmooney@cumin1002
[10:49:16] <claime>	 godog: what may happen is that helmfile will consider that the unavailable replicas make the deployment fail
[10:49:30] <hnowlan>	 claime: erk, thanks for handling that
[10:49:35] <hnowlan>	 that.. shouldn't happen
[10:49:36] <claime>	 so I may have to wait for it to roll back, scale up manually, then helmfile apply
[10:49:49] <wikibugs>	 SRE, serviceops: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#10400061 (Krd) Request from <redacted>:1227:f5ff:fec7:7ec1 via cp3070 cp3070, Varnish XID 926533531 Upstream caches: cp3070 int Error: 429, at Thu, 12 Dec 2024 10:46:40 GMT  The same problem appears again and ag...
[10:50:15] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: enable the support of multiple executors [deployment-charts] - https://gerrit.wikimedia.org/r/1102312 (https://phabricator.wikimedia.org/T362788) (owner: Brouberol)
[10:50:51] <claime>	 hnowlan: scaling up basically just made the rps jump up to match the number of new replicas...
[10:51:35] <hnowlan>	 claime: old instances that are tied up will stay tied up so that's not a huge surprise
[10:52:00] <hnowlan>	 RPS doesn't have much of a direct relation to capacity for -video because one request can mean an instance is in use for hours 
[10:52:09] <hnowlan>	 there are free healthy instances, that's all we care about for now
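As a rough illustration of hnowlan's point that RPS says little about capacity for -video, Little's law (jobs in flight = arrival rate x mean duration) can be sketched in a few lines. The function names and the job rates and durations are hypothetical, the replica count of 48 matches the pre-scale-up size seen in this log, and the sketch assumes one request occupies exactly one instance.

```python
def busy_instances(jobs_per_hour: float, mean_job_minutes: float) -> float:
    """Little's law: average jobs in flight = arrival rate x mean duration."""
    return jobs_per_hour * mean_job_minutes / 60


def free_instances(replicas: int, jobs_per_hour: float, mean_job_minutes: float) -> float:
    """Instances left idle on average, assuming one job ties up one instance."""
    return max(0.0, replicas - busy_instances(jobs_per_hour, mean_job_minutes))


# Same (hypothetical) request rate, very different occupancy:
print(free_instances(48, jobs_per_hour=20, mean_job_minutes=6))    # short jobs: 46.0 free
print(free_instances(48, jobs_per_hour=20, mean_job_minutes=180))  # 3-hour jobs: 0.0 free
```

With the rate held constant, only the job duration decides whether any instance stays free, which is why a modest uptick in long transcodes can saturate the pool.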
[10:52:35] <claime>	 I'm not sure why since mercurius reports barely any jobs, and there's not that many jobs being enqueued, but that may still be enough to tie up all instances
[10:52:52] <godog>	 ack thank you, but yeah the probe recovered even before scaling up
[10:53:01] <claime>	 (still waiting on helmfile to fail btw...)
[10:53:03] <godog>	 I guess shellbox-video burped and then moved on
[10:53:14] <hnowlan>	 I'm tuning concurrency in mercurius downwards
[10:53:45] <hnowlan>	 some of this is also related to mercurius losing track of jobs, which I am merging patches for today hopefully 
[10:53:45] <claime>	 hnowlan: ack
[10:54:08] <wikibugs>	 (PS1) Btullis: s.m.a.r.t. - Exclude zram devices from data export [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835)
[10:54:31] <wikibugs>	 (PS1) Hnowlan: mw-videoscaler: lower concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1102784
[10:55:03] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4672/console" [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[10:55:35] <wikibugs>	 (CR) Btullis: s.m.a.r.t. - Exclude zram devices from data export [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[10:56:15] <wikibugs>	 (CR) Hnowlan: [C:+2] mw-videoscaler: lower concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1102784 (owner: Hnowlan)
[10:56:17] <wikibugs>	 (CR) Clément Goubert: [C:+1] mw-videoscaler: lower concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1102784 (owner: Hnowlan)
[10:56:29] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 196, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:57:54] <wikibugs>	 (Merged) jenkins-bot: mw-videoscaler: lower concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1102784 (owner: Hnowlan)
[10:57:59] <claime>	 helmfile is rolling back
[10:58:04] <wikibugs>	 (CR) Filippo Giunchedi: s.m.a.r.t. - Exclude zram devices from data export (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[10:58:31] <claime>	 13m         Normal    ScalingReplicaSet   deployment/shellbox-main              Scaled up replica set shellbox-main-5db67dd6fc to 60
[10:58:32] <claime>	 3m38s       Normal    ScalingReplicaSet   deployment/shellbox-main              Scaled down replica set shellbox-main-5db67dd6fc to 48
[10:58:47] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 277, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:59:29] <godog>	 ouch, yeah default helmfile behaviour is understandable, and suboptimal for sure in this case
[10:59:55] <claime>	 yeah, the "availability" for shellbox-video is a bit of a tricky concept
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1100)
[11:00:21] <godog>	 indeed
[11:00:22] <claime>	 we *want* the replicas to be unavailable while handling a request
[11:00:46] <claime>	 but that makes scaling up through helmfile a problem when overloaded
[11:01:03] <godog>	 tricky alright in this case
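A minimal sketch of why the scale-up timed out, assuming the deploy tool polls until the number of ready replicas reaches the desired count minus an allowed number of unavailable ones, and rolls back if that never happens. This is a simplified model, not helmfile's or Kubernetes' actual code; `first_ready_poll`, `max_unavailable` and the sample readiness counts are hypothetical.

```python
from typing import Optional, Sequence


def first_ready_poll(
    desired: int,
    ready_samples: Sequence[int],
    max_unavailable: int = 0,
) -> Optional[int]:
    """Return the index of the first poll at which enough replicas are ready,
    or None if the threshold is never met (the tool would then roll back)."""
    needed = desired - max_unavailable
    for poll, ready in enumerate(ready_samples):
        if ready >= needed:
            return poll
    return None


# Idle service: new replicas come up and stay ready, so the rollout completes.
print(first_ready_poll(60, [48, 55, 60]))       # 2

# Overloaded service: each replica takes a job and reports unready again,
# so the ready count never reaches the threshold before the timeout.
print(first_ready_poll(60, [5, 9, 12, 10, 8]))  # None
```

Under this model a busy replica and a broken replica look identical to the deploy tool, which is the tension claime describes: readiness is used both to steer traffic and to judge rollout health.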
[11:01:10] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[11:01:52] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[11:01:59] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[11:02:28] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[11:02:35] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[11:03:22] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: update Homer wmf-plugin - cmooney@cumin1002
[11:03:33] <wikibugs>	 (PS2) Btullis: s.m.a.r.t. - Exclude zram devices from data export [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835)
[11:03:44] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[11:03:49] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[11:04:02] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[11:04:13] <wikibugs>	 (CR) CI reject: [V:-1] s.m.a.r.t. - Exclude zram devices from data export [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[11:04:14] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[11:04:21] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4673/console" [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[11:04:56] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: update Homer wmf-plugin - cmooney@cumin1002
[11:06:17] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Expose VRRP group assignment priority to Homer templates [software/homer/deploy] - https://gerrit.wikimedia.org/r/1101903 (https://phabricator.wikimedia.org/T381873) (owner: Cathal Mooney)
[11:07:17] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: update Homer wmf-plugin - cmooney@cumin1002
[11:07:26] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[11:07:30] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[11:07:50] <claime>	 ok I've scaled it up manually, and I've run helmfile apply so it's synced up
[11:08:09] <hnowlan>	 thanks! I've lowered concurrency so once those jobs fail it'll be a lot quieter 
[11:08:22] <godog>	 thank you folks, appreciate it
[11:08:33] <hnowlan>	 I *think* I know what caused this (alongside the concurrency being high) so I will have a fix in place in an hour or two 
[11:08:51] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: update Homer wmf-plugin - cmooney@cumin1002
[11:09:15] <claime>	 hnowlan: the brunt of it seems to have passed, looking at the network rx
[11:09:48] <hnowlan>	 my theory is that mercurius lost track of a bunch of jobs at once, and then retried
[11:09:52] <hnowlan>	 but shellbox was still processing them
[11:09:57] <claime>	 hmh
[11:10:24] <wikibugs>	 (PS3) Hnowlan: base: fix pin on base.meta [deployment-charts] - https://gerrit.wikimedia.org/r/1102307
[11:12:28] <wikibugs>	 (PS3) Btullis: s.m.a.r.t. - Exclude zram devices from data export [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835)
[11:12:37] <wikibugs>	 (CR) Hnowlan: [C:+2] "I've updated this patch to do that - thanks for the review!" [deployment-charts] - https://gerrit.wikimedia.org/r/1102307 (owner: Hnowlan)
[11:13:09] <wikibugs>	 (CR) CI reject: [V:-1] s.m.a.r.t. - Exclude zram devices from data export [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[11:13:15] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4674/console" [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[11:13:45] <wikibugs>	 (Merged) jenkins-bot: base: fix pin on base.meta [deployment-charts] - https://gerrit.wikimedia.org/r/1102307 (owner: Hnowlan)
[11:14:09] <godog>	 going to lunch
[11:15:51] <wikibugs>	 (CR) Hnowlan: [C:+2] mesh.configuration: add tcp_keepalive/idle_timeout to 1.11.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[11:16:51] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@0e18d4f]: Backfill webrequest actor label hourly 2024 12
[11:19:43] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@0e18d4f]: Backfill webrequest actor label hourly 2024 12 (duration: 02m 52s)
[11:22:14] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2186.codfw.wmnet with reason: maintenance
[11:22:28] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2186.codfw.wmnet with reason: maintenance
[11:23:36] <wikibugs>	 (CR) Marostegui: [C:+1] swift: add new storage hosts [puppet] - https://gerrit.wikimedia.org/r/1102763 (https://phabricator.wikimedia.org/T382056) (owner: MVernon)
[11:24:27] <wikibugs>	 (CR) Marostegui: [C:+1] swift: add new nodes, drain old nodes from the rings [puppet] - https://gerrit.wikimedia.org/r/1102764 (https://phabricator.wikimedia.org/T382056) (owner: MVernon)
[11:25:38] <wikibugs>	 (CR) Elukey: [V:+1 C:+2] Add maps-master{eqiad,codfw} among the postgres dst nets [puppet] - https://gerrit.wikimedia.org/r/1102744 (owner: Elukey)
[11:26:49] <wikibugs>	 (PS2) أنون: [enwikinews] & [hewikinews] & [plwikinews]: Upgrade license to CC BY 4.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1101867 (https://phabricator.wikimedia.org/T381421)
[11:27:11] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2186.codfw.wmnet with reason: maintenance
[11:27:13] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2186.codfw.wmnet with reason: maintenance
[11:27:51] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2187.codfw.wmnet with reason: maintenance
[11:28:05] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2187.codfw.wmnet with reason: maintenance
[11:29:12] <wikibugs>	 (Merged) jenkins-bot: mesh.configuration: add tcp_keepalive/idle_timeout to 1.11.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[11:29:53] <wikibugs>	 (CR) MVernon: [C:+2] swift: add new storage hosts [puppet] - https://gerrit.wikimedia.org/r/1102763 (https://phabricator.wikimedia.org/T382056) (owner: MVernon)
[11:30:02] <wikibugs>	 ops-codfw, SRE, DC-Ops, decommission-hardware, and 2 others: Decommission kubernetes20[07-14].codfw.wmnet - https://phabricator.wikimedia.org/T379788#10400168 (Clement_Goubert)
[11:32:22] <wikibugs>	 (PS4) Btullis: s.m.a.r.t. - Exclude zram devices from data export [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835)
[11:32:59] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/admin 'sync'.
[11:33:06] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[11:33:14] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4675/console" [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[11:33:21] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[11:33:37] <wikibugs>	 (CR) Btullis: [V:+1] s.m.a.r.t. - Exclude zram devices from data export (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[11:33:47] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[11:34:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:35:17] <icinga-wm>	 PROBLEM - Host ms-be1084 is DOWN: PING CRITICAL - Packet loss = 100%
[11:35:37] <icinga-wm>	 PROBLEM - Host ms-be1087 is DOWN: PING CRITICAL - Packet loss = 100%
[11:35:59] <icinga-wm>	 PROBLEM - Host ms-be1083 is DOWN: PING CRITICAL - Packet loss = 100%
[11:36:02] <wikibugs>	 (PS6) Hnowlan: mediawiki: use mesh.configuration 1.11 [deployment-charts] - https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701)
[11:36:15] <icinga-wm>	 PROBLEM - Host ms-be1085 is DOWN: PING CRITICAL - Packet loss = 100%
[11:36:15] <icinga-wm>	 PROBLEM - Host ms-be1090 is DOWN: PING CRITICAL - Packet loss = 100%
[11:36:16] <wikibugs>	 (PS4) Elukey: services: use external_services for maps read replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1101897
[11:36:16] <wikibugs>	 (PS10) Elukey: charts: Add kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826)
[11:36:16] <wikibugs>	 (PS6) Elukey: admin_ng: add the kartotherian namespace on Wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/1101487 (https://phabricator.wikimedia.org/T216826)
[11:36:16] <wikibugs>	 (PS7) Elukey: services: add helmfile config for Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826)
[11:36:17] <icinga-wm>	 PROBLEM - Host ms-be1088 is DOWN: PING CRITICAL - Packet loss = 100%
[11:36:27] <icinga-wm>	 PROBLEM - Host ms-be1089 is DOWN: PING CRITICAL - Packet loss = 100%
[11:36:27] <icinga-wm>	 PROBLEM - Host ms-be1086 is DOWN: PING CRITICAL - Packet loss = 100%
[11:36:53] <icinga-wm>	 RECOVERY - Host ms-be1088 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[11:36:55] <icinga-wm>	 RECOVERY - Host ms-be1084 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[11:36:55] <icinga-wm>	 RECOVERY - Host ms-be1086 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[11:36:56] <icinga-wm>	 RECOVERY - Host ms-be1089 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[11:37:13] <icinga-wm>	 RECOVERY - Host ms-be1083 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[11:37:13] <icinga-wm>	 RECOVERY - Host ms-be1090 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[11:37:13] <icinga-wm>	 RECOVERY - Host ms-be1087 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[11:37:13] <icinga-wm>	 RECOVERY - Host ms-be1085 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[11:37:30] <wikibugs>	 (PS1) Slyngshede: P:idm_test adjust idptest-users requirement [puppet] - https://gerrit.wikimedia.org/r/1102793
[11:38:13] <wikibugs>	 (CR) Elukey: "I simplified the change and allowed also the master replica to be contacted, we can review the choice later." [deployment-charts] - https://gerrit.wikimedia.org/r/1101897 (owner: Elukey)
[11:39:36] <wikibugs>	 (CR) Elukey: services: add helmfile config for Kartotherian (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826) (owner: Elukey)
[11:40:43] <icinga-wm>	 PROBLEM - Host ms-be2086 is DOWN: PING CRITICAL - Packet loss = 100%
[11:40:43] <icinga-wm>	 PROBLEM - Host ms-be2084 is DOWN: PING CRITICAL - Packet loss = 100%
[11:40:43] <icinga-wm>	 PROBLEM - Host ms-be2083 is DOWN: PING CRITICAL - Packet loss = 100%
[11:40:43] <wikibugs>	 (CR) Hnowlan: [C:+2] mediawiki: use mesh.configuration 1.11 [deployment-charts] - https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[11:40:45] <icinga-wm>	 PROBLEM - Host ms-be2087 is DOWN: PING CRITICAL - Packet loss = 100%
[11:40:53] <icinga-wm>	 PROBLEM - Host ms-be2085 is DOWN: PING CRITICAL - Packet loss = 100%
[11:40:55] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[11:40:59] <icinga-wm>	 PROBLEM - Host ms-be2081 is DOWN: PING CRITICAL - Packet loss = 100%
[11:40:59] <wikibugs>	 (CR) Hnowlan: [C:+2] mediawiki: use mesh.configuration 1.11 (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[11:41:03] <icinga-wm>	 PROBLEM - Host ms-be2082 is DOWN: PING CRITICAL - Packet loss = 100%
[11:41:08] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1154.eqiad.wmnet with reason: maintenance
[11:41:10] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1154.eqiad.wmnet with reason: maintenance
[11:41:36] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[11:41:41] <icinga-wm>	 RECOVERY - Host ms-be2087 is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms
[11:41:41] <icinga-wm>	 RECOVERY - Host ms-be2085 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms
[11:41:41] <icinga-wm>	 RECOVERY - Host ms-be2084 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms
[11:41:41] <icinga-wm>	 RECOVERY - Host ms-be2082 is UP: PING OK - Packet loss = 0%, RTA = 30.30 ms
[11:41:41] <icinga-wm>	 RECOVERY - Host ms-be2086 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms
[11:41:43] <icinga-wm>	 RECOVERY - Host ms-be2081 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms
[11:41:47] <icinga-wm>	 RECOVERY - Host ms-be2083 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms
[11:42:32] <wikibugs>	 (Merged) jenkins-bot: mediawiki: use mesh.configuration 1.11 [deployment-charts] - https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[11:43:56] <wikibugs>	 (CR) Brouberol: [C:+1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1101897 (owner: Elukey)
[11:44:22] <hnowlan>	 jouncebot: nowandnext
[11:44:22] <jouncebot>	 For the next 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1100)
[11:44:22] <jouncebot>	 In 1 hour(s) and 15 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1300)
[11:44:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:45:26] <hnowlan>	 I'm going to do a sync-world to roll out changes to the mediawiki chart's dependencies in a few minutes, speak now or etc etc 
[11:48:46] <wikibugs>	 (CR) MVernon: [C:+2] swift: add new nodes, drain old nodes from the rings [puppet] - https://gerrit.wikimedia.org/r/1102764 (https://phabricator.wikimedia.org/T382056) (owner: MVernon)
[12:01:57] <elukey>	 Emperor: just checking, are all those up/down notifications expected?
[12:03:29] <logmsgbot>	 !log hnowlan@deploy2002 Started scap sync-world: syncing changes to mediawiki chart vendor dependencies
[12:04:06] <wikibugs>	 (PS1) Giuseppe Lavagetto: Bugfix [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1102794 (https://phabricator.wikimedia.org/T382062)
[12:04:22] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+2 C:+2] Bugfix [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1102794 (https://phabricator.wikimedia.org/T382062) (owner: Giuseppe Lavagetto)
[12:06:22] <logmsgbot>	 !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Bugfixes - oblivian@cumin1002 - T382062"
[12:06:25] <logmsgbot>	 !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes - oblivian@cumin1002 - T382062
[12:06:59] <logmsgbot>	 !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes - oblivian@cumin1002 - T382062
[12:07:00] <logmsgbot>	 !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Bugfixes - oblivian@cumin1002 - T382062"
[12:09:44] <jinxer-wm>	 FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[12:09:44] <jinxer-wm>	 FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[12:10:48] <logmsgbot>	 !log hnowlan@deploy2002 Finished scap sync-world: syncing changes to mediawiki chart vendor dependencies  (duration: 09m 30s)
[12:15:27] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[12:15:32] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[12:15:39] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[12:15:52] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[12:18:35] <wikibugs>	 SRE, serviceops: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#10400297 (Clement_Goubert) Hi,  Sorry for the delay in responding.  Could you describe how you are creating the accounts? Are you using your browser and the `Special:CreateAccount` page, a script, a gadget? This...
[12:18:52] <wikibugs>	 (PS1) Hnowlan: mediawiki: generate names for mercurius jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1102811 (https://phabricator.wikimedia.org/T371700)
[12:20:05] <wikibugs>	 (CR) CI reject: [V:-1] mediawiki: generate names for mercurius jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1102811 (https://phabricator.wikimedia.org/T371700) (owner: Hnowlan)
[12:23:02] <wikibugs>	 (PS1) Máté Szabó: Enable IRS in the Project namespace on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1102813 (https://phabricator.wikimedia.org/T382061)
[12:24:44] <jinxer-wm>	 FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[12:24:44] <jinxer-wm>	 FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[12:26:54] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2024.11.30 - 2024.12.20), Patch-For-Review: Exclude zram devices from disk health checks - https://phabricator.wikimedia.org/T380835#10400317 (BTullis)
[12:29:57] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: Restarting to pick up new JRE for T377938 - btullis@cumin1002 - T377938
[12:30:17] <logmsgbot>	 !log btullis@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: Restarting to pick up new JRE for T377938 - btullis@cumin1002 - T377938
[12:31:12] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: Restarting to pick up new JRE for T377938 - btullis@cumin1002 - T377938
[12:35:37] <wikibugs>	 (CR) Harroyo-wmf: [C:+1] Enable IRS in the Project namespace on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1102813 (https://phabricator.wikimedia.org/T382061) (owner: Máté Szabó)
[12:36:19] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: Restarting to pick up new JRE for T377938 - btullis@cumin1002 - T377938
[12:41:06] <wikibugs>	 SRE, serviceops: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#10400370 (Krd) Special:CreateAccount in a normal browser.
[12:46:14] <wikibugs>	 (CR) Ammarpad: Enable IRS in the Project namespace on ptwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1102813 (https://phabricator.wikimedia.org/T382061) (owner: Máté Szabó)
[12:50:11] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:50:19] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:53:32] <wikibugs>	 (PS1) Hnowlan: mesh.configuration: dummy commit [deployment-charts] - https://gerrit.wikimedia.org/r/1102838
[12:54:07] <wikibugs>	 (CR) Kosta Harlan: [C:+1] Enable IRS in the Project namespace on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1102813 (https://phabricator.wikimedia.org/T382061) (owner: Máté Szabó)
[12:55:02] <mszabo>	 jouncebot: nowandnext
[12:55:02] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 4 minute(s)
[12:55:02] <jouncebot>	 In 0 hour(s) and 4 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1300)
[12:56:30] <wikibugs>	 (PS2) Máté Szabó: Enable IRS in the Project namespace on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1102813 (https://phabricator.wikimedia.org/T382061)
[12:56:34] <wikibugs>	 (CR) Máté Szabó: Enable IRS in the Project namespace on ptwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1102813 (https://phabricator.wikimedia.org/T382061) (owner: Máté Szabó)
[12:59:01] <wikibugs>	 (PS1) Hnowlan: mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1102839 (https://phabricator.wikimedia.org/T371701)
[12:59:11] <wikibugs>	 (CR) CI reject: [V:-1] mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1102839 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1300)
[13:00:47] <wikibugs>	 (CR) Muehlenhoff: [C:+1] P:idm_test adjust idptest-users requirement [puppet] - https://gerrit.wikimedia.org/r/1102793 (owner: Slyngshede)
[13:01:10] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by mszabo@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1102813 (https://phabricator.wikimedia.org/T382061) (owner: Máté Szabó)
[13:02:02] <wikibugs>	 (Merged) jenkins-bot: Enable IRS in the Project namespace on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1102813 (https://phabricator.wikimedia.org/T382061) (owner: Máté Szabó)
[13:02:17] <logmsgbot>	 !log mszabo@deploy2002 Started scap sync-world: Backport for [[gerrit:1102813|Enable IRS in the Project namespace on ptwiki (T382061)]]
[13:02:21] <stashbot>	 T382061: Enable incident reporting on project namespace for ptwiki - https://phabricator.wikimedia.org/T382061
[13:03:10] <wikibugs>	 (PS1) Hnowlan: mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1102841 (https://phabricator.wikimedia.org/T371701)
[13:03:13] <wikibugs>	 (Abandoned) Hnowlan: mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1102839 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[13:05:23] <logmsgbot>	 !log mszabo@deploy2002 mszabo: Backport for [[gerrit:1102813|Enable IRS in the Project namespace on ptwiki (T382061)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:06:09] <wikibugs>	 (PS1) Slyngshede: Notifications: Improve wording in topic [software/bitu] - https://gerrit.wikimedia.org/r/1102843
[13:06:17] <wikibugs>	 (CR) Slyngshede: [C:+2] P:idm_test adjust idptest-users requirement [puppet] - https://gerrit.wikimedia.org/r/1102793 (owner: Slyngshede)
[13:06:32] <logmsgbot>	 !log mszabo@deploy2002 mszabo: Continuing with sync
[13:11:55] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1102843 (owner: Slyngshede)
[13:11:58] <logmsgbot>	 !log mszabo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102813|Enable IRS in the Project namespace on ptwiki (T382061)]] (duration: 09m 41s)
[13:12:02] <stashbot>	 T382061: Enable incident reporting on project namespace for ptwiki - https://phabricator.wikimedia.org/T382061
[13:13:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[13:14:34] <wikibugs>	 (CR) Slyngshede: [C:+2] Notifications: Improve wording in topic [software/bitu] - https://gerrit.wikimedia.org/r/1102843 (owner: Slyngshede)
[13:15:22] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[13:15:32] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc[1013,1017].eqiad.wmnet with reason: maintenance
[13:15:47] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc[1013,1017].eqiad.wmnet with reason: maintenance
[13:15:57] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2127.codfw.wmnet
[13:16:33] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2127.codfw.wmnet
[13:16:57] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "IIRC alerts should be undeployed automatically, easy to check post puppet-run on prometheus hosts whether /srv/alerts/ops/team-o11y_sli.ya" [alerts] - https://gerrit.wikimedia.org/r/1102366 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[13:17:45] <wikibugs>	 (CR) DCausse: rdf-streaming-updater: add wdqs udpater streams in event stream config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1099727 (https://phabricator.wikimedia.org/T374919) (owner: DCausse)
[13:18:10] <wikibugs>	 (PS1) KartikMistry: Update recommendation-api to 2024-12-12-085930-production [deployment-charts] - https://gerrit.wikimedia.org/r/1102845 (https://phabricator.wikimedia.org/T381889)
[13:18:24] <wikibugs>	 (Merged) jenkins-bot: Notifications: Improve wording in topic [software/bitu] - https://gerrit.wikimedia.org/r/1102843 (owner: Slyngshede)
[13:18:27] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2127.codfw.wmnet with OS bookworm
[13:18:46] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2127
[13:18:47] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2127
[13:19:01] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc[2014,2016].codfw.wmnet with reason: maintenance
[13:19:15] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc[2014,2016].codfw.wmnet with reason: maintenance
[13:20:23] <wikibugs>	 SRE, serviceops: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#10400469 (Clement_Goubert) I've tweaked a rate limiting rule, could you please try again?   If possible, could you enable the developer tools in your browser, and get the request headers sent in the `POST` reque...
[13:20:39] <wikibugs>	 (PS3) Volans: cookbook: add owner_team property [software/spicerack] - https://gerrit.wikimedia.org/r/1100773 (https://phabricator.wikimedia.org/T379258)
[13:20:39] <wikibugs>	 (CR) Volans: "ready" [software/spicerack] - https://gerrit.wikimedia.org/r/1100773 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[13:21:08] <wikibugs>	 (CR) Elukey: [C:+2] services: use external_services for maps read replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1101897 (owner: Elukey)
[13:21:56] <wikibugs>	 (CR) Clément Goubert: [C:+1] mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1102841 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[13:22:31] <icinga-wm>	 PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:22:56] <wikibugs>	 (CR) Clément Goubert: mediawiki: generate names for mercurius jobs (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1102811 (https://phabricator.wikimedia.org/T371700) (owner: Hnowlan)
[13:29:49] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync
[13:30:12] <wikibugs>	 (CR) JMeybohm: [WIP, DNM] create sre.k8s.roll-reimage-nodes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[13:32:53] <moritzm>	 !log rebalance Ganeti cluster in codfw/D following server refresh T376594
[13:32:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:57] <stashbot>	 T376594: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594
[13:33:05] <wikibugs>	 SRE, serviceops: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#10400510 (Krd) The account was created in the meantime. I suggest I test this at the next opportunity and report here. Thank you!
[13:36:13] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[13:36:26] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[13:36:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T381532)', diff saved to https://phabricator.wikimedia.org/P71703 and previous config saved to /var/cache/conftool/dbconfig/20241212-133633-marostegui.json
[13:36:40] <stashbot>	 T381532: Fix AntiSpoof database schema drifts in production - https://phabricator.wikimedia.org/T381532
[13:37:10] <jinxer-wm>	 FIRING: ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:38:18] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2127.codfw.wmnet with reason: host reimage
[13:38:55] <wikibugs>	 (CR) Elukey: [C:+1] cookbook: add owner_team property (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1100773 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[13:39:15] <jinxer-wm>	 RESOLVED: ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:39:56] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync
[13:40:29] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10400526 (MoritzMuehlenhoff)
[13:40:40] <wikibugs>	 (PS1) Milimetric: analytics/html: update readme for MW history dump [puppet] - https://gerrit.wikimedia.org/r/1102848 (https://phabricator.wikimedia.org/T381390)
[13:41:12] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2127.codfw.wmnet with reason: host reimage
[13:41:14] <moritzm>	 !log installing Python 3.11 security updates
[13:41:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:47] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] s.m.a.r.t. - Exclude zram devices from data export [puppet] - https://gerrit.wikimedia.org/r/1102783 (https://phabricator.wikimedia.org/T380835) (owner: Btullis)
[13:44:05] <wikibugs>	 (PS1) Muehlenhoff: (Re)assign builder role to build2002 [puppet] - https://gerrit.wikimedia.org/r/1102849 (https://phabricator.wikimedia.org/T379343)
[13:46:18] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons.
[13:46:29] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[13:46:36] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1169.eqiad.wmnet with reason: maintenance
[13:46:39] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1169.eqiad.wmnet with reason: maintenance
[13:47:02] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[13:47:47] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[13:48:01] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[13:48:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71704 and previous config saved to /var/cache/conftool/dbconfig/20241212-134824-root.json
[13:48:35] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync
[13:48:43] <wikibugs>	 (PS1) Muehlenhoff: miscweb: Update Envoy firewall config [puppet] - https://gerrit.wikimedia.org/r/1102850
[13:48:45] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync
[13:52:37] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1102850 (owner: Muehlenhoff)
[13:52:39] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons.
[13:52:55] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync
[13:53:25] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync
[13:53:33] <wikibugs>	 (CR) Muehlenhoff: [C:+2] (Re)assign builder role to build2002 [puppet] - https://gerrit.wikimedia.org/r/1102849 (https://phabricator.wikimedia.org/T379343) (owner: Muehlenhoff)
[13:54:01] <wikibugs>	 (PS4) Volans: cookbook: add owner_team property [software/spicerack] - https://gerrit.wikimedia.org/r/1100773 (https://phabricator.wikimedia.org/T379258)
[13:54:04] <wikibugs>	 (CR) Volans: cookbook: add owner_team property (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1100773 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[13:55:47] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons.
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1400).
[14:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:00:48] <Lucas_WMDE>	 nothing to deploy :)
[14:01:34] <icinga-wm>	 RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:01:38] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2127.codfw.wmnet with OS bookworm
[14:01:57] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons.
[14:02:40] <wikibugs>	 (PS1) Marostegui: site.pp: Update notes for codfw proxies [puppet] - https://gerrit.wikimedia.org/r/1102852 (https://phabricator.wikimedia.org/T381962)
[14:03:00] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync
[14:03:11] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2127.codfw.wmnet
[14:03:13] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2127.codfw.wmnet
[14:03:17] <Emperor>	 elukey: sorry, IRC client hid this, but yes - I rebooted all the new nodes (SOP part of bringing them online)
[14:03:28] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync
[14:03:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71705 and previous config saved to /var/cache/conftool/dbconfig/20241212-140329-root.json
[14:03:54] <wikibugs>	 (CR) Marostegui: [C:+2] site.pp: Update notes for codfw proxies [puppet] - https://gerrit.wikimedia.org/r/1102852 (https://phabricator.wikimedia.org/T381962) (owner: Marostegui)
[14:04:48] <wikibugs>	 SRE, serviceops: HTTP 429 error on VRT wiki trying to create account via Special:CreateAccount - https://phabricator.wikimedia.org/T359901#10400598 (Aklapper)
[14:04:50] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host db1208.eqiad.wmnet
[14:09:04] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2024.11.30 - 2024.12.20), Patch-For-Review: Exclude zram devices from disk health checks - https://phabricator.wikimedia.org/T380835#10400604 (BTullis) Open→Resolved
[14:09:30] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:15:13] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host db1208.eqiad.wmnet
[14:17:50] <wikibugs>	 (PS1) CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - https://gerrit.wikimedia.org/r/1102860
[14:18:29] <wikibugs>	 (CR) CI reject: [V:-1] P:hardware::check: add profile to check HW configuration [puppet] - https://gerrit.wikimedia.org/r/1102860 (owner: CDobbins)
[14:18:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71706 and previous config saved to /var/cache/conftool/dbconfig/20241212-141835-root.json
[14:19:56] <wikibugs>	 (CR) CDobbins: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4676/console" [puppet] - https://gerrit.wikimedia.org/r/1102860 (owner: CDobbins)
[14:21:16] <wikibugs>	 (CR) CDobbins: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4677/console" [puppet] - https://gerrit.wikimedia.org/r/1102860 (owner: CDobbins)
[14:30:08] <Amir1>	 !log ladsgroup@mwmaint2002:~$ foreachwikiindblist all userOptions.php --delete VectorSkinVersion (T54777)
[14:30:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:13] <stashbot>	 T54777: user_properties table bloat - https://phabricator.wikimedia.org/T54777
[14:30:22] <wikibugs>	 (CR) Gmodena: [C:+1] rdf-streaming-updater: add wdqs udpater streams in event stream config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1099727 (https://phabricator.wikimedia.org/T374919) (owner: DCausse)
[14:31:18] <wikibugs>	 (CR) AOkoth: [C:+1] "Looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/1097556 (https://phabricator.wikimedia.org/T380009) (owner: EoghanGaffney)
[14:33:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71707 and previous config saved to /var/cache/conftool/dbconfig/20241212-143340-root.json
[14:39:53] <wikibugs>	 (CR) Hnowlan: [C:+2] mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1102841 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[14:40:00] <wikibugs>	 (CR) Hnowlan: mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1102841 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[14:40:20] <wikibugs>	 (CR) Hnowlan: [C:+2] mesh.configuration: dummy commit [deployment-charts] - https://gerrit.wikimedia.org/r/1102838 (owner: Hnowlan)
[14:41:34] <wikibugs>	 (Merged) jenkins-bot: mesh.configuration: dummy commit [deployment-charts] - https://gerrit.wikimedia.org/r/1102838 (owner: Hnowlan)
[14:41:55] <wikibugs>	 (PS2) Hnowlan: mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1102841 (https://phabricator.wikimedia.org/T371701)
[14:45:19] <wikibugs>	 (PS2) Hnowlan: mediawiki: generate names for mercurius jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1102811 (https://phabricator.wikimedia.org/T371700)
[14:45:27] <wikibugs>	 (CR) Hnowlan: [C:+2] mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1102841 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[14:47:03] <wikibugs>	 (PS11) Elukey: charts: Add kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826)
[14:47:09] <wikibugs>	 (PS7) Elukey: admin_ng: add the kartotherian namespace on Wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/1101487 (https://phabricator.wikimedia.org/T216826)
[14:47:13] <wikibugs>	 (PS8) Elukey: services: add helmfile config for Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826)
[14:47:24] <wikibugs>	 (Merged) jenkins-bot: mediawiki: use mesh.configuration 1.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1102841 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[14:48:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71708 and previous config saved to /var/cache/conftool/dbconfig/20241212-144846-root.json
[14:53:08] <wikibugs>	 (CR) Eevans: [C:+2] aqs: Upgrade Cassandra to 4.1.7 [puppet] - https://gerrit.wikimedia.org/r/1102377 (https://phabricator.wikimedia.org/T380420) (owner: Eevans)
[14:53:55] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[14:54:00] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[14:54:14] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[14:54:27] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[14:54:29] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[14:54:35] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[14:55:37] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002
[14:55:41] <stashbot>	 T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420
[14:57:52] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[14:57:55] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[14:58:04] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[14:58:07] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[14:58:26] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[14:58:31] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[15:02:11] <wikibugs>	 (PS1) Ladsgroup: Add tigwiki to pre-install [mediawiki-config] - https://gerrit.wikimedia.org/r/1102872 (https://phabricator.wikimedia.org/T381377)
[15:02:31] <wikibugs>	 (CR) Sbisson: [C:+1] Update recommendation-api to 2024-12-12-085930-production [deployment-charts] - https://gerrit.wikimedia.org/r/1102845 (https://phabricator.wikimedia.org/T381889) (owner: KartikMistry)
[15:02:35] <Amir1>	 jouncebot: nowandnext
[15:02:35] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 57 minute(s)
[15:02:35] <jouncebot>	 In 1 hour(s) and 57 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1700)
[15:02:36] <wikibugs>	 (CR) JMeybohm: [C:+1] services: add helmfile config for Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826) (owner: Elukey)
[15:02:41] <Amir1>	 cool
[15:03:15] <wikibugs>	 (CR) JMeybohm: [C:+1] charts: Add kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) (owner: Elukey)
[15:03:28] <logmsgbot>	 !log eevans@cumin1002 END (ERROR) - Cookbook sre.cassandra.roll-restart (exit_code=97) for nodes matching A:aqs-codfw: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002
[15:03:32] <stashbot>	 T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420
[15:04:44] <jinxer-wm>	 RESOLVED: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[15:04:44] <jinxer-wm>	 RESOLVED: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[15:05:13] <jinxer-wm>	 FIRING: IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[15:05:14] <jinxer-wm>	 FIRING: IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[15:06:18] <wikibugs>	 (CR) Elukey: [C:+2] charts: Add kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) (owner: Elukey)
[15:06:33] <wikibugs>	 (CR) Elukey: [C:+2] charts: Add kartotherian (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) (owner: Elukey)
[15:07:05] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1102872 (https://phabricator.wikimedia.org/T381377) (owner: Ladsgroup)
[15:07:35] <wikibugs>	 (Merged) jenkins-bot: charts: Add kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) (owner: Elukey)
[15:07:50] <wikibugs>	 (Merged) jenkins-bot: Add tigwiki to pre-install [mediawiki-config] - https://gerrit.wikimedia.org/r/1102872 (https://phabricator.wikimedia.org/T381377) (owner: Ladsgroup)
[15:08:05] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1102872|Add tigwiki to pre-install (T381377)]]
[15:08:10] <stashbot>	 T381377: Create Wikipedia Tigre - https://phabricator.wikimedia.org/T381377
[15:09:11] <logmsgbot>	 !log bking@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: 0.3.150
[15:09:17] <logmsgbot>	 !log bking@deploy2002 Finished deploy [wdqs/wdqs@9927a5a]: 0.3.150 (duration: 00m 05s)
[15:09:17] <wikibugs>	 (CR) Elukey: [C:+2] admin_ng: add the kartotherian namespace on Wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/1101487 (https://phabricator.wikimedia.org/T216826) (owner: Elukey)
[15:09:25] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002
[15:09:27] <wikibugs>	 (CR) Elukey: [C:+2] services: add helmfile config for Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826) (owner: Elukey)
[15:09:29] <stashbot>	 T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420
[15:10:03] <wikibugs>	 (PS1) Bking: wdqs1025: add as dsh target [puppet] - https://gerrit.wikimedia.org/r/1102874 (https://phabricator.wikimedia.org/T376150)
[15:11:38] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1102872|Add tigwiki to pre-install (T381377)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:12:06] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[15:15:59] <wikibugs>	 (CR) KartikMistry: [C:+2] Update recommendation-api to 2024-12-12-085930-production [deployment-charts] - https://gerrit.wikimedia.org/r/1102845 (https://phabricator.wikimedia.org/T381889) (owner: KartikMistry)
[15:16:04] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1025:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[15:16:26] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1025.eqiad.wmnet with reason: T376150
[15:16:29] <wikibugs>	 (CR) JHathaway: [C:+2] vrts: Update mail alias generation script to bail on too many changes [puppet] - https://gerrit.wikimedia.org/r/1097556 (https://phabricator.wikimedia.org/T380009) (owner: EoghanGaffney)
[15:16:30] <stashbot>	 T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150
[15:16:42] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1025.eqiad.wmnet with reason: T376150
[15:16:52] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[15:17:07] <wikibugs>	 (Merged) jenkins-bot: Update recommendation-api to 2024-12-12-085930-production [deployment-charts] - https://gerrit.wikimedia.org/r/1102845 (https://phabricator.wikimedia.org/T381889) (owner: KartikMistry)
[15:17:41] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102872|Add tigwiki to pre-install (T381377)]] (duration: 09m 35s)
[15:17:44] <stashbot>	 T381377: Create Wikipedia Tigre - https://phabricator.wikimedia.org/T381377
[15:18:04] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[15:18:17] <wikibugs>	 (CR) CDanis: [C:+1] wdqs1025: add as dsh target [puppet] - https://gerrit.wikimedia.org/r/1102874 (https://phabricator.wikimedia.org/T376150) (owner: Bking)
[15:18:24] <wikibugs>	 (CR) JHathaway: [C:+1] wdqs1025: add as dsh target [puppet] - https://gerrit.wikimedia.org/r/1102874 (https://phabricator.wikimedia.org/T376150) (owner: Bking)
[15:19:02] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[15:19:31] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[15:19:38] <wikibugs>	 (CR) Bking: [C:+2] wdqs1025: add as dsh target [puppet] - https://gerrit.wikimedia.org/r/1102874 (https://phabricator.wikimedia.org/T376150) (owner: Bking)
[15:20:13] <jinxer-wm>	 FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[15:22:35] <wikibugs>	 (PS1) Ladsgroup: Activate tigwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1102879 (https://phabricator.wikimedia.org/T381377)
[15:22:50] <wikibugs>	 (PS1) Hnowlan: Revert "shellbox-video: scale up to 60 replicas" [deployment-charts] - https://gerrit.wikimedia.org/r/1102881
[15:22:51] <wikibugs>	 (PS1) Volans: admin: add myself to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1102880
[15:23:44] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1102879 (https://phabricator.wikimedia.org/T381377) (owner: Ladsgroup)
[15:24:01] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[15:24:29] <wikibugs>	 (Merged) jenkins-bot: Activate tigwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1102879 (https://phabricator.wikimedia.org/T381377) (owner: Ladsgroup)
[15:24:45] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1102879|Activate tigwiki (T381377)]]
[15:24:49] <stashbot>	 T381377: Create Wikipedia Tigre - https://phabricator.wikimedia.org/T381377
[15:24:59] <jinxer-wm>	 FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[15:25:36] <wikibugs>	 (PS1) Muehlenhoff: Set profile::docker::builder::docker_pkg: false for build2002 [puppet] - https://gerrit.wikimedia.org/r/1102883 (https://phabricator.wikimedia.org/T379343)
[15:27:32] <wikibugs>	 (PS2) Muehlenhoff: Set profile::docker::builder::docker_pkg: false for build2002 [puppet] - https://gerrit.wikimedia.org/r/1102883 (https://phabricator.wikimedia.org/T379343)
[15:28:00] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1102879|Activate tigwiki (T381377)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:28:38] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[15:29:18] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[15:30:10] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good, this needs no further approval and you can self-merge" [puppet] - https://gerrit.wikimedia.org/r/1102880 (owner: Volans)
[15:30:34] <wikibugs>	 (CR) Volans: [C:+2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1102880 (owner: Volans)
[15:32:44] <wikibugs>	 (PS1) Klausman: sre/ores: remove obsolete ORES cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1102886 (https://phabricator.wikimedia.org/T379259)
[15:34:05] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[15:34:10] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102879|Activate tigwiki (T381377)]] (duration: 09m 25s)
[15:34:14] <stashbot>	 T381377: Create Wikipedia Tigre - https://phabricator.wikimedia.org/T381377
[15:39:40] <wikibugs>	 (PS1) Volans: sre.hosts.upgrade-and-reboot: remove cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1102887 (https://phabricator.wikimedia.org/T379259)
[15:39:45] <wikibugs>	 (PS1) Muehlenhoff: sre.misc-clusters.thumbor: Remove obsolete cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1102888
[15:40:20] <wikibugs>	 (CR) Hnowlan: [C:+1] sre.misc-clusters.thumbor: Remove obsolete cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1102888 (owner: Muehlenhoff)
[15:40:30] <wikibugs>	 (CR) Volans: "Proposing the removal of this old and half-baked cookbook. Adding the last known users to check if there is any objection." [cookbooks] - https://gerrit.wikimedia.org/r/1102887 (https://phabricator.wikimedia.org/T379259) (owner: Volans)
[15:40:46] <wikibugs>	 (PS1) Elukey: services: fix helmfile config for kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1102889
[15:40:59] <wikibugs>	 (CR) Xcollazo: [C:+1] analytics/html: update readme for MW history dump [puppet] - https://gerrit.wikimedia.org/r/1102848 (https://phabricator.wikimedia.org/T381390) (owner: Milimetric)
[15:43:06] <logmsgbot>	 !log bking@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: 0.3.150
[15:43:19] <logmsgbot>	 !log bking@deploy2002 Finished deploy [wdqs/wdqs@9927a5a]: 0.3.150 (duration: 00m 13s)
[15:43:31] <wikibugs>	 (CR) Volans: [C:+1] "LGTM, thx" [cookbooks] - https://gerrit.wikimedia.org/r/1102886 (https://phabricator.wikimedia.org/T379259) (owner: Klausman)
[15:43:36] <wikibugs>	 (PS9) Kamila Součková: [WIP, DNM] create sre.k8s.roll-reimage-nodes [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857)
[15:44:14] <wikibugs>	 (CR) Kamila Součková: [WIP, DNM] create sre.k8s.roll-reimage-nodes (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[15:46:41] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Good to go IMO" [cookbooks] - https://gerrit.wikimedia.org/r/1102887 (https://phabricator.wikimedia.org/T379259) (owner: Volans)
[15:46:49] <wikibugs>	 (CR) Muehlenhoff: [C:+2] sre.misc-clusters.thumbor: Remove obsolete cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1102888 (owner: Muehlenhoff)
[15:47:40] <wikibugs>	 (CR) Volans: [C:+1] "LGTM, you can test it with the test-cookbook both in dry-run and real runs (I suggest to leave it log to SAL for awareness)." [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[15:48:48] <wikibugs>	 (CR) Muehlenhoff: [C:+2] "Ran the steps from https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Renaming/Deleting_a_cookbook post merge" [cookbooks] - https://gerrit.wikimedia.org/r/1102888 (owner: Muehlenhoff)
[15:49:13] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Set profile::docker::builder::docker_pkg: false for build2002 [puppet] - https://gerrit.wikimedia.org/r/1102883 (https://phabricator.wikimedia.org/T379343) (owner: Muehlenhoff)
[15:51:21] <wikibugs>	 (CR) Elukey: [C:+2] services: fix helmfile config for kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1102889 (owner: Elukey)
[15:56:31] <wikibugs>	 (PS2) DLynch: EditCheck: move checks to a sidebar [extensions/VisualEditor] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1102885 (https://phabricator.wikimedia.org/T341308)
[15:56:36] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[15:59:18] <wikibugs>	 (CR) Clément Goubert: [C:+1] Revert "shellbox-video: scale up to 60 replicas" [deployment-charts] - https://gerrit.wikimedia.org/r/1102881 (owner: Hnowlan)
[15:59:34] <wikibugs>	 SRE, serviceops: blank 429 error attempting to create accounts on checkuserwiki - https://phabricator.wikimedia.org/T382048#10400945 (JJMC89) Similar to {T359901}, which appears to be caused by the mitigation for {T341908}.
[15:59:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10400950 (phaultfinder)
[16:00:06] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1102885 (https://phabricator.wikimedia.org/T341308) (owner: DLynch)
[16:04:03] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] Revert "shellbox-video: scale up to 60 replicas" [deployment-charts] - https://gerrit.wikimedia.org/r/1102881 (owner: Hnowlan)
[16:04:25] <wikibugs>	 (CR) Hnowlan: [C:+2] Revert "shellbox-video: scale up to 60 replicas" [deployment-charts] - https://gerrit.wikimedia.org/r/1102881 (owner: Hnowlan)
[16:05:36] <wikibugs>	 (Merged) jenkins-bot: Revert "shellbox-video: scale up to 60 replicas" [deployment-charts] - https://gerrit.wikimedia.org/r/1102881 (owner: Hnowlan)
[16:06:19] <wikibugs>	 (PS10) Kamila Součková: [WIP, DNM] create sre.k8s.roll-reimage-nodes [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857)
[16:06:40] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[16:06:43] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1270-1275].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[16:08:26] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1270.eqiad.wmnet with OS bookworm
[16:08:35] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1270
[16:08:35] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1270
[16:09:52] <wikibugs>	 (CR) Herron: [C:+2] "perfect thanks!" [alerts] - https://gerrit.wikimedia.org/r/1102366 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[16:13:20] <wikibugs>	 (CR) Dzahn: [C:+2] admin: add group approvers for druid-admins and htmldumps-admin [puppet] - https://gerrit.wikimedia.org/r/1087575 (https://phabricator.wikimedia.org/T276465) (owner: Dzahn)
[16:13:40] <wikibugs>	 (PS1) Elukey: charts: update kartotherian's entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/1102894
[16:14:03] <wikibugs>	 (CR) Dzahn: [C:+2] miscweb: Update Envoy firewall config [puppet] - https://gerrit.wikimedia.org/r/1102850 (owner: Muehlenhoff)
[16:16:39] <wikibugs>	 (CR) Elukey: [C:+2] charts: update kartotherian's entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/1102894 (owner: Elukey)
[16:17:11] <wikibugs>	 (PS1) Volans: tests: add test for the ownership field [cookbooks] - https://gerrit.wikimedia.org/r/1102896 (https://phabricator.wikimedia.org/T379258)
[16:17:17] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[16:18:19] <wikibugs>	 (CR) CI reject: [V:-1] EditCheck: move checks to a sidebar [extensions/VisualEditor] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1102885 (https://phabricator.wikimedia.org/T341308) (owner: DLynch)
[16:19:44] <wikibugs>	 (CR) Volans: [C:+2] cookbook: add owner_team property [software/spicerack] - https://gerrit.wikimedia.org/r/1100773 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[16:19:55] <wikibugs>	 (PS3) DLynch: EditCheck: move checks to a sidebar [extensions/VisualEditor] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1102885 (https://phabricator.wikimedia.org/T341308)
[16:19:58] <wikibugs>	 (CR) Dzahn: [C:+2] "yep, https://commons-query.wikimedia.org/ still working" [puppet] - https://gerrit.wikimedia.org/r/1102850 (owner: Muehlenhoff)
[16:20:37] <wikibugs>	 (CR) Elukey: [C:+1] tests: add test for the ownership field [cookbooks] - https://gerrit.wikimedia.org/r/1102896 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[16:23:00] <wikibugs>	 (PS1) Hnowlan: Revert "mw-videoscaler: lower concurrency" [deployment-charts] - https://gerrit.wikimedia.org/r/1102898
[16:27:21] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[16:27:41] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s3 #page on db2149 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table templatelinks is corrupt: try to repair it on query. Default database: aswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:27:42] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s3 #page on db2194 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table templatelinks is corrupt: try to repair it on query. Default database: aswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:27:51] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s3 #page on db1175 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table templatelinks is corrupt: try to repair it on query. Default database: aswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:27:55] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s3 #page on db2205 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table templatelinks is corrupt: try to repair it on query. Default database: aswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:28:05] <herron>	 o_O
[16:28:08] <herron>	 !incidents
[16:28:09] <sirenbot>	 5537 (UNACKED)  db2149 (paged)/MariaDB Replica SQL: s3 (paged)
[16:28:09] <sirenbot>	 5538 (UNACKED)  db2194 (paged)/MariaDB Replica SQL: s3 (paged)
[16:28:09] <sirenbot>	 5539 (UNACKED)  db1175 (paged)/MariaDB Replica SQL: s3 (paged)
[16:28:10] <sirenbot>	 5540 (UNACKED)  db2205 (paged)/MariaDB Replica SQL: s3 (paged)
[16:28:10] <sirenbot>	 5536 (RESOLVED)  ProbeDown sre (10.2.1.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 codfw)
[16:28:18] <Amir1>	 sigh
[16:28:22] <Amir1>	 this is fun
[16:28:30] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1270.eqiad.wmnet with reason: host reimage
[16:28:33] <herron>	 !ack 5537 5538
[16:28:33] <sirenbot>	 Could not ack the alert. Please check the parameters.
[16:28:35] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s3 #page on db2177 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table templatelinks is corrupt: try to repair it on query. Default database: aswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:28:39] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s3 #page on db2190 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table templatelinks is corrupt: try to repair it on query. Default database: aswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:28:52] <herron>	 !ack 5537
[16:28:52] <sirenbot>	 5537 (ACKED)  db2149 (paged)/MariaDB Replica SQL: s3 (paged)
[16:28:54] <herron>	 !ack 5538
[16:28:54] <sirenbot>	 5538 (ACKED)  db2194 (paged)/MariaDB Replica SQL: s3 (paged)
[16:28:56] <herron>	 !ack 5539
[16:28:57] <sirenbot>	 5539 (ACKED)  db1175 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:00] <herron>	 !ack 5540
[16:29:00] <sirenbot>	 5540 (ACKED)  db2205 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:05] <Amir1>	 give me a bit
[16:29:06] <herron>	 !incidents
[16:29:07] <Amir1>	 I fix this
[16:29:07] <sirenbot>	 5537 (ACKED)  db2149 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:07] <sirenbot>	 5538 (ACKED)  db2194 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:07] <sirenbot>	 5539 (ACKED)  db1175 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:07] <sirenbot>	 5540 (ACKED)  db2205 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:08] <sirenbot>	 5541 (UNACKED)  db2177 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:08] <sirenbot>	 5542 (UNACKED)  db2190 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:08] <sirenbot>	 5536 (RESOLVED)  ProbeDown sre (10.2.1.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 codfw)
[16:29:24] <herron>	 Amir1: thanks!
[16:29:34] <herron>	 !ack 5541
[16:29:35] <sirenbot>	 5541 (ACKED)  db2177 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:35] <herron>	 !ack 5542
[16:29:35] <sirenbot>	 5542 (ACKED)  db2190 (paged)/MariaDB Replica SQL: s3 (paged)
[16:29:47] <arnoldokoth>	 Thanks Amir.
[16:30:45] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s3 #page on db2149 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:31:16] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1270.eqiad.wmnet with reason: host reimage
[16:31:36] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s3 #page on db2177 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:31:40] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s3 #page on db2190 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:31:45] <wikibugs>	 (Merged) jenkins-bot: cookbook: add owner_team property [software/spicerack] - https://gerrit.wikimedia.org/r/1100773 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[16:32:45] <herron>	 !incidents
[16:32:45] <Amir1>	 all fixed
[16:32:45] <sirenbot>	 5538 (ACKED)  db2194 (paged)/MariaDB Replica SQL: s3 (paged)
[16:32:45] <sirenbot>	 5539 (ACKED)  db1175 (paged)/MariaDB Replica SQL: s3 (paged)
[16:32:46] <sirenbot>	 5540 (ACKED)  db2205 (paged)/MariaDB Replica SQL: s3 (paged)
[16:32:46] <sirenbot>	 5542 (RESOLVED)  db2190 (paged)/MariaDB Replica SQL: s3 (paged)
[16:32:46] <sirenbot>	 5541 (RESOLVED)  db2177 (paged)/MariaDB Replica SQL: s3 (paged)
[16:32:46] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s3 #page on db2194 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:32:46] <sirenbot>	 5537 (RESOLVED)  db2149 (paged)/MariaDB Replica SQL: s3 (paged)
[16:32:46] <sirenbot>	 5536 (RESOLVED)  ProbeDown sre (10.2.1.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 codfw)
[16:32:56] <herron>	 thanks much Amir1!
[16:34:33] <wikibugs>	 (CR) Hnowlan: [C:+2] Revert "mw-videoscaler: lower concurrency" [deployment-charts] - https://gerrit.wikimedia.org/r/1102898 (owner: Hnowlan)
[16:35:33] <wikibugs>	 (Merged) jenkins-bot: Revert "mw-videoscaler: lower concurrency" [deployment-charts] - https://gerrit.wikimedia.org/r/1102898 (owner: Hnowlan)
[16:36:05] <wikibugs>	 (CR) Volans: [C:+2] tests: add test for the ownership field [cookbooks] - https://gerrit.wikimedia.org/r/1102896 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[16:37:38] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s3 #page on db1175 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:37:44] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002
[16:37:48] <stashbot>	 T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420
[16:39:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.restart
[16:39:13] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[16:39:26] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.restart
[16:41:41] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[16:42:59] <wikibugs>	 (PS3) Hnowlan: mediawiki: generate names for mercurius jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1102811 (https://phabricator.wikimedia.org/T371700)
[16:43:09] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[16:44:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.restart
[16:45:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:45:51] <wikibugs>	 (Merged) jenkins-bot: tests: add test for the ownership field [cookbooks] - https://gerrit.wikimedia.org/r/1102896 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[16:46:33] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10401062 (LSobanski) @Platonides could you help us verify this?
[16:47:11] <wikibugs>	 (CR) Hnowlan: [C:+2] mediawiki: generate names for mercurius jobs (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1102811 (https://phabricator.wikimedia.org/T371700) (owner: Hnowlan)
[16:47:40] <wikibugs>	 (PS1) Elukey: charts: add volume mount for /etc to Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1102901
[16:49:01] <wikibugs>	 (Merged) jenkins-bot: mediawiki: generate names for mercurius jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1102811 (https://phabricator.wikimedia.org/T371700) (owner: Hnowlan)
[16:49:15] <wikibugs>	 SRE, serviceops: HTTP 429 error on private wikis trying to create account via Special:CreateAccount - https://phabricator.wikimedia.org/T359901#10401070 (Clement_Goubert) p:Triage→High
[16:49:44] <wikibugs>	 SRE, serviceops: HTTP 429 error on private wikis trying to create account via Special:CreateAccount - https://phabricator.wikimedia.org/T359901#10401073 (Clement_Goubert)
[16:49:45] <wikibugs>	 SRE, serviceops: blank 429 error attempting to create accounts on checkuserwiki - https://phabricator.wikimedia.org/T382048#10401076 (Clement_Goubert) →Duplicate dup:T359901
[16:50:11] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1270.eqiad.wmnet with OS bookworm
[16:50:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:51:45] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[16:51:54] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1271.eqiad.wmnet with OS bookworm
[16:52:02] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1271
[16:52:02] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1271
[16:52:10] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[16:52:14] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[16:52:47] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002
[16:52:51] <stashbot>	 T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420
[16:53:00] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: sync
[16:53:03] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: sync
[16:53:26] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: sync
[16:53:29] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: sync
[16:55:57] <wikibugs>	 (PS1) Hnowlan: mediawiki: revert generateName behaviour in mercurius job [deployment-charts] - https://gerrit.wikimedia.org/r/1102902 (https://phabricator.wikimedia.org/T371700)
[16:56:19] <wikibugs>	 ops-codfw, SRE, DC-Ops, decommission-hardware, and 2 others: Decommission kubernetes20[07-14].codfw.wmnet - https://phabricator.wikimedia.org/T379788#10401093 (Jhancock.wm)
[16:57:10] <wikibugs>	 SRE, serviceops: HTTP 429 error on private wikis trying to create account via Special:CreateAccount - https://phabricator.wikimedia.org/T359901#10401106 (Clement_Goubert) >>! In T359901#10400510, @Krd wrote: > The account was created in the meantime. I suggest I test this at the next opportunity and repo...
[16:57:21] <wikibugs>	 SRE, serviceops: HTTP 429 error on private wikis trying to create account via Special:CreateAccount - https://phabricator.wikimedia.org/T359901#10401108 (Clement_Goubert) Open→In progress
[16:57:42] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s3 #page on db2205 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:58:38] <wikibugs>	 (PS2) Elukey: charts: add volume mount for /etc to Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1102901
[16:58:55] <wikibugs>	 (CR) Clément Goubert: [C:+1] mediawiki: revert generateName behaviour in mercurius job [deployment-charts] - https://gerrit.wikimedia.org/r/1102902 (https://phabricator.wikimedia.org/T371700) (owner: Hnowlan)
[16:59:51] <wikibugs>	 (PS2) Hnowlan: mediawiki: use release name to create unique jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1102902 (https://phabricator.wikimedia.org/T371700)
[17:00:05] <jouncebot>	 jhathaway and rzl: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1700).
[17:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:02:39] <wikibugs>	 (CR) Clément Goubert: [C:+1] mediawiki: use release name to create unique jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1102902 (https://phabricator.wikimedia.org/T371700) (owner: Hnowlan)
[17:03:21] <wikibugs>	 (CR) Hnowlan: [C:+2] mediawiki: use release name to create unique jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1102902 (https://phabricator.wikimedia.org/T371700) (owner: Hnowlan)
[17:05:03] <wikibugs>	 (Merged) jenkins-bot: mediawiki: use release name to create unique jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1102902 (https://phabricator.wikimedia.org/T371700) (owner: Hnowlan)
[17:06:43] <wikibugs>	 (CR) Elukey: [C:+2] charts: add volume mount for /etc to Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1102901 (owner: Elukey)
[17:08:41] <wikibugs>	 (PS1) Bvibber: Update chart-renderer service [deployment-charts] - https://gerrit.wikimedia.org/r/1102903 (https://phabricator.wikimedia.org/T382039)
[17:11:07] <bvibber>	 anybody doing any infrastructure/service deploys during this hour before the mw backport window?
[17:11:16] <bvibber>	 if not i'll go ahead and deploy that service update for chart-renderer
[17:11:23] <wikibugs>	 (PS1) Hnowlan: mediawiki: truncate mercurius job names at 63 chars [deployment-charts] - https://gerrit.wikimedia.org/r/1102904 (https://phabricator.wikimedia.org/T371700)
[17:11:56] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1271.eqiad.wmnet with reason: host reimage
[17:12:07] <mutante>	 bvibber: at least the part that is on the deployment calendar, no, nothing. puppet window is empty
[17:12:10] <jinxer-wm>	 FIRING: ProbeDown: Service aqs1012-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1012-b:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:12:26] <bvibber>	 spiffy :D
[17:13:29] <wikibugs>	 (PS1) DDesouza: Reader Survey: Deploy on eswiki, dewiki and frwiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/1102905 (https://phabricator.wikimedia.org/T378660)
[17:13:38] <mutante>	 jouncebot: nowandnext
[17:13:38] <jouncebot>	 For the next 0 hour(s) and 46 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1700)
[17:13:38] <jouncebot>	 In 0 hour(s) and 46 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1800)
[17:13:39] <jouncebot>	 In 0 hour(s) and 46 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1800)
[17:13:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[17:13:45] <wikibugs>	 SRE, serviceops: HTTP 429 error on private wikis trying to create account via Special:CreateAccount - https://phabricator.wikimedia.org/T359901#10401161 (Clement_Goubert) In progress→Resolved a:Clement_Goubert The problem should be resolved for all private wikis now.
[17:13:49] <bvibber>	 !log doing service deploy for chart-renderer (T382039)
[17:13:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:53] <stashbot>	 T382039: chart-renderer services update - https://phabricator.wikimedia.org/T382039
[17:14:15] <jinxer-wm>	 RESOLVED: ProbeDown: Service aqs1012-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1012-b:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:14:36] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mediawiki: truncate mercurius job names at 63 chars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102904 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan)
[17:14:47] <wikibugs>	 (03CR) 10Bvibber: [C:03+2] "merging for deploy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102903 (https://phabricator.wikimedia.org/T382039) (owner: 10Bvibber)
[17:14:50] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102905 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza)
[17:15:04] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mediawiki: truncate mercurius job names at 63 chars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102904 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan)
[17:15:05] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1271.eqiad.wmnet with reason: host reimage
[17:15:36] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[17:15:48] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add Blunderbuss firewall rule to GitLab runner set [puppet] - 10https://gerrit.wikimedia.org/r/1101925 (https://phabricator.wikimedia.org/T371994) (owner: 10Aleksandar Mastilovic)
[17:16:03] <wikibugs>	 (03Merged) 10jenkins-bot: Update chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102903 (https://phabricator.wikimedia.org/T382039) (owner: 10Bvibber)
[17:16:59] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[17:17:25] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: truncate mercurius job names at 63 chars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102904 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan)
[17:18:16] <wikibugs>	 (03PS2) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860
[17:18:25] <logmsgbot>	 !log bvibber@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply
[17:18:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins)
[17:19:08] <logmsgbot>	 !log bvibber@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply
[17:19:47] <logmsgbot>	 !log bvibber@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply
[17:20:12] <wikibugs>	 (03PS3) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860
[17:20:18] <logmsgbot>	 !log bvibber@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply
[17:20:26] <logmsgbot>	 !log bvibber@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply
[17:20:30] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: Decommission kubernetes20[07-14].codfw.wmnet - https://phabricator.wikimedia.org/T379788#10401197 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[17:21:02] <logmsgbot>	 !log bvibber@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply
[17:21:26] <wikibugs>	 (03PS1) 10Hnowlan: mediawiki: truncate job names properly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102906 (https://phabricator.wikimedia.org/T371700)
[17:21:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mediawiki: truncate job names properly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102906 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan)
[17:22:05] <wikibugs>	 (03PS2) 10Hnowlan: mediawiki: truncate job names properly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102906 (https://phabricator.wikimedia.org/T371700)
[17:22:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins)
[17:23:28] <bvibber>	 !log chart-renderer deployment T382039 complete
[17:23:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:32] <stashbot>	 T382039: chart-renderer services update - https://phabricator.wikimedia.org/T382039
[17:25:40] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[17:27:09] <wikibugs>	 (03PS4) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860
[17:31:24] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4678/console" [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins)
[17:31:37] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mediawiki: truncate job names properly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102906 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan)
[17:32:45] <wikibugs>	 (03PS5) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860
[17:32:45] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mediawiki: truncate job names properly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102906 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan)
[17:33:49] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1271.eqiad.wmnet with OS bookworm
[17:34:42] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: truncate job names properly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102906 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan)
[17:35:33] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1272.eqiad.wmnet with OS bookworm
[17:35:41] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1272
[17:35:41] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1272
[17:37:04] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[17:37:10] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[17:37:25] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4679/console" [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins)
[17:38:53] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@c2d7e08]: Backfill pageview actor hourly 2024 12
[17:41:01] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[17:41:06] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[17:41:29] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[17:41:42] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[17:41:56] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@c2d7e08]: Backfill pageview actor hourly 2024 12 (duration: 03m 03s)
[17:51:11] <wikibugs>	 (03PS1) 10Michael Große: beta: enable updating link-suggestions from read-mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102909 (https://phabricator.wikimedia.org/T378536)
[17:53:57] <ottomata>	 !log killing wikidatawiki xml dump process to try to unstick it - T382084
[17:54:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:01] <stashbot>	 T382084: 20241201 wikidatawiki xml dump not progressing - https://phabricator.wikimedia.org/T382084
[17:54:25] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[17:54:32] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[17:54:39] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.01e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[17:55:46] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1272.eqiad.wmnet with reason: host reimage
[17:58:13] <wikibugs>	 (03PS1) 10Gmodena: dse-k8s: content-history: add kafka cluster domain [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102913 (https://phabricator.wikimedia.org/T381322)
[17:58:28] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1272.eqiad.wmnet with reason: host reimage
[17:58:31] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[18:00:05] <jouncebot>	 bd808: Time to snap out of that daydream and deploy Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1800).
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T1800)
[18:00:33] <wikibugs>	 (03PS1) 10Hnowlan: mediawiki: try to preserve superseded mercurius instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102915 (https://phabricator.wikimedia.org/T371700)
[18:08:47] <James_F>	 !log Running `mwscript-k8s -f -- extensions/WikiLambda/maintenance/updateSecondaryTables.php --wiki=wikifunctionswiki --zType Z4 --report --verbose`
[18:08:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:30] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:17:40] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1272.eqiad.wmnet with OS bookworm
[18:19:12] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1273.eqiad.wmnet with OS bookworm
[18:19:20] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1273
[18:19:20] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1273
[18:22:27] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1102887 (https://phabricator.wikimedia.org/T379259) (owner: 10Volans)
[18:22:27] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002
[18:22:32] <stashbot>	 T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420
[18:24:01] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] IDM update to Bitu 0.1.4 [dns] - 10https://gerrit.wikimedia.org/r/1102762 (owner: 10Slyngshede)
[18:28:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:29:12] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mediawiki: try to preserve superseded mercurius instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102915 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan)
[18:34:06] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Ammarpad - https://phabricator.wikimedia.org/T381851#10401485 (10Scott_French) @Ammarpad - FYI, @thcipriani is out this week, so the next update here will likely be next week. Thanks for your patience.
[18:38:31] <jinxer-wm>	 FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[18:39:26] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1273.eqiad.wmnet with reason: host reimage
[18:42:58] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1273.eqiad.wmnet with reason: host reimage
[18:45:58] <wikibugs>	 (03CR) 10SBassett: [C:03+1] Fix protocol for .well-known/change-password Apache rule [puppet] - 10https://gerrit.wikimedia.org/r/1101462 (https://phabricator.wikimedia.org/T381625) (owner: 10Gergő Tisza)
[19:02:12] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1273.eqiad.wmnet with OS bookworm
[19:02:55] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:03:59] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1274.eqiad.wmnet with OS bookworm
[19:04:06] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1274
[19:04:06] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1274
[19:12:00] <wikibugs>	 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10401554 (10BCornwall) 05Stalled→03In progress p:05Triage→03High
[19:12:38] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Remove defunct lvs cross-dc links in Netbox (lvs2011 & lvs2013) - https://phabricator.wikimedia.org/T381533#10401561 (10Papaul) a:03Papaul
[19:15:15] <wikibugs>	 (03PS6) 10Scott French: mw-api-int: add migration release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081450 (https://phabricator.wikimedia.org/T377040)
[19:15:16] <wikibugs>	 (03PS6) 10Scott French: mw-api-int: remove "migration" release values overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081452 (https://phabricator.wikimedia.org/T377040)
[19:15:16] <wikibugs>	 (03PS6) 10Scott French: mediawiki: add remaining migration releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082863 (https://phabricator.wikimedia.org/T377040)
[19:15:16] <wikibugs>	 (03PS6) 10Scott French: mediawiki: remove migration release overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082864 (https://phabricator.wikimedia.org/T377040)
[19:15:17] <wikibugs>	 (03PS2) 10Scott French: mw-(apt-ext|api-int|jobrunner|parsoid|web): set php.version to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101121 (https://phabricator.wikimedia.org/T377040)
[19:15:40] <wikibugs>	 (03PS5) 10Scott French: hieradata: add "migration" release of mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1081451 (https://phabricator.wikimedia.org/T377040)
[19:15:40] <wikibugs>	 (03PS4) 10Scott French: hieradata: add remaining "migration" releases [puppet] - 10https://gerrit.wikimedia.org/r/1082865 (https://phabricator.wikimedia.org/T377040)
[19:15:40] <wikibugs>	 (03PS2) 10Scott French: hieradata: switch all "migration" releases to 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1101122 (https://phabricator.wikimedia.org/T377040)
[19:16:30] <swfrench-wmf>	 jouncebot: nowandnext
[19:16:30] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 43 minute(s)
[19:16:30] <jouncebot>	 In 1 hour(s) and 43 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T2100)
[19:18:02] <swfrench-wmf>	 unless there are any objections, I might merge some changes shortly that will require a no-diff scap deployment to actuate
[19:20:13] <jinxer-wm>	 FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[19:20:51] <wikibugs>	 (03CR) 10Dzahn: "is it expected that https://query-scholarly.wikidata.org/querybuilder is a 404?" [puppet] - 10https://gerrit.wikimedia.org/r/1102320 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto)
[19:21:38] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "looks like a noop, yea" [puppet] - 10https://gerrit.wikimedia.org/r/1102320 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto)
[19:22:31] <swfrench-wmf>	 moving head
[19:22:39] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mw-api-int: add migration release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081450 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French)
[19:23:47] <wikibugs>	 (03Merged) 10jenkins-bot: mw-api-int: add migration release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081450 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French)
[19:24:08] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1274.eqiad.wmnet with reason: host reimage
[19:25:14] <jinxer-wm>	 FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[19:27:41] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1274.eqiad.wmnet with reason: host reimage
[19:27:48] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[19:27:54] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[19:30:30] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[19:30:34] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[19:31:09] <wikibugs>	 (03CR) 10Scott French: [C:03+2] hieradata: add "migration" release of mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1081451 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French)
[19:33:55] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10401633 (10MoritzMuehlenhoff)
[19:35:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1102762 (owner: 10Slyngshede)
[19:40:18] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: Deployment to populate mw-api-int migration release files - T377040
[19:40:23] <stashbot>	 T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040
[19:42:32] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: Deployment to populate mw-api-int migration release files - T377040 (duration: 02m 13s)
[19:43:36] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mw-api-int: remove "migration" release values overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081452 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French)
[19:44:40] <wikibugs>	 (03Merged) 10jenkins-bot: mw-api-int: remove "migration" release values overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081452 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French)
[19:46:43] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1274.eqiad.wmnet with OS bookworm
[19:48:28] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1275.eqiad.wmnet with OS bookworm
[19:48:36] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1275
[19:48:37] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1275
[19:48:42] <swfrench-wmf>	 all done for now on my end
[19:53:48] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[20:21:36] <wikibugs>	 (03PS1) 10CDanis: chart-renderer: enable probedown paging [puppet] - 10https://gerrit.wikimedia.org/r/1102955 (https://phabricator.wikimedia.org/T372081)
[20:29:17] <wikibugs>	 (03PS2) 10CDanis: chart-renderer: enable probedown paging [puppet] - 10https://gerrit.wikimedia.org/r/1102955 (https://phabricator.wikimedia.org/T372081)
[20:29:17] <wikibugs>	 (03PS1) 10CDanis: chart-renderer: probe the service [puppet] - 10https://gerrit.wikimedia.org/r/1102958 (https://phabricator.wikimedia.org/T372081)
[20:29:26] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1102958 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis)
[20:32:56] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1275.eqiad.wmnet with OS bookworm
[20:32:56] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=1) rolling reimage on P{wikikube-worker[1270-1275].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[20:37:19] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1275.eqiad.wmnet with OS bookworm
[20:37:22] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1275
[20:37:23] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1275
[20:38:29] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker127[6-7].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[20:40:41] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1276.eqiad.wmnet with OS bookworm
[20:40:50] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1276
[20:40:50] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1276
[20:41:26] <wikibugs>	 (03PS1) 10Cwhite: change metric types back to counters [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102959 (https://phabricator.wikimedia.org/T374050)
[20:42:45] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102959 (https://phabricator.wikimedia.org/T374050) (owner: 10Cwhite)
[20:43:45] <wikibugs>	 (03PS2) 10CDanis: chart-renderer: probe the service [puppet] - 10https://gerrit.wikimedia.org/r/1102958 (https://phabricator.wikimedia.org/T372081)
[20:43:45] <wikibugs>	 (03PS3) 10CDanis: chart-renderer: enable probedown paging [puppet] - 10https://gerrit.wikimedia.org/r/1102955 (https://phabricator.wikimedia.org/T372081)
[20:43:52] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1102958 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis)
[20:44:03] <icinga-wm>	 PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:47:15] <wikibugs>	 (03CR) 10CDanis: [C:03+2] chart-renderer: probe the service [puppet] - 10https://gerrit.wikimedia.org/r/1102958 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis)
[20:48:55] <wikibugs>	 (03PS1) 10Jasmine: wikikube: decommission 1 host [puppet] - 10https://gerrit.wikimedia.org/r/1102961
[20:56:40] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1275.eqiad.wmnet with reason: host reimage
[20:59:35] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] chart-renderer: enable probedown paging [puppet] - 10https://gerrit.wikimedia.org/r/1102955 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis)
[21:00:03] <wikibugs>	 (03CR) 10CDanis: [C:03+2] chart-renderer: enable probedown paging [puppet] - 10https://gerrit.wikimedia.org/r/1102955 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis)
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241212T2100).
[21:00:05] <jouncebot>	 kemayo, danisztls, and cwhite: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:06] <danisztls>	 o/
[21:00:13] <Kemayo>	 o7
[21:00:15] <cwhite>	 o/
[21:00:53] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1276.eqiad.wmnet with reason: host reimage
[21:02:06] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1275.eqiad.wmnet with reason: host reimage
[21:02:19] * MichaelG_WMF is also around
[21:05:31] <wikibugs>	 (03PS2) 10Jasmine: wikikube: decommission 1 host [puppet] - 10https://gerrit.wikimedia.org/r/1102961 (https://phabricator.wikimedia.org/T375842)
[21:05:49] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1276.eqiad.wmnet with reason: host reimage
[21:10:39] <Kemayo>	 I can go start pinging some slack channels to see if we can rustle up a deployer, I guess...
[21:13:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:15:28] <Kemayo>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: anyone actually here to run the window?
[21:19:28] <tgr|away>	 I can do it
[21:19:58] <Kemayo>	 tgr|away: Awesome, thanks!
[21:20:14] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] EditCheck: move checks to a sidebar [extensions/VisualEditor] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102885 (https://phabricator.wikimedia.org/T341308) (owner: 10DLynch)
[21:20:50] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1275.eqiad.wmnet with OS bookworm
[21:21:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102905 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza)
[21:22:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Reader Survey: Deploy on eswiki, dewiki and frwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102905 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza)
[21:22:32] <RoanKattouw>	 Sorry about that, it's my first day back from leave and I missed the ping 
[21:24:07] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "@aotto@wikimedia.org - This LGTM in the "does what it says on the tin" sense" [puppet] - 10https://gerrit.wikimedia.org/r/1063224 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[21:24:16] <Kemayo>	 RoanKattouw: no worries
[21:24:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102905 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza)
[21:25:04] <wikibugs>	 10SRE-Access-Requests, 06Data-Engineering: Add otto to analytics-admins posix user group - https://phabricator.wikimedia.org/T382099 (10Ottomata) 03NEW
[21:25:05] <icinga-wm>	 RECOVERY - BGP status on lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:25:06] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1276.eqiad.wmnet with OS bookworm
[21:25:18] <wikibugs>	 (03Merged) 10jenkins-bot: Reader Survey: Deploy on eswiki, dewiki and frwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102905 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza)
[21:25:19] <wikibugs>	 (03PS1) 10Ottomata: Add otto to analytics-admins posix user group [puppet] - 10https://gerrit.wikimedia.org/r/1102970 (https://phabricator.wikimedia.org/T382099)
[21:25:22] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1277.eqiad.wmnet with OS bookworm
[21:25:37] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1102905|Reader Survey: Deploy on eswiki, dewiki and frwiki. (T378660)]]
[21:25:41] <stashbot>	 T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660
[21:25:42] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1277
[21:25:42] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1277
[21:25:42] <wikibugs>	 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Add otto to analytics-admins posix user group - https://phabricator.wikimedia.org/T382099#10401795 (10Ottomata) I should already be in this group.  Proceeding to add myself.
[21:29:05] <icinga-wm>	 PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:29:47] <danisztls>	 tgr|away: thanks, looks good
[21:30:11] <wikibugs>	 (CR) CI reject: [V:-1] Add otto to analytics-admins posix user group [puppet] - https://gerrit.wikimedia.org/r/1102970 (https://phabricator.wikimedia.org/T382099) (owner: Ottomata)
[21:32:19] <wikibugs>	 (CR) Ottomata: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1102970 (https://phabricator.wikimedia.org/T382099) (owner: Ottomata)
[21:32:34] <logmsgbot>	 !log tgr@deploy2002 dani, tgr: Backport for [[gerrit:1102905|Reader Survey: Deploy on eswiki, dewiki and frwiki. (T378660)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:32:39] <stashbot>	 T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660
[21:32:52] <inflatador>	 !log bking@gitlab-runner2004 restart docker to troubleshoot missing iptables rules T371994
[21:32:55] <logmsgbot>	 !log tgr@deploy2002 dani, tgr: Continuing with sync
[21:32:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:55] <stashbot>	 T371994: Deploy the HDFS synchronizer (blunderbuss) service to the dse-k8s cluster - https://phabricator.wikimedia.org/T371994
[21:35:24] <inflatador>	 !log bking@gitlab-runner2004 restart ferm to troubleshoot missing iptables rules T371994
[21:35:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:19] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102905|Reader Survey: Deploy on eswiki, dewiki and frwiki. (T378660)]] (duration: 12m 42s)
[21:38:23] <stashbot>	 T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660
[21:39:40] <wikibugs>	 (Merged) jenkins-bot: EditCheck: move checks to a sidebar [extensions/VisualEditor] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1102885 (https://phabricator.wikimedia.org/T341308) (owner: DLynch)
[21:44:49] <wikibugs>	 (CR) Gergő Tisza: [C:+2] change metric types back to counters [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1102959 (https://phabricator.wikimedia.org/T374050) (owner: Cwhite)
[21:45:43] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1102885|EditCheck: move checks to a sidebar (T341308 T379443)]]
[21:45:49] <stashbot>	 T341308: Present people with multiple reference checks when warranted  - https://phabricator.wikimedia.org/T341308
[21:45:49] <stashbot>	 T379443: Hide Vector 2022 Tools and Appearance menus when Edit Check has the potential to activate - https://phabricator.wikimedia.org/T379443
[21:46:12] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1277.eqiad.wmnet with reason: host reimage
[21:46:18] <wikibugs>	 (CR) Ottomata: [C:+2] Add otto to analytics-admins posix user group [puppet] - https://gerrit.wikimedia.org/r/1102970 (https://phabricator.wikimedia.org/T382099) (owner: Ottomata)
[21:47:09] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Add otto to analytics-admins posix user group - https://phabricator.wikimedia.org/T382099#10401827 (Ottomata) Open→Resolved
[21:49:24] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1277.eqiad.wmnet with reason: host reimage
[21:51:48] <wikibugs>	 (PS1) Jforrester: Provide a base iamge for Rust 1.63, based on Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1102983 (https://phabricator.wikimedia.org/T380807)
[22:02:16] <logmsgbot>	 !log tgr@deploy2002 tgr, kemayo: Backport for [[gerrit:1102885|EditCheck: move checks to a sidebar (T341308 T379443)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:02:25] <stashbot>	 T341308: Present people with multiple reference checks when warranted  - https://phabricator.wikimedia.org/T341308
[22:02:25] <stashbot>	 T379443: Hide Vector 2022 Tools and Appearance menus when Edit Check has the potential to activate - https://phabricator.wikimedia.org/T379443
[22:03:00] <Kemayo>	 tgr|away: I've checked on deploy2002 and it all looks good.
[22:03:51] <logmsgbot>	 !log tgr@deploy2002 tgr, kemayo: Continuing with sync
[22:04:35] <wikibugs>	 (Merged) jenkins-bot: change metric types back to counters [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1102959 (https://phabricator.wikimedia.org/T374050) (owner: Cwhite)
[22:06:38] <wikibugs>	 (CR) Kamila Součková: [C:+1] wikikube: decommission 1 host [puppet] - https://gerrit.wikimedia.org/r/1102961 (https://phabricator.wikimedia.org/T375842) (owner: Jasmine)
[22:08:02] <inflatador>	 !log bking@cumin2002 sudo cumin A:gitlab-runner 'systemctl restart ferm.service' T371994
[22:08:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:06] <stashbot>	 T371994: Deploy the HDFS synchronizer (blunderbuss) service to the dse-k8s cluster - https://phabricator.wikimedia.org/T371994
[22:09:05] <icinga-wm>	 RECOVERY - BGP status on lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:09:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1277.eqiad.wmnet with OS bookworm
[22:09:18] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker127[6-7].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[22:09:30] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:14:56] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102885|EditCheck: move checks to a sidebar (T341308 T379443)]] (duration: 29m 12s)
[22:15:01] <stashbot>	 T341308: Present people with multiple reference checks when warranted  - https://phabricator.wikimedia.org/T341308
[22:15:01] <stashbot>	 T379443: Hide Vector 2022 Tools and Appearance menus when Edit Check has the potential to activate - https://phabricator.wikimedia.org/T379443
[22:16:27] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1102959|change metric types back to counters (T374050)]]
[22:16:30] <stashbot>	 T374050: Migrate GrowthExperiments.NewcomerTask Module to statslib - https://phabricator.wikimedia.org/T374050
[22:20:26] <logmsgbot>	 !log tgr@deploy2002 tgr, cwhite: Backport for [[gerrit:1102959|change metric types back to counters (T374050)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:27:47] <tgr|away>	 cwhite: do you want to check on mwdebug?
[22:28:02] <cwhite>	 tgr|away: mwscript finished, looks good!
[22:30:20] <logmsgbot>	 !log tgr@deploy2002 tgr, cwhite: Continuing with sync
[22:35:37] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102959|change metric types back to counters (T374050)]] (duration: 19m 10s)
[22:35:41] <stashbot>	 T374050: Migrate GrowthExperiments.NewcomerTask Module to statslib - https://phabricator.wikimedia.org/T374050
[22:35:55] <cwhite>	 Thank you!
[22:36:21] <tgr|away>	 !log UTC late deploys done
[22:36:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:31] <jinxer-wm>	 FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[22:41:02] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Add an ownership field to cookbooks. - https://phabricator.wikimedia.org/T379258#10401925 (Volans)
[22:41:19] <wikibugs>	 (PS1) Mstyles: security-landing-page: deploying updates [deployment-charts] - https://gerrit.wikimedia.org/r/1102989 (https://phabricator.wikimedia.org/T381430)
[23:20:13] <jinxer-wm>	 FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[23:25:14] <jinxer-wm>	 FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[23:46:49] <icinga-wm>	 RECOVERY - MD RAID on aqs1014 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[23:58:31] <jinxer-wm>	 RESOLVED: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards