[00:13:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [00:24:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10404963 (10phaultfinder) [00:25:26] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:38:01] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1104401 [00:38:01] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1104401 (owner: 10TrainBranchBot) [00:38:04] 06SRE, 06DC-Ops, 10procurement: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219 (10Papaul) 03NEW [00:40:35] 06SRE, 06DC-Ops, 10procurement: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10404986 (10Papaul) [00:42:38] 06SRE, 06DC-Ops, 10procurement: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10404987 (10Papaul) [00:52:34] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1104401 (owner: 10TrainBranchBot) [01:09:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1104402 [01:10:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1104402 (owner: 10TrainBranchBot) 
[01:14:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10405011 (10phaultfinder) [01:30:01] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1104402 (owner: 10TrainBranchBot) [01:45:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101867 (https://phabricator.wikimedia.org/T381421) (owner: 10أنون) [01:49:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10405017 (10phaultfinder) [02:14:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10405022 (10phaultfinder) [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:54:30] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:06:20] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.032e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:20:14] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [03:25:14] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [04:13:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:24:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10405051 (10phaultfinder) [04:44:20] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 6482 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [05:53:00] (03PS4) 10Anzx: tigwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1103240 (https://phabricator.wikimedia.org/T381379) [05:54:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1103240 (https://phabricator.wikimedia.org/T381379) (owner: 10Anzx) [05:55:08] (03PS2) 10Anzx: tigwiki: add SITENAME, timezone and projectnamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1103159 (https://phabricator.wikimedia.org/T381379) [05:55:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1103159 (https://phabricator.wikimedia.org/T381379) (owner: 10Anzx) [06:06:04] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 216739160 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:07:04] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 19328 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:54:30] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:20:14] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [07:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10405106 (10phaultfinder) [07:25:14] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [07:39:45] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10405108 (10phaultfinder) [07:43:24] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2109.codfw.wmnet [07:44:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2109.codfw.wmnet [07:44:30] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2108.codfw.wmnet [07:45:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2108.codfw.wmnet [07:48:15] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2108.codfw.wmnet with OS bookworm [07:48:15] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2109.codfw.wmnet with OS bookworm [07:48:23] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2109 [07:48:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2109 [07:48:34] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2108 [07:48:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2108 [07:50:33] (03CR) 10أنون: [C:03+1] "It was a mistake, I told them to unschedule it in IRC during the deployment task" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101867 (https://phabricator.wikimedia.org/T381421) (owner: 10أنون) [07:52:14] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:53:54] (03CR) 10أنون: [C:03+1] [enwikinews] & [hewikinews] & [plwikinews]: Upgrade license to CC BY 4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101867 (https://phabricator.wikimedia.org/T381421) (owner: 10أنون) [08:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241216T0800). nyaa~ [08:00:05] hubaishan, lolekek, and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:02:14] o/ [08:02:20] o/ [08:04:37] hello [08:05:01] anzx: lolekek: I will deploy the patches [08:05:10] ok [08:05:13] thank you! [08:06:20] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2109.codfw.wmnet with reason: host reimage [08:06:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101867 (https://phabricator.wikimedia.org/T381421) (owner: 10أنون) [08:07:34] (03Merged) 10jenkins-bot: [enwikinews] & [hewikinews] & [plwikinews]: Upgrade license to CC BY 4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101867 (https://phabricator.wikimedia.org/T381421) (owner: 10أنون) [08:08:33] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1101867|[enwikinews] & [hewikinews] & [plwikinews]: Upgrade license to CC BY 4.0 (T381421)]] [08:08:36] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2108.codfw.wmnet with reason: host reimage [08:08:36] T381421: Change default license on en.wikinews and pl.wikinews to cc-by-4.0 on December 16, 2024 - https://phabricator.wikimedia.org/T381421 [08:09:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on wikikube-worker2109.codfw.wmnet with reason: host reimage [08:10:25] anzx: looks like I will do both of your patches at the same time [08:10:34] (03CR) 10Muehlenhoff: [C:03+2] webperf: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1103318 (owner: 10Muehlenhoff) [08:10:46] hashar: nice [08:10:47] Thanks hashar! [08:10:57] and https://phabricator.wikimedia.org/T381379#10404723 states the interwiki tig: does not work [08:11:11] I will refresh the cache once the two patches have been deployed [08:11:51] Amir1 said they would do it today, if you could then np [08:12:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2108.codfw.wmnet with reason: host reimage [08:13:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:15:15] (03CR) 10Slyngshede: [C:03+2] Finetune request dialogue [software/bitu] - 10https://gerrit.wikimedia.org/r/1103300 (owner: 10Muehlenhoff) [08:15:27] (03CR) 10Slyngshede: [C:03+2] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1103300 (owner: 10Muehlenhoff) [08:19:54] that takes a while [08:21:45] !log hashar@deploy2002 hashar, anwon: Backport for [[gerrit:1101867|[enwikinews] & [hewikinews] & [plwikinews]: Upgrade license to CC BY 4.0 (T381421)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:21:49] T381421: Change default license on en.wikinews and pl.wikinews to cc-by-4.0 on December 16, 2024 - https://phabricator.wikimedia.org/T381421 [08:22:02] !log hashar@deploy2002 hashar, anwon: Continuing with sync [08:22:27] (03Merged) 10jenkins-bot: Finetune request 
dialogue [software/bitu] - 10https://gerrit.wikimedia.org/r/1103300 (owner: 10Muehlenhoff) [08:22:27] (03Merged) 10jenkins-bot: Allow the comment to be left empty in permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1103303 (owner: 10Muehlenhoff) [08:23:15] I don't understand why we had a fully brand new image built [08:23:16] docker-registry.wikimedia.org/mediawiki-httpd latest e8e9c0915e1f 25 hours ago 175MB [08:23:17] oh [08:23:31] so the base image got rebuilt for some reason on Sunday [08:23:35] which invalidates the whole chain [08:23:41] of layers [08:24:22] and that is probably the primary explanation for the 20+ minutes long backport [08:28:29] !log restarting blazegraph on wdqs2017 (BlazegraphFreeAllocatorsDecreasingRapidly) [08:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2109.codfw.wmnet with OS bookworm [08:30:26] 08:30:11 K8s deployment progress: 67% (ok: 1633; fail: 0; left: 803) \ [08:30:27] :b [08:32:24] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101867|[enwikinews] & [hewikinews] & [plwikinews]: Upgrade license to CC BY 4.0 (T381421)]] (duration: 23m 51s) [08:32:28] T381421: Change default license on en.wikinews and pl.wikinews to cc-by-4.0 on December 16, 2024 - https://phabricator.wikimedia.org/T381421 [08:33:19] lolekek: the wikinews should show CC BY 4.0 now :) [08:33:26] anzx: I am doing your two patches now [08:33:37] Thank you hashar, lgtm with wmdebug! 
[08:33:53] ok [08:34:08] lolekek: oh I have sent them straight to prod :) [08:34:18] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:34:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2108.codfw.wmnet with OS bookworm [08:34:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1103240 (https://phabricator.wikimedia.org/T381379) (owner: 10Anzx) [08:34:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1103159 (https://phabricator.wikimedia.org/T381379) (owner: 10Anzx) [08:35:20] (03Merged) 10jenkins-bot: tigwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1103240 (https://phabricator.wikimedia.org/T381379) (owner: 10Anzx) [08:35:22] (03Merged) 10jenkins-bot: tigwiki: add SITENAME, timezone and projectnamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1103159 (https://phabricator.wikimedia.org/T381379) (owner: 10Anzx) [08:35:41] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1103240|tigwiki: add logos (T381379)]], [[gerrit:1103159|tigwiki: add SITENAME, timezone and projectnamespace (T381379)]] [08:35:45] T381379: Post-creation work for tigwiki - https://phabricator.wikimedia.org/T381379 [08:38:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:39:28] !log hashar@deploy2002 anzx, hashar: Backport for 
[[gerrit:1103240|tigwiki: add logos (T381379)]], [[gerrit:1103159|tigwiki: add SITENAME, timezone and projectnamespace (T381379)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:39:40] ah [08:39:42] hashar: testing [08:40:15] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2108.codfw.wmnet [08:40:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2108.codfw.wmnet [08:40:36] hashar: looks good [08:40:41] !log hashar@deploy2002 anzx, hashar: Continuing with sync [08:40:48] thank you for the test [08:41:01] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2109.codfw.wmnet [08:41:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2109.codfw.wmnet [08:41:06] (03CR) 10Muehlenhoff: [C:03+2] yarn: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1100476 (owner: 10Muehlenhoff) [08:41:23] I have never ever heard of that language (Tigre / https://en.wikipedia.org/wiki/Tigre_language ) [08:41:39] which I guess is the benefit of doing the configuration of new wikis, one discovers new languages :) [08:44:07] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2107.codfw.wmnet [08:44:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2107.codfw.wmnet [08:45:02] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2102.codfw.wmnet [08:45:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2102.codfw.wmnet [08:47:19] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1103240|tigwiki: add logos (T381379)]], [[gerrit:1103159|tigwiki: add SITENAME, timezone and projectnamespace (T381379)]] 
(duration: 11m 37s) [08:47:23] T381379: Post-creation work for tigwiki - https://phabricator.wikimedia.org/T381379 [08:47:46] anzx: that should be live in production now [08:47:48] both patches [08:48:00] hashar: thank you [08:48:24] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2102.codfw.wmnet with OS bookworm [08:48:25] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2107.codfw.wmnet with OS bookworm [08:48:45] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2107 [08:48:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2107 [08:48:45] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2102 [08:48:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2102 [08:51:54] (03PS1) 10Hashar: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104593 (https://phabricator.wikimedia.org/T381379) [08:52:08] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:52:20] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:52:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104593 (https://phabricator.wikimedia.org/T381379) (owner: 10Hashar) [08:52:32] anzx: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1104593 will update the interwikis [08:53:05] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104593 
(https://phabricator.wikimedia.org/T381379) (owner: 10Hashar) [08:53:24] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1104593|Update interwiki cache (T381379)]] [08:53:25] there is a last patch [config] 1104392 (deploy commands) [arwikisource] Enable the SandboxLink extension - task T382218 [08:53:28] T381379: Post-creation work for tigwiki - https://phabricator.wikimedia.org/T381379 [08:53:28] T382218: install Extension:SandboxLink on arwikisource - https://phabricator.wikimedia.org/T382218 [08:55:23] 07SRE-Unowned: The ops-maint-gcal.js script is missing support for some vendors - https://phabricator.wikimedia.org/T381680#10405230 (10fgiunchedi) Thank you @Scott_French! I'm happy to help with ops-maint-gcal.js changes, feel free to send reviews my way [08:56:33] (03CR) 10Filippo Giunchedi: [C:03+1] thanos: query-frontend: set labels.response-cache-config in systemd [puppet] - 10https://gerrit.wikimedia.org/r/1103364 (owner: 10Herron) [08:56:45] (03CR) 10Filippo Giunchedi: [C:03+1] thanos: query-frontend: enable query-range.align-range-with-step [puppet] - 10https://gerrit.wikimedia.org/r/1103365 (owner: 10Herron) [08:56:50] (03CR) 10Filippo Giunchedi: [C:03+1] thanos: query-frontend: remove max_item_size cache setting [puppet] - 10https://gerrit.wikimedia.org/r/1103352 (owner: 10Herron) [08:57:38] !log hashar@deploy2002 hashar: Backport for [[gerrit:1104593|Update interwiki cache (T381379)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:57:46] hashar: interwiki works now with wmdebug [08:58:16] <3 [08:58:31] !log hashar@deploy2002 hashar: Continuing with sync [08:58:44] thank you for the verification! 
[09:00:26] PROBLEM - Check whether ferm is active by checking the default input chain on mw2356 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:00:36] PROBLEM - Check whether ferm is active by checking the default input chain on mw2371 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:00:36] PROBLEM - Check whether ferm is active by checking the default input chain on mw2336 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:04:01] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1104593|Update interwiki cache (T381379)]] (duration: 10m 36s) [09:04:05] T381379: Post-creation work for tigwiki - https://phabricator.wikimedia.org/T381379 [09:05:45] hashar: thanks again, for fixing interwiki cache [09:06:07] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2107.codfw.wmnet with reason: host reimage [09:07:05] anzx: thank you very much for your assistance! 
[09:07:38] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2102.codfw.wmnet with reason: host reimage [09:09:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2107.codfw.wmnet with reason: host reimage [09:10:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104392 (https://phabricator.wikimedia.org/T382218) (owner: 10Hubaishan) [09:11:42] (03Merged) 10jenkins-bot: [arwikisource] Enable the SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104392 (https://phabricator.wikimedia.org/T382218) (owner: 10Hubaishan) [09:11:59] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1104392|[arwikisource] Enable the SandboxLink extension (T382218)]] [09:12:04] T382218: install Extension:SandboxLink on arwikisource - https://phabricator.wikimedia.org/T382218 [09:12:27] (03PS1) 10Slyngshede: Disallow emails as username [software/bitu] - 10https://gerrit.wikimedia.org/r/1104594 (https://phabricator.wikimedia.org/T382226) [09:13:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2102.codfw.wmnet with reason: host reimage [09:16:24] !log hashar@deploy2002 hashar, hubaishan: Backport for [[gerrit:1104392|[arwikisource] Enable the SandboxLink extension (T382218)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:16:58] !log hashar@deploy2002 hashar, hubaishan: Continuing with sync [09:22:23] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1104392|[arwikisource] Enable the SandboxLink extension (T382218)]] (duration: 10m 23s) [09:22:27] T382218: install Extension:SandboxLink on arwikisource - https://phabricator.wikimedia.org/T382218 [09:25:20] (03CR) 10Muehlenhoff: [C:03+2] Add ferm macro/nftables set for loadbalancer nodes [puppet] - 
10https://gerrit.wikimedia.org/r/1098936 (owner: 10Muehlenhoff) [09:29:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2107.codfw.wmnet with OS bookworm [09:29:29] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:30:21] RECOVERY - Check whether ferm is active by checking the default input chain on mw2356 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:30:31] RECOVERY - Check whether ferm is active by checking the default input chain on mw2371 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:30:31] RECOVERY - Check whether ferm is active by checking the default input chain on mw2336 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:31:05] (03PS2) 10Muehlenhoff: cloudcontrol/codfw1dev:: Enable profile::auto_restarts::service for apache2 [puppet] - 10https://gerrit.wikimedia.org/r/1098962 (https://phabricator.wikimedia.org/T135991) [09:31:31] (03PS3) 10Muehlenhoff: cloudweb: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1098556 [09:32:12] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:32:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2102.codfw.wmnet with OS bookworm [09:37:52] (03PS4) 10Muehlenhoff: cloudweb: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1098556 [09:38:34] !log UTC morning backport window has been completed [09:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:22] (03CR) 10Muehlenhoff: cloudweb: Restrict access to Envoy port (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1098556 (owner: 10Muehlenhoff) [09:44:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098556 (owner: 10Muehlenhoff) [09:56:35] (03Abandoned) 10Muehlenhoff: Switch idp_test to nftables [puppet] - 10https://gerrit.wikimedia.org/r/969138 (owner: 10Muehlenhoff) [10:14:18] (03PS1) 10DCausse: cirrussearch: increase shard count for cebwiki_content [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104598 (https://phabricator.wikimedia.org/T379002) [10:15:33] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes20[36-39] to wikikube-worker20(47|66|85|86) [puppet] - 10https://gerrit.wikimedia.org/r/1102710 (https://phabricator.wikimedia.org/T379788) (owner: 10Jelto) [10:24:04] (03CR) 10Muehlenhoff: "Looks good, two nits inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/1104594 (https://phabricator.wikimedia.org/T382226) (owner: 10Slyngshede) [10:25:38] (03CR) 10DCausse: [C:03+1] opensearch: Add resource to define cross-cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1091326 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [10:25:47] (03CR) 10DCausse: [C:03+1] opensearch: Add resource to log busy threads [puppet] - 10https://gerrit.wikimedia.org/r/1091327 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [10:31:00] (03PS2) 10Slyngshede: Disallow emails as username [software/bitu] - 10https://gerrit.wikimedia.org/r/1104594 (https://phabricator.wikimedia.org/T382226) [10:31:53] (03CR) 10Slyngshede: Disallow emails as username (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1104594 (https://phabricator.wikimedia.org/T382226) (owner: 10Slyngshede) [10:34:02] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1104594 (https://phabricator.wikimedia.org/T382226) (owner: 10Slyngshede) [10:34:47] (03PS1) 10Elukey: profile::maps::osm_master: allow kartotherian to run on Wikikube [puppet] - 
10https://gerrit.wikimedia.org/r/1104601 (https://phabricator.wikimedia.org/T216826) [10:43:48] (03PS2) 10Elukey: profile::maps::osm_master: allow kartotherian to run on Wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1104601 (https://phabricator.wikimedia.org/T216826) [10:43:57] (03PS1) 10Btullis: Move some cephosd hieradata into profile default files [puppet] - 10https://gerrit.wikimedia.org/r/1104603 (https://phabricator.wikimedia.org/T378735) [10:44:46] (03CR) 10Urbanecm: [C:04-1] "issue: the variable is not present in ext-GrowthExperiments.php, which makes it harder to notice one shouldn't enable it in production. ca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102909 (https://phabricator.wikimedia.org/T378536) (owner: 10Michael Große) [10:44:49] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4684/console" [puppet] - 10https://gerrit.wikimedia.org/r/1104603 (https://phabricator.wikimedia.org/T378735) (owner: 10Btullis) [10:48:15] (03CR) 10Hnowlan: [C:03+1] mediawiki: add remaining migration releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082863 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [10:50:45] !log installing postgresql-15 security updates [10:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:20] (03PS2) 10Michael Große: beta: enable updating link-suggestions from read-mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102909 (https://phabricator.wikimedia.org/T378536) [10:51:27] (03CR) 10Michael Große: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102909 (https://phabricator.wikimedia.org/T378536) (owner: 10Michael Große) [10:51:45] (03CR) 10Hnowlan: [C:03+2] mediawiki: try to preserve superseded mercurius instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102915 (https://phabricator.wikimedia.org/T371700) (owner: 
10Hnowlan) [10:54:12] (03Merged) 10jenkins-bot: mediawiki: try to preserve superseded mercurius instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102915 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [10:54:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:55:34] (03PS2) 10Btullis: Move some cephosd hieradata into profile default files [puppet] - 10https://gerrit.wikimedia.org/r/1104603 (https://phabricator.wikimedia.org/T378735) [10:55:39] (03CR) 10Urbanecm: [C:03+1] "LGTM, thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102909 (https://phabricator.wikimedia.org/T378536) (owner: 10Michael Große) [10:56:22] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4685/console" [puppet] - 10https://gerrit.wikimedia.org/r/1104603 (https://phabricator.wikimedia.org/T378735) (owner: 10Btullis) [10:56:35] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2107.codfw.wmnet [10:56:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2107.codfw.wmnet [10:56:48] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2102.codfw.wmnet [10:56:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2102.codfw.wmnet [10:57:32] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [10:57:37] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [10:59:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - 
https://phabricator.wikimedia.org/T382002#10405633 (10phaultfinder) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241216T1100) [11:02:31] (03CR) 10Stevemunene: [C:03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1104603 (https://phabricator.wikimedia.org/T378735) (owner: 10Btullis) [11:03:05] (03PS3) 10Elukey: profile::maps::osm_master: allow kartotherian to run on Wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1104601 (https://phabricator.wikimedia.org/T216826) [11:07:05] (03CR) 10Btullis: [V:03+1 C:03+2] Move some cephosd hieradata into profile default files [puppet] - 10https://gerrit.wikimedia.org/r/1104603 (https://phabricator.wikimedia.org/T378735) (owner: 10Btullis) [11:08:04] (03PS4) 10Elukey: profile::maps::osm_master: allow kartotherian to run on Wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1104601 (https://phabricator.wikimedia.org/T216826) [11:09:19] (03PS1) 10Giuseppe Lavagetto: mediawiki: Add support for dumps persistent _volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104605 (https://phabricator.wikimedia.org/T352650) [11:14:52] !log installing NSS security updates [11:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:29] (03PS5) 10Elukey: profile::maps::osm_master: allow kartotherian to run on Wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1104601 (https://phabricator.wikimedia.org/T216826) [11:17:39] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1104601 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [11:19:14] (03PS6) 10Elukey: profile::maps::osm_master: allow kartotherian to run on Wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1104601 (https://phabricator.wikimedia.org/T216826) 
[11:20:14] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [11:20:26] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1104601 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [11:25:14] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [11:27:45] (03CR) 10Btullis: mediawiki: Add support for dumps persistent _volumes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104605 (https://phabricator.wikimedia.org/T352650) (owner: 10Giuseppe Lavagetto) [11:40:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102733 (https://phabricator.wikimedia.org/T380928) (owner: 10KartikMistry) [11:41:26] (03CR) 10Clément Goubert: mediawiki: Add support for dumps persistent _volumes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104605 (https://phabricator.wikimedia.org/T352650) (owner: 10Giuseppe Lavagetto) [11:48:34] (03CR) 10Jgiannelos: profile::maps::osm_master: allow kartotherian to run on Wikikube (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1104601 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [11:50:38] (03CR) 10Slyngshede: [C:03+2] Disallow emails as username 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1104594 (https://phabricator.wikimedia.org/T382226) (owner: 10Slyngshede) [11:54:28] (03Merged) 10jenkins-bot: Disallow emails as username [software/bitu] - 10https://gerrit.wikimedia.org/r/1104594 (https://phabricator.wikimedia.org/T382226) (owner: 10Slyngshede) [12:11:29] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2101.codfw.wmnet [12:12:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2101.codfw.wmnet [12:12:22] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2100.codfw.wmnet [12:12:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2100.codfw.wmnet [12:13:26] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1104601 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [12:13:55] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2100.codfw.wmnet with OS bookworm [12:13:56] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2101.codfw.wmnet with OS bookworm [12:14:15] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2100 [12:14:15] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2100 [12:14:15] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2101 [12:14:15] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2101 [12:14:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10405844 (10phaultfinder) [12:16:34] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1104612 (owner: 10L10n-bot) [12:17:35] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: reimage puppetmasters to puppetservers - https://phabricator.wikimedia.org/T345067#10405847 (10MoritzMuehlenhoff) 05In progress→03Resolved This is complete, the remaining puppetmaster* hosts are very old (procured in 2016/2018) and will be... [12:18:13] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:21:34] (03PS1) 10Btullis: cephosd: enable the deployment of client cephx keys and minimal ceph.conf [puppet] - 10https://gerrit.wikimedia.org/r/1104616 (https://phabricator.wikimedia.org/T378735) [12:21:55] (03CR) 10CI reject: [V:04-1] cephosd: enable the deployment of client cephx keys and minimal ceph.conf [puppet] - 10https://gerrit.wikimedia.org/r/1104616 (https://phabricator.wikimedia.org/T378735) (owner: 10Btullis) [12:22:24] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4692/console" [puppet] - 10https://gerrit.wikimedia.org/r/1104616 (https://phabricator.wikimedia.org/T378735) (owner: 10Btullis) [12:23:58] (03PS2) 10Btullis: cephosd: enable the deployment of client cephx keys and minimal ceph.conf [puppet] - 10https://gerrit.wikimedia.org/r/1104616 (https://phabricator.wikimedia.org/T378735) [12:24:18] (03CR) 10CI reject: [V:04-1] cephosd: enable the deployment of client cephx keys and minimal ceph.conf [puppet] - 10https://gerrit.wikimedia.org/r/1104616 (https://phabricator.wikimedia.org/T378735) (owner: 10Btullis) [12:31:37] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
wikikube-worker2100.codfw.wmnet with reason: host reimage [12:31:54] (03PS3) 10Btullis: cephosd: enable the deployment of client cephx keys and minimal ceph.conf [puppet] - 10https://gerrit.wikimedia.org/r/1104616 (https://phabricator.wikimedia.org/T378735) [12:32:03] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2101.codfw.wmnet with reason: host reimage [12:34:19] (03PS1) 10Btullis: cephosd: move the auth keydata into profile default [labs/private] - 10https://gerrit.wikimedia.org/r/1104620 (https://phabricator.wikimedia.org/T378735) [12:35:00] (03CR) 10Btullis: [V:03+2 C:03+2] cephosd: move the auth keydata into profile default [labs/private] - 10https://gerrit.wikimedia.org/r/1104620 (https://phabricator.wikimedia.org/T378735) (owner: 10Btullis) [12:35:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2100.codfw.wmnet with reason: host reimage [12:38:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2101.codfw.wmnet with reason: host reimage [12:39:53] (03PS1) 10Btullis: ml-lab: Add a cephx key and minimal ceph.conf to ml-lab servers [puppet] - 10https://gerrit.wikimedia.org/r/1104621 (https://phabricator.wikimedia.org/T378735) [12:41:06] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4693/co" [puppet] - 10https://gerrit.wikimedia.org/r/1104621 (https://phabricator.wikimedia.org/T378735) (owner: 10Btullis) [12:41:31] (03PS2) 10Btullis: ml-lab: Add a cephx key and minimal ceph.conf to ml-lab servers [puppet] - 10https://gerrit.wikimedia.org/r/1104621 (https://phabricator.wikimedia.org/T378735) [12:43:02] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4694/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1104621 (https://phabricator.wikimedia.org/T378735) (owner: 10Btullis) [12:44:56] (03PS1) 10Muehlenhoff: Switch contact address away from sre-foundation@wikimedia.org [software/bitu] - 10https://gerrit.wikimedia.org/r/1104623 [12:46:37] (03CR) 10Hnowlan: [C:03+1] profile::maps::osm_master: allow kartotherian to run on Wikikube (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1104601 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [12:49:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10405911 (10phaultfinder) [12:50:01] (03PS1) 10Michael Große: stats(surfacing): track link recommendation api recommendations [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1104624 (https://phabricator.wikimedia.org/T378536) [12:50:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1104624 (https://phabricator.wikimedia.org/T378536) (owner: 10Michael Große) [12:52:19] (03CR) 10Hnowlan: [C:03+1] Enable canShellboxGetTempUrl on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104398 (https://phabricator.wikimedia.org/T292322) (owner: 10Tim Starling) [12:55:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2100.codfw.wmnet with OS bookworm [12:58:19] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:59:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2101.codfw.wmnet with OS bookworm [13:03:31] (03PS1) 10Muehlenhoff: Copy puppet git hooks to puppetserver module [puppet] - 
10https://gerrit.wikimedia.org/r/1104626 (https://phabricator.wikimedia.org/T365798) [13:05:39] (03CR) 10CI reject: [V:04-1] Copy puppet git hooks to puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/1104626 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [13:06:25] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2101.codfw.wmnet [13:06:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2101.codfw.wmnet [13:06:49] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2100.codfw.wmnet [13:06:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2100.codfw.wmnet [13:07:45] (03PS1) 10Muehlenhoff: Puppetserver: Update hooks [puppet] - 10https://gerrit.wikimedia.org/r/1104627 (https://phabricator.wikimedia.org/T365798) [13:10:40] (03CR) 10CI reject: [V:04-1] stats(surfacing): track link recommendation api recommendations [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1104624 (https://phabricator.wikimedia.org/T378536) (owner: 10Michael Große) [13:10:41] (03PS2) 10DCausse: cirrussearch: increase shard count for cebwiki_content [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104598 (https://phabricator.wikimedia.org/T379002) [13:12:14] (03PS1) 10Jelto: miscweb: bump design-landing-page version for .git folder fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104628 (https://phabricator.wikimedia.org/T382230) [13:13:05] (03PS59) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) [13:13:32] (03PS1) 10Michael Große: Kick bundlesize out of package.json [extensions/Popups] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1104629 (https://phabricator.wikimedia.org/T382192) [13:13:46] 
(03PS2) 10Muehlenhoff: Copy puppet git hooks to puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/1104626 (https://phabricator.wikimedia.org/T365798) [13:14:22] (03CR) 10Michael Große: "This is needed to enable backporting to -wmf.6 today" [extensions/Popups] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1104629 (https://phabricator.wikimedia.org/T382192) (owner: 10Michael Große) [13:14:48] (03CR) 10Elukey: [V:03+1] profile::maps::osm_master: allow kartotherian to run on Wikikube (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1104601 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [13:14:55] (03PS2) 10Michael Große: stats(surfacing): track link recommendation api recommendations [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1104624 (https://phabricator.wikimedia.org/T378536) [13:16:43] !log dcausse@deploy2002 Started deploy [airflow-dags/search@c84bfa9]: search: add graph_name filtering [13:16:50] (03PS2) 10Muehlenhoff: Puppetserver: Update hooks [puppet] - 10https://gerrit.wikimedia.org/r/1104627 (https://phabricator.wikimedia.org/T365798) [13:17:13] !log dcausse@deploy2002 Finished deploy [airflow-dags/search@c84bfa9]: search: add graph_name filtering (duration: 00m 30s) [13:18:08] (03PS1) 10Filippo Giunchedi: thanos: default to -15d for sidecar min_time [puppet] - 10https://gerrit.wikimedia.org/r/1104630 (https://phabricator.wikimedia.org/T371087) [13:18:09] (03PS1) 10Filippo Giunchedi: prometheus: refactor common functionality [puppet] - 10https://gerrit.wikimedia.org/r/1104631 (https://phabricator.wikimedia.org/T371087) [13:18:11] (03PS1) 10Filippo Giunchedi: pontoon: fix bootstrap and other improvements [puppet] - 10https://gerrit.wikimedia.org/r/1104632 [13:19:03] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2099.codfw.wmnet [13:19:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool 
for host wikikube-worker2099.codfw.wmnet [13:19:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10405964 (10phaultfinder) [13:19:49] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2098.codfw.wmnet [13:19:50] (03PS1) 10Michael Große: fix(surfacing): Show highlights in lists as well [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1104633 (https://phabricator.wikimedia.org/T381841) [13:20:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2098.codfw.wmnet [13:20:54] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2099.codfw.wmnet with OS bookworm [13:20:54] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2098.codfw.wmnet with OS bookworm [13:21:14] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2099 [13:21:14] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2098 [13:21:14] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2098 [13:21:15] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2099 [13:21:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1104633 (https://phabricator.wikimedia.org/T381841) (owner: 10Michael Große) [13:21:23] (03CR) 10LSobanski: [C:03+1] miscweb: bump design-landing-page version for .git folder fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104628 (https://phabricator.wikimedia.org/T382230) (owner: 10Jelto) [13:21:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, 
December 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/Popups] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1104629 (https://phabricator.wikimedia.org/T382192) (owner: 10Michael Große) [13:21:55] (03CR) 10Jelto: [C:03+2] miscweb: bump design-landing-page version for .git folder fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104628 (https://phabricator.wikimedia.org/T382230) (owner: 10Jelto) [13:22:36] !log imported packages for mercurius 1.0.3 via reprepro [13:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:23] (03Merged) 10jenkins-bot: miscweb: bump design-landing-page version for .git folder fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104628 (https://phabricator.wikimedia.org/T382230) (owner: 10Jelto) [13:24:36] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:24:43] PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:24:58] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:25:08] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [13:25:19] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:25:27] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [13:25:41] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [13:26:06] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [13:27:39] (03CR) 10Jforrester: Provide a base image for Rust, based on Bookworm using 'rustc-web' now at 1.78 (031 comment) 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1102983 (https://phabricator.wikimedia.org/T380807) (owner: 10Jforrester) [13:32:13] (03PS1) 10Hnowlan: php8.1: bump images to pick up new mercurius [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1104637 (https://phabricator.wikimedia.org/T371701) [13:33:34] (03CR) 10Clément Goubert: [C:03+1] php8.1: bump images to pick up new mercurius [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1104637 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [13:37:16] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: fix bootstrap and other improvements [puppet] - 10https://gerrit.wikimedia.org/r/1104632 (owner: 10Filippo Giunchedi) [13:37:44] (03PS2) 10Filippo Giunchedi: pontoon: fix bootstrap and other improvements [puppet] - 10https://gerrit.wikimedia.org/r/1104632 [13:37:44] (03PS2) 10Filippo Giunchedi: thanos: default to -15d for sidecar min_time [puppet] - 10https://gerrit.wikimedia.org/r/1104630 (https://phabricator.wikimedia.org/T371087) [13:37:44] (03PS2) 10Filippo Giunchedi: prometheus: refactor common functionality [puppet] - 10https://gerrit.wikimedia.org/r/1104631 (https://phabricator.wikimedia.org/T371087) [13:38:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1104627 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [13:39:07] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2098.codfw.wmnet with reason: host reimage [13:39:26] (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1104623 (owner: 10Muehlenhoff) [13:39:31] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2099.codfw.wmnet with reason: host reimage [13:39:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10406020 (10phaultfinder) [13:40:03] (03CR) 10Muehlenhoff: 
[C:03+2] Switch contact address away from sre-foundation@wikimedia.org [software/bitu] - 10https://gerrit.wikimedia.org/r/1104623 (owner: 10Muehlenhoff) [13:41:08] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] pontoon: fix bootstrap and other improvements [puppet] - 10https://gerrit.wikimedia.org/r/1104632 (owner: 10Filippo Giunchedi) [13:41:53] 06SRE, 10Bitu, 06Infrastructure-Foundations: Bitu: Permission request state isn't refreshed if access has been revoked - https://phabricator.wikimedia.org/T382051#10406024 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:42:18] (03CR) 10Filippo Giunchedi: "Prep work for configuring Prometheus instances centrally" [puppet] - 10https://gerrit.wikimedia.org/r/1104630 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [13:42:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2098.codfw.wmnet with reason: host reimage [13:45:46] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1104631 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [13:46:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2099.codfw.wmnet with reason: host reimage [13:47:04] (03CR) 10Elukey: [V:03+1 C:03+2] profile::maps::osm_master: allow kartotherian to run on Wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1104601 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [13:49:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10406046 (10phaultfinder) [13:51:58] (03CR) 10Btullis: [C:03+2] cephosd: enable the deployment of client cephx keys and minimal ceph.conf [puppet] - 10https://gerrit.wikimedia.org/r/1104616 (https://phabricator.wikimedia.org/T378735) (owner: 10Btullis) 
[13:53:33] jouncebot: nowandnext [13:53:34] No deployments scheduled for the next 0 hour(s) and 6 minute(s) [13:53:34] In 0 hour(s) and 6 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241216T1400) [13:54:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1103380 (https://phabricator.wikimedia.org/T377929) (owner: 10Dreamy Jazz) [13:56:21] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [13:56:43] (03CR) 10Btullis: [V:03+1 C:03+2] ml-lab: Add a cephx key and minimal ceph.conf to ml-lab servers [puppet] - 10https://gerrit.wikimedia.org/r/1104621 (https://phabricator.wikimedia.org/T378735) (owner: 10Btullis) [13:56:55] I'll get started on my config change now as the wmf.6 backports seem that they will take a while. [13:57:07] So I can get mine done probably before the others have merged [13:57:15] (03CR) 10Dreamy Jazz: [C:03+2] Exclude autopromotion of temp IP viewer for users with specific global groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1103380 (https://phabricator.wikimedia.org/T377929) (owner: 10Dreamy Jazz) [13:57:35] (03CR) 10Filippo Giunchedi: "This LGTM, though IIRC from the task discussion the check itself didn't seem necessary/wanted ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: 10Slyngshede) [13:57:58] (03Merged) 10jenkins-bot: Exclude autopromotion of temp IP viewer for users with specific global groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1103380 (https://phabricator.wikimedia.org/T377929) (owner: 10Dreamy Jazz) [13:58:25] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [13:58:31] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus: cadvisor: declare dependency on network being online [puppet] - 10https://gerrit.wikimedia.org/r/1078349 (owner: 10Arturo Borrero Gonzalez) [13:59:10] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1103380|Exclude autopromotion of temp IP viewer for users with specific global groups (T377929)]] [13:59:14] T377929: Don't auto-promote users with global temporary account IP viewing rights into the local 'checkuser-temporary-account-viewer' group - https://phabricator.wikimedia.org/T377929 [13:59:18] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241216T1400). [14:00:05] MichaelG_WMF and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:28] o / [14:00:45] \o [14:01:21] Currently deploying my config patch as the wmf.6 backports probably won't merge until I've finished deploying my change. 
[14:01:22] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [14:02:10] (03CR) 10Giuseppe Lavagetto: [C:03+1] php8.1: bump images to pick up new mercurius [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1104637 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [14:03:27] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1103380|Exclude autopromotion of temp IP viewer for users with specific global groups (T377929)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:03:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2098.codfw.wmnet with OS bookworm [14:03:46] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [14:03:50] RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:04:36] (03PS1) 10Elukey: charts: get Kartotherian postgres password from env in config-map [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104644 [14:04:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10406125 (10phaultfinder) [14:07:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2099.codfw.wmnet with OS bookworm [14:07:34] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:07:47] (03CR) 10Elukey: [C:03+2] charts: get Kartotherian postgres password from env in config-map [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104644 (owner: 10Elukey) [14:08:59] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2098.codfw.wmnet [14:09:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host 
wikikube-worker2098.codfw.wmnet [14:09:06] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2099.codfw.wmnet [14:09:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2099.codfw.wmnet [14:09:15] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1103380|Exclude autopromotion of temp IP viewer for users with specific global groups (T377929)]] (duration: 10m 05s) [14:09:19] T377929: Don't auto-promote users with global temporary account IP viewing rights into the local 'checkuser-temporary-account-viewer' group - https://phabricator.wikimedia.org/T377929 [14:09:25] I'm done with my backport. [14:09:58] I'm not sure I can run the window in general as I have to go in a few mins. [14:10:35] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [14:10:48] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2097.codfw.wmnet [14:11:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2097.codfw.wmnet [14:11:31] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2096.codfw.wmnet [14:12:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2096.codfw.wmnet [14:12:37] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2097.codfw.wmnet with OS bookworm [14:12:38] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2096.codfw.wmnet with OS bookworm [14:12:57] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2097 [14:12:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2097 [14:12:57] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host 
wikikube-worker2096 [14:12:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2096 [14:14:01] @Dreamy_Jazz don't worry I'll look for someone [14:15:01] Lucas_WMDE TheresNoTime do you happen to be around to run the backport window? [14:15:28] (03PS1) 10Elukey: profile::maps::osm_replica: allow kartotherian from k8s [puppet] - 10https://gerrit.wikimedia.org/r/1104649 (https://phabricator.wikimedia.org/T216826) [14:15:50] PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:17:28] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4696/co" [puppet] - 10https://gerrit.wikimedia.org/r/1104649 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [14:18:14] (03CR) 10Andrew Bogott: [C:03+2] "Jesse, are you objecting to this patch or just suggesting additional" [puppet] - 10https://gerrit.wikimedia.org/r/1099748 (https://phabricator.wikimedia.org/T381293) (owner: 10Andrew Bogott) [14:18:39] what's up MichaelG_WMF [14:19:01] @Amir1 the backports: [14:19:15] popup one only so that CI works [14:19:32] PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:19:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10406190 (10phaultfinder) [14:19:47] and then the two growth ones [14:19:59] (03CR) 10Ladsgroup: [C:03+2] Kick bundlesize out of package.json [extensions/Popups] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1104629 (https://phabricator.wikimedia.org/T382192) (owner: 10Michael Große) [14:20:03] one is testable and the other one is only about statistics [14:20:39] !log 
elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [14:21:00] (03PS1) 10Btullis: cephosd: Open the ceph daemon ports to the ANALYTICS_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/1104650 (https://phabricator.wikimedia.org/T380279) [14:21:16] PROBLEM - Host ripe-atlas-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:57] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4697/co" [puppet] - 10https://gerrit.wikimedia.org/r/1104650 (https://phabricator.wikimedia.org/T380279) (owner: 10Btullis) [14:22:04] cool [14:22:46] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS bullseye [14:22:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10406196 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1... [14:24:30] (03CR) 10Krinkle: [C:03+1] Enable canShellboxGetTempUrl on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104398 (https://phabricator.wikimedia.org/T292322) (owner: 10Tim Starling) [14:29:52] MichaelG_WMF: Do you want me to deploy all three together? [14:30:09] sure! 
[14:30:10] (03PS1) 10Muehlenhoff: Enable signups.validators.IsUsernameEmail validator [puppet] - 10https://gerrit.wikimedia.org/r/1104651 (https://phabricator.wikimedia.org/T382226) [14:30:15] @Amir1 [14:30:21] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2097.codfw.wmnet with reason: host reimage [14:30:24] (03CR) 10Ladsgroup: [C:03+2] stats(surfacing): track link recommendation api recommendations [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1104624 (https://phabricator.wikimedia.org/T378536) (owner: 10Michael Große) [14:30:28] (03CR) 10Ladsgroup: [C:03+2] fix(surfacing): Show highlights in lists as well [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1104633 (https://phabricator.wikimedia.org/T381841) (owner: 10Michael Große) [14:30:37] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2096.codfw.wmnet with reason: host reimage [14:30:45] (03Merged) 10jenkins-bot: Kick bundlesize out of package.json [extensions/Popups] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1104629 (https://phabricator.wikimedia.org/T382192) (owner: 10Michael Große) [14:32:11] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1104649 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [14:32:36] @Amir1: the Popup is a production noop, so I don't care about it either way. The GE stats one I will only see after a while, once the new metrics show up, so "no errors" is all the testing that is possible there right now. The GE fix one can actually be tested and I already have the wiki articles for it ready [14:32:45] (03CR) 10Andrew Bogott: [C:03+2] "whoops, nevermind! 
Started to write that before I noticed you had already +1'd" [puppet] - 10https://gerrit.wikimedia.org/r/1099748 (https://phabricator.wikimedia.org/T381293) (owner: 10Andrew Bogott) [14:33:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1104633 (https://phabricator.wikimedia.org/T381841) (owner: 10Michael Große) [14:33:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1104624 (https://phabricator.wikimedia.org/T378536) (owner: 10Michael Große) [14:33:09] (03CR) 10Elukey: [V:03+1 C:03+2] profile::maps::osm_replica: allow kartotherian from k8s [puppet] - 10https://gerrit.wikimedia.org/r/1104649 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [14:33:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2097.codfw.wmnet with reason: host reimage [14:35:00] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [14:35:00] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [14:35:52] MichaelG_WMF: it's being merged [14:36:00] once done, all deployed together [14:36:11] @Amir1 🙏 [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2096.codfw.wmnet with reason: host reimage [14:37:31] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:38:12] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:39:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10406244 (10phaultfinder) [14:40:27] o/ [14:40:31] * Lucas_WMDE is around now if needed [14:41:20] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [14:43:48] (03CR) 10Alexandros Kosiaris: [C:03+1] php8.1: bump images to pick up new mercurius [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1104637 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [14:44:00] (03CR) 10Krinkle: [C:04-1] Profiler: centralize metrics send to a function (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 (owner: 10Cwhite) [14:44:34] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1104650 (https://phabricator.wikimedia.org/T380279) (owner: 10Btullis) [14:45:01] (03CR) 10Btullis: [V:03+1 C:03+2] cephosd: Open the ceph daemon ports to the ANALYTICS_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/1104650 (https://phabricator.wikimedia.org/T380279) (owner: 10Btullis) [14:46:18] (03PS1) 10Andrew Bogott: profile::puppet::agent: actually pass facts_soft_limit to puppet::agent [puppet] - 10https://gerrit.wikimedia.org/r/1104656 (https://phabricator.wikimedia.org/T381293) [14:47:47] (03CR) 10Slyngshede: [C:04-1] Enable signups.validators.IsUsernameEmail validator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1104651 (https://phabricator.wikimedia.org/T382226) (owner: 10Muehlenhoff) 
[14:49:02] !log kartik@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:50:12] (03CR) 10Ottomata: [C:03+1] dse-k8s: content-history: add kafka cluster domain [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102913 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [14:50:42] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1104656 (https://phabricator.wikimedia.org/T381293) (owner: 10Andrew Bogott) [14:50:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:51:24] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [14:52:21] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [14:54:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:40] (03CR) 10Elukey: [C:03+1] Puppetserver: Update hooks [puppet] - 10https://gerrit.wikimedia.org/r/1104627 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [14:54:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2097.codfw.wmnet with OS bookworm [14:54:54] !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[14:54:55] RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:39] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: SystemdUnitFailed (instance idm-test1001:9100) - https://phabricator.wikimedia.org/T381947#10406286 (10SLyngshede-WMF) p:05Triage→03Low a:03SLyngshede-WMF [14:56:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2096.codfw.wmnet with OS bookworm [14:57:07] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2096.codfw.wmnet [14:57:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2096.codfw.wmnet [14:57:18] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2097.codfw.wmnet [14:57:20] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2097.codfw.wmnet [14:57:58] The changes on master were merged in under 25 min. 
So, I'm hoping that CI will be done any moment now 🤞 [14:58:46] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2095.codfw.wmnet [14:58:52] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2094.codfw.wmnet [14:59:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2095.codfw.wmnet [15:01:37] (03Abandoned) 10WMDE-Fisch: (WIP) kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006) (owner: 10Effie Mouzeli) [15:02:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2094.codfw.wmnet [15:02:25] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [15:02:30] (03CR) 10Andrew Bogott: [C:03+2] profile::puppet::agent: actually pass facts_soft_limit to puppet::agent [puppet] - 10https://gerrit.wikimedia.org/r/1104656 (https://phabricator.wikimedia.org/T381293) (owner: 10Andrew Bogott) [15:02:34] (03Merged) 10jenkins-bot: stats(surfacing): track link recommendation api recommendations [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1104624 (https://phabricator.wikimedia.org/T378536) (owner: 10Michael Große) [15:03:00] (03CR) 10Hnowlan: [V:03+2 C:03+2] php8.1: bump images to pick up new mercurius [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1104637 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [15:03:56] (03CR) 10Ebernhardson: [C:03+1] cirrussearch: increase shard count for cebwiki_content [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104598 (https://phabricator.wikimedia.org/T379002) (owner: 10DCausse) [15:04:03] (03Merged) 10jenkins-bot: fix(surfacing): Show highlights in lists as well [extensions/GrowthExperiments] (wmf/1.44.0-wmf.6) - 
10https://gerrit.wikimedia.org/r/1104633 (https://phabricator.wikimedia.org/T381841) (owner: 10Michael Große) [15:04:23] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1104629|Kick bundlesize out of package.json (T382192 T360590)]], [[gerrit:1104633|fix(surfacing): Show highlights in lists as well (T381841)]], [[gerrit:1104624|stats(surfacing): track link recommendation api recommendations (T378536)]] [15:04:33] T382192: CI is broken for lots of MW repos due to extension/Popups explicitly requiring node 18 - https://phabricator.wikimedia.org/T382192 [15:04:34] T360590: [EPIC] Create a set of tests for defining a performance budget - https://phabricator.wikimedia.org/T360590 [15:04:34] T381841: [wmf.6] Surfacing Add link highlight is not shown in some sections - https://phabricator.wikimedia.org/T381841 [15:04:35] T378536: Surfacing structured tasks: Create a proof of concept solution for generating Add Link suggestions on-the-fly - https://phabricator.wikimedia.org/T378536 [15:05:07] (03CR) 10Herron: [C:03+1] "👍" [puppet] - 10https://gerrit.wikimedia.org/r/1104630 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [15:06:16] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2094.codfw.wmnet with OS bookworm [15:06:16] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2095.codfw.wmnet with OS bookworm [15:06:34] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2094 [15:06:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2094 [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:05] !log jelto@cumin1002 START - Cookbook 
sre.hosts.move-vlan for host wikikube-worker2095 [15:07:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2095 [15:07:25] (03CR) 10Herron: [C:03+1] "good call lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1104631 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [15:08:15] (03CR) 10Tiziano Fogli: [C:03+1] prometheus: refactor common functionality [puppet] - 10https://gerrit.wikimedia.org/r/1104631 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [15:09:08] !log ladsgroup@deploy2002 migr, ladsgroup: Backport for [[gerrit:1104629|Kick bundlesize out of package.json (T382192 T360590)]], [[gerrit:1104633|fix(surfacing): Show highlights in lists as well (T381841)]], [[gerrit:1104624|stats(surfacing): track link recommendation api recommendations (T378536)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:09:18] * MichaelG_WMF is testing now [15:10:27] @Amir1 - it works as expected 👍 [15:10:31] !log ladsgroup@deploy2002 migr, ladsgroup: Continuing with sync [15:10:39] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:11:11] PROBLEM - BGP status on lsw1-b5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:12:22] 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06Traffic, 10Sustainability (Incident Followup): Research allowing read-only access to the superset api from requestctl's web UI - https://phabricator.wikimedia.org/T379718#10406370 (10Joe) I would frankly go with option 1 so we are more flexible - I... 
[15:15:54] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1104629|Kick bundlesize out of package.json (T382192 T360590)]], [[gerrit:1104633|fix(surfacing): Show highlights in lists as well (T381841)]], [[gerrit:1104624|stats(surfacing): track link recommendation api recommendations (T378536)]] (duration: 11m 30s) [15:15:56] (03CR) 10Muehlenhoff: Enable signups.validators.IsUsernameEmail validator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1104651 (https://phabricator.wikimedia.org/T382226) (owner: 10Muehlenhoff) [15:16:02] T382192: CI is broken for lots of MW repos due to extension/Popups explicitly requiring node 18 - https://phabricator.wikimedia.org/T382192 [15:16:02] T360590: [EPIC] Create a set of tests for defining a performance budget - https://phabricator.wikimedia.org/T360590 [15:16:03] T381841: [wmf.6] Surfacing Add link highlight is not shown in some sections - https://phabricator.wikimedia.org/T381841 [15:16:03] T378536: Surfacing structured tasks: Create a proof of concept solution for generating Add Link suggestions on-the-fly - https://phabricator.wikimedia.org/T378536 [15:16:28] done MichaelG_WMF [15:16:32] sorry I was in a meeting [15:16:49] (03CR) 10Slyngshede: [C:03+1] "LGTM, after reading the commit message." [puppet] - 10https://gerrit.wikimedia.org/r/1104651 (https://phabricator.wikimedia.org/T382226) (owner: 10Muehlenhoff) [15:17:03] @Amir1 All good, thank you for running it! 💚 [15:17:17] (03CR) 10Herron: [C:03+2] thanos: query-frontend: remove max_item_size cache setting [puppet] - 10https://gerrit.wikimedia.org/r/1103352 (owner: 10Herron) [15:18:06] (03CR) 10JHathaway: [C:03+1] "no worries, I should have been more explicit in my message, apologies for dragging this one out." 
[puppet] - 10https://gerrit.wikimedia.org/r/1099748 (https://phabricator.wikimedia.org/T381293) (owner: 10Andrew Bogott) [15:18:19] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [15:18:24] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [15:19:44] 07Puppet, 06cloud-services-team, 10Toolforge, 13Patch-For-Review: Too many puppet facts on toolforge k8s workers - https://phabricator.wikimedia.org/T381293#10406389 (10Andrew) 05Open→03Resolved This warning is no longer displayed, and having lots of facts doesn't seem to actually break anything. [15:19:48] (03PS3) 10Filippo Giunchedi: thanos: default to -15d for sidecar min_time [puppet] - 10https://gerrit.wikimedia.org/r/1104630 (https://phabricator.wikimedia.org/T371087) [15:19:48] (03PS3) 10Filippo Giunchedi: prometheus: refactor common functionality [puppet] - 10https://gerrit.wikimedia.org/r/1104631 (https://phabricator.wikimedia.org/T371087) [15:21:00] 06SRE, 06Infrastructure-Foundations, 07Kubernetes, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): aux-k8s-codfw cluster setup - https://phabricator.wikimedia.org/T381417#10406399 (10CDanis) p:05Triage→03Medium [15:21:23] (03PS11) 10Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) [15:21:35] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [15:21:51] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [15:22:50] (03CR) 10Kamila Součková: create sre.k8s.roll-reimage-nodes (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [15:23:26] (03PS1) 10Elukey: charts: fix kartotherian's http probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104669 [15:24:04] (03CR) 10Tiziano Fogli: [C:03+1] thanos: default to -15d 
for sidecar min_time [puppet] - 10https://gerrit.wikimedia.org/r/1104630 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [15:24:56] (03CR) 10Elukey: [C:03+2] charts: fix kartotherian's http probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104669 (owner: 10Elukey) [15:25:59] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2094.codfw.wmnet with reason: host reimage [15:26:40] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2095.codfw.wmnet with reason: host reimage [15:27:37] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1104612 (owner: 10L10n-bot) [15:27:39] 06SRE, 06Infrastructure-Foundations, 10Mail: Log tls cipher information - https://phabricator.wikimedia.org/T381927#10406412 (10jhathaway) p:05Triage→03Low [15:27:43] (03CR) 10CI reject: [V:04-1] create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [15:28:41] (03PS2) 10Daimona Eaytoy: tables-catalog: Update path of MW core schema [puppet] - 10https://gerrit.wikimedia.org/r/1103609 (https://phabricator.wikimedia.org/T382030) [15:28:44] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Update path of MW core schema [puppet] - 10https://gerrit.wikimedia.org/r/1103609 (https://phabricator.wikimedia.org/T382030) (owner: 10Daimona Eaytoy) [15:29:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10406419 (10phaultfinder) [15:30:04] 06SRE, 06Infrastructure-Foundations: puppetserver* thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10406421 (10CDanis) 05Open→03Resolved node_memory_Cached_bytes is a good proxy for overall memory pressure and looks good since October, I think we can close... 
[15:30:45] (03CR) 10JMeybohm: create sre.k8s.roll-reimage-nodes (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [15:32:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2094.codfw.wmnet with reason: host reimage [15:33:32] (03PS1) 10Hnowlan: mediawiki: move helm keep annotation to job level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104671 (https://phabricator.wikimedia.org/T371701) [15:34:26] (03CR) 10Scott French: [C:03+1] mediawiki: move helm keep annotation to job level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104671 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [15:34:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2095.codfw.wmnet with reason: host reimage [15:34:52] (03CR) 10Giuseppe Lavagetto: [C:03+1] mediawiki: move helm keep annotation to job level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104671 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [15:37:01] (03CR) 10Hnowlan: [C:03+2] mediawiki: move helm keep annotation to job level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104671 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [15:37:13] jouncebot: nowandnext [15:37:13] No deployments scheduled for the next 0 hour(s) and 52 minute(s) [15:37:13] In 0 hour(s) and 52 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241216T1630) [15:37:31] I am going to do a sync-world in a few minutes to roll out new versions of the php8 base images [15:39:06] (03Merged) 10jenkins-bot: mediawiki: move helm keep annotation to job level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104671 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [15:40:59] !log hnowlan@deploy2002 Started scap 
sync-world: Rebuild and deploy to pick up new php8.1 base [15:42:22] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1012.eqiad.wmnet with OS bullseye [15:42:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10406488 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1012.... [15:46:13] RECOVERY - BGP status on lsw1-b5-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:49:30] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: PowerSupplyFailure Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T380479#10406510 (10Jhancock.wm) Hey Andrew, let me know when you are free this week. [15:49:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10406511 (10phaultfinder) [15:50:13] PROBLEM - BGP status on lsw1-b5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:51:13] RECOVERY - BGP status on lsw1-b5-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:51:39] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:51:40] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [15:51:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2094.codfw.wmnet with OS bookworm [15:51:42] (03CR) 10Scott French: [C:03+1] Enable canShellboxGetTempUrl on testwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1104398 (https://phabricator.wikimedia.org/T292322) (owner: 10Tim Starling) [15:51:45] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [15:54:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2095.codfw.wmnet with OS bookworm [15:54:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [15:54:35] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2094.codfw.wmnet [15:54:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2094.codfw.wmnet [15:54:42] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2095.codfw.wmnet [15:54:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2095.codfw.wmnet [16:00:17] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [16:00:25] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [16:02:54] (03PS1) 10Herron: thanos-rule: route queries to thanos query-frontend [puppet] - 10https://gerrit.wikimedia.org/r/1104678 [16:03:28] !log hnowlan@deploy2002 Finished scap sync-world: Rebuild and deploy to pick up new php8.1 base (duration: 23m 06s) [16:04:12] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [16:04:17] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [16:05:17] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [16:05:21] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [16:05:25] (03CR) 10Herron: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4699/co" [puppet] - 10https://gerrit.wikimedia.org/r/1104678 (owner: 10Herron) [16:07:49] (03CR) 10Filippo Giunchedi: [C:03+1] thanos-rule: route queries to thanos query-frontend [puppet] - 10https://gerrit.wikimedia.org/r/1104678 (owner: 10Herron) [16:08:27] (03CR) 10Herron: [V:03+1 C:03+2] thanos-rule: route queries to thanos query-frontend [puppet] - 10https://gerrit.wikimedia.org/r/1104678 (owner: 10Herron) [16:10:57] FIRING: PuppetZeroResources: Puppet has failed generate resources on elastic2088:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:28:58] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104683 (https://phabricator.wikimedia.org/T128546) [16:29:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10406641 (10phaultfinder) [16:30:04] jan_drewniak: May I have your attention please! Wikimedia Portals Update. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241216T1630) [16:32:19] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104683 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:33:32] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104683 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:37:26] !log installing ipmitool bugfix updates from Bookworm point release [16:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:27] (03CR) 10Herron: [C:03+2] thanos: query-frontend: set labels.response-cache-config in systemd [puppet] - 10https://gerrit.wikimedia.org/r/1103364 (owner: 10Herron) [16:44:21] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10406693 (10MoritzMuehlenhoff) [16:44:36] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1104683| Bumping portals to master (T128546)]] (duration: 09m 25s) [16:44:40] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:46:49] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10406698 (10Arnoldokoth) 05Open→03Resolved [16:47:02] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1104683| Bumping portals to master (T128546)]] (duration: 02m 25s) [16:47:42] FIRING: JobUnavailable: Reduced availability for job thanos-query-frontend in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:50:04] PROBLEM - 
PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan2002.codfw.wmnet are marked down but pooled: thanos-web_443: Servers titan2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:50:04] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan2002.codfw.wmnet are marked down but pooled: thanos-web_443: Servers titan2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:50:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:51:17] !incidents [16:51:17] 5544 (UNACKED) ProbeDown sre (10.2.1.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 codfw) [16:51:23] !ack 5544 [16:51:24] 5544 (ACKED) ProbeDown sre (10.2.1.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 codfw) [16:52:07] I think I know whats up there [16:52:27] herron: ah, thanks! 
was just starting to look for hefty queries [16:53:06] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:53:06] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:53:27] yeah should recover in a few [16:54:14] great, thank you - yeah, looks like pybal monitors are happy again [16:55:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:55:45] (03PS1) 10Herron: Revert "thanos: query-frontend: set labels.response-cache-config in systemd" [puppet] - 10https://gerrit.wikimedia.org/r/1104688 [16:55:53] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on elastic2088:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:56:04] (03CR) 10CI reject: [V:04-1] Revert "thanos: query-frontend: set labels.response-cache-config in systemd" [puppet] - 10https://gerrit.wikimedia.org/r/1104688 (owner: 10Herron) [16:56:07] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:56:50] (03PS2) 10Herron: Revert "thanos: query-frontend: set labels.response-cache-config in systemd" [puppet] - 10https://gerrit.wikimedia.org/r/1104688 [16:57:50] (03CR) 10Herron: [C:03+2] Revert "thanos: query-frontend: set labels.response-cache-config in systemd" [puppet] - 10https://gerrit.wikimedia.org/r/1104688 (owner: 
10Herron) [17:02:42] FIRING: [2x] JobUnavailable: Reduced availability for job thanos-query-frontend in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:07:40] (03PS1) 10Herron: thanos: query-frontend: set labels.response-cache-config-file in systemd [puppet] - 10https://gerrit.wikimedia.org/r/1104690 [17:11:23] (03PS2) 10Herron: thanos: query-frontend: set labels.response-cache-config-file in systemd [puppet] - 10https://gerrit.wikimedia.org/r/1104690 [17:11:46] jouncebot: nowandnext [17:11:46] No deployments scheduled for the next 0 hour(s) and 48 minute(s) [17:11:46] In 0 hour(s) and 48 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241216T1800) [17:11:46] In 0 hour(s) and 48 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241216T1800) [17:12:30] (03CR) 10Herron: [C:03+2] thanos: query-frontend: set labels.response-cache-config-file in systemd [puppet] - 10https://gerrit.wikimedia.org/r/1104690 (owner: 10Herron) [17:13:02] unless there are any objections, I'm going to run scap shortly to pick up a recent change to the releases repository [17:15:43] !log swfrench@deploy2002 Started scap sync-world: Deployment to pick up debug image changes - T381473 [17:15:48] T381473: Generate a dumps-enabled mediawiki image - https://phabricator.wikimedia.org/T381473 [17:17:42] FIRING: [2x] JobUnavailable: Reduced availability for job thanos-query-frontend in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:21:26] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply [17:21:39] !log 
hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply [17:22:32] !log swfrench@deploy2002 Finished scap sync-world: Deployment to pick up debug image changes - T381473 (duration: 06m 49s) [17:22:36] T381473: Generate a dumps-enabled mediawiki image - https://phabricator.wikimedia.org/T381473 [17:22:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job thanos-query-frontend in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:27:36] (03CR) 10Herron: [C:03+2] thanos: query-frontend: enable query-range.align-range-with-step [puppet] - 10https://gerrit.wikimedia.org/r/1103365 (owner: 10Herron) [17:27:48] (03PS2) 10Herron: thanos: query-frontend: enable query-range.align-range-with-step [puppet] - 10https://gerrit.wikimedia.org/r/1103365 [17:28:59] (03CR) 10Giuseppe Lavagetto: [C:03+1] kubernetes: add mw-videoscaler to scap deployments [puppet] - 10https://gerrit.wikimedia.org/r/1101887 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [17:33:38] (03CR) 10Herron: [V:03+2 C:03+2] thanos: query-frontend: enable query-range.align-range-with-step [puppet] - 10https://gerrit.wikimedia.org/r/1103365 (owner: 10Herron) [17:36:01] jouncebot: nowandnext [17:36:01] No deployments scheduled for the next 0 hour(s) and 23 minute(s) [17:36:01] In 0 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241216T1800) [17:36:01] In 0 hour(s) and 23 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241216T1800) [17:36:57] I'll be doing another image rebuild sync-world [17:37:25] (03CR) 10Hnowlan: [C:03+2] kubernetes: add mw-videoscaler to scap deployments [puppet] - 10https://gerrit.wikimedia.org/r/1101887 
(https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [17:40:44] (03PS1) 10Hnowlan: mw-videoscaler: pick up scap config file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104691 (https://phabricator.wikimedia.org/T371700) [17:41:01] !log dancy@deploy2002 Installing scap version "4.133.0" for 213 host(s) [17:42:33] (03CR) 10Scott French: [C:03+1] mw-videoscaler: pick up scap config file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104691 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [17:42:49] (03CR) 10Hnowlan: [C:03+2] mw-videoscaler: pick up scap config file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104691 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [17:42:50] (03CR) 10Kamila Součková: [C:03+1] mw-videoscaler: pick up scap config file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104691 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [17:42:51] (03CR) 10Clément Goubert: [C:03+1] mw-videoscaler: pick up scap config file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104691 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [17:43:58] hnowlan: I'm installing a new version of scap right now. It'll be done in a couple of minutes. [17:44:37] dancy: ack, thanks [17:45:49] !log dancy@deploy2002 Installing scap version "4.133.0" for 1 host(s) [17:46:42] !log dancy@deploy2002 Installation of scap version "4.133.0" completed for 1 hosts [17:46:56] hnowlan: Clear [17:47:42] dancy: thanks! [17:58:55] o/ swfrench-wmf [18:00:05] swfrench-wmf: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241216T1800). [18:00:05] ottomata: A patch you scheduled for MediaWiki infrastructure (UTC late) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:05] ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241216T1800). [18:00:26] (03PS1) 10Hnowlan: Revert "kubernetes: add mw-videoscaler to scap deployments" [puppet] - 10https://gerrit.wikimedia.org/r/1104702 [18:01:44] ottomata: hey there! so, we currently in the process of undoing something, ETA 10m or so [18:01:48] okay [18:01:57] i wait! ty! [18:02:03] (03CR) 10Scott French: [C:03+1] Revert "kubernetes: add mw-videoscaler to scap deployments" [puppet] - 10https://gerrit.wikimedia.org/r/1104702 (owner: 10Hnowlan) [18:02:13] thanks for your patience :) [18:02:43] (03CR) 10Hnowlan: [C:03+2] Revert "kubernetes: add mw-videoscaler to scap deployments" [puppet] - 10https://gerrit.wikimedia.org/r/1104702 (owner: 10Hnowlan) [18:07:43] just waiting on puppet runs [18:08:11] sounds good, thanks hnowlan [18:09:55] ottomata: did you have a preferred mwdebug host to test against once your change is merged? [18:10:44] done, apologies for the delay! [18:10:44] i have a partiality for mwdebug1002 :) [18:10:58] swfrench-wmf: i can drive if you like, I was just hoping you could be here to babysit me [18:11:01] :) [18:11:47] ottomata: that totally works too :) [18:18:03] oh sorry! [18:18:06] that was a done ping! [18:18:09] okay okay1 [18:18:24] following https://phabricator.wikimedia.org/T353817#10401812 [18:18:27] mering [18:18:28] ah, I missed that as well [18:18:39] (03CR) 10Ottomata: [C:03+2] Rewrite mediawiki.org/beacon/event to /beacon/event/index.php [puppet] - 10https://gerrit.wikimedia.org/r/1063224 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [18:20:29] swfrench-wmf: i think to test this i need to bypass varnish [18:20:59] ottomata: so, you can curl from the deployment host (or another production host), right? [18:21:07] yes [18:21:16] but what url do I hit? if I do mediawiki.org will it bypass? 
[18:21:34] you can use `curl` with `--connect-to` as one option [18:21:41] oh [18:21:50] or you can use the handy httpbb check you made :) [18:22:22] oh that bypasses? huh yeah i guess it would have to [18:23:45] happy to workshop / review the test command if that helps [18:24:02] it works! [18:24:08] deploy1002 /home/otto/httpbb.yaml [18:24:14] httpbb ./httpbb.yaml --hosts=mwdebug1002.eqiad.wmnet [18:24:20] PASS: 13 requests sent to mwdebug1002.eqiad.wmnet. All assertions passed. [18:24:20] vs [18:24:31] 18:24:17 [@deploy2002:/home/otto] $ httpbb ./httpbb.yaml --hosts=mwdebug1001.eqiad.wmnet [18:24:34] Status code: expected 204, got 404. [18:24:37] so looking good! [18:24:57] and I get the test event in kafka too [18:24:59] okay! [18:25:12] nice! [18:25:42] running puppet on deploy1002 [18:25:56] then will proceed with scap sync-world --pause-after-testserver-sync and testing at mwdebug.discovery.wmnet [18:26:10] ottomata: so you'll want to be using deploy2022 [18:26:12] *2002 [18:26:16] oh [18:26:17] i am on that [18:26:22] sorry [18:26:26] ah, never mind :) [18:26:30] i use the deployment.eqiad cname [18:29:56] doing scap sync-world --pause-after-testserver-sync now [18:30:36] ack, great [18:31:08] !log otto@deploy2002 Started scap sync-world: T353817 - Apache rewrite mediawiki.org/beacon/event to /beacon/event/index.php [18:31:12] T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 [18:35:00] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [18:35:10] !log otto@deploy2002 otto: T353817 - Apache rewrite mediawiki.org/beacon/event to /beacon/event/index.php synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) 
[18:35:14] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [18:36:15] swfrench-wmf: okay am at Changes synced to the testservers. step [18:36:54] my httpbb test isn't working... [18:36:55] 18:35:23 [@deploy2002:/home/otto] $ httpbb ./httpbb.yaml --hosts=mwdebug.discovery.wmnet [18:37:17] ERRORS: 13 requests attempted to mwdebug.discovery.wmnet. Errors connecting to 1 host. [18:37:17] you need to add a port: `--https_port=4444` [18:37:21] ah okay [18:37:32] PASS: 13 requests sent to mwdebug.discovery.wmnet. All assertions passed. [18:37:34] beautiful [18:37:35] okay [18:37:39] fantastic [18:37:42] and i get the event in kafka :) [18:37:55] proceeding [18:37:58] !log otto@deploy2002 otto: Continuing with sync [18:38:21] okay joining a meeting, will watch [18:38:23] thank you swfrench-wmf ! [18:38:34] ottomata: thanks for doing all the work! :) [18:38:38] i'll do some more testing once this is out, and then prep a vcl patch to review. I'll get the traffic team to help me with that in the new year [18:38:57] sounds good [18:39:53] swfrench-wmf: error in output, but deployment is proceeding. is this normal? [18:40:00] https://www.irccloud.com/pastebin/jRykITar/ [18:40:33] just got that two more times for two different things again [18:41:05] that error is not normal, no - I'll take a look [18:41:20] not a consequence of your change, though [18:41:24] also yeah [18:41:26] https://www.irccloud.com/pastebin/nWgBb7Aa/ [18:41:29] stuff like this... 
[18:41:51] deployment still progressing though [18:42:41] yeah, so if I had to guess, if there's a transient issue with connectivity to the k8s control plane in eqiad (looks like that might be happening here), all the various spots where scap will attempt to parse subcommand outputs may fail [18:42:58] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers wikikube-ctrl1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:43:27] yeah, there we are [18:43:58] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:44:23] 10SRE-Access-Requests, 06Data-Platform-SRE: Create Kerberos identity for Jimmy Ly - https://phabricator.wikimedia.org/T381986#10407167 (10Ahoelzl) [18:48:42] okay swfrench-wmf deployment finished. [18:48:44] https://www.irccloud.com/pastebin/MQ0Sbniw/ [18:49:39] ottomata: thanks for the heads up - still looking at what happened, but in any case, I'll take it from here once I have a chance to look around [18:50:00] if you could hold off on merging your httpbb patch for a while, that would be good [18:50:20] (since not all k8s deployments may have the new image yet if some failed) [18:50:38] okay thanks! [18:50:49] swfrench-wmf: you can merge that at will when you think it's good to go [18:50:52] thank you so much! 
[18:51:01] cool, can do [18:54:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:56:09] alright, for folks following along, it looks like `kube-apiserver-safe-restart.service` happened to get notified (and bounce the API server) on wikikube-ctrl1003 mid-deployment [18:57:24] (03PS1) 10Eevans: restbase1028: canary Cassandra 4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1104716 (https://phabricator.wikimedia.org/T380420) [18:58:20] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1104716 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [19:00:35] jouncebot: nowandnext [19:00:35] No deployments scheduled for the next 1 hour(s) and 59 minute(s) [19:00:35] In 1 hour(s) and 59 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241216T2100) [19:01:13] FYI, I'm going to continue past the end of the window a bit to clear any latent diffs from the issue above ^^ [19:01:32] (03CR) 10Eevans: [C:03+2] restbase1028: canary Cassandra 4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1104716 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [19:03:00] !log swfrench@deploy2002 Started scap sync-world: T353817 - Additional deployment to clear remaining diffs [19:03:04] T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 [19:05:51] !log swfrench@deploy2002 Finished scap sync-world: T353817 - Additional deployment to clear remaining diffs (duration: 02m 51s) [19:07:12] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1028.eqiad.wmnet: Upgrading to Cassandra 4.1.7 — T380420 - 
eevans@cumin1002 [19:07:16] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [19:08:59] (03CR) 10Scott French: [C:03+1] "Thanks again for adding this!" [puppet] - 10https://gerrit.wikimedia.org/r/1103366 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [19:09:01] (03CR) 10Scott French: [C:03+2] httpbb - add mediawiki.org/beacon/event test for legacy EventLogging beacon [puppet] - 10https://gerrit.wikimedia.org/r/1103366 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [19:16:34] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1028.eqiad.wmnet: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [19:16:38] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [19:26:28] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [19:27:18] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Swift [19:28:48] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [19:29:38] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Swift [19:33:38] !log joal@deploy2002 Started deploy [airflow-dags/analytics@afda9d9]: Airflow analytics backfill deploy [airflow-dags/analytics@afda9d9a] [19:36:37] !log joal@deploy2002 Finished deploy [airflow-dags/analytics@afda9d9]: Airflow analytics backfill deploy [airflow-dags/analytics@afda9d9a] (duration: 02m 58s) [19:56:38] (03PS1) 10Muehlenhoff: Remove tarlogic1 from admin accounts [puppet] - 10https://gerrit.wikimedia.org/r/1104725 [20:25:18] (03PS1) 10Bking: wdqs-internal-main: add wdqs1025 to LB 
pool [puppet] - 10https://gerrit.wikimedia.org/r/1104726 (https://phabricator.wikimedia.org/T376150) [20:38:39] (03PS2) 10Urbanecm: [Growth] Make the typage campaign not specific to 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102350 (https://phabricator.wikimedia.org/T380405) [20:45:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [20:51:41] (03PS2) 10Scott French: ops-maint-gcal.js: truncate message details [software] - 10https://gerrit.wikimedia.org/r/1104727 (https://phabricator.wikimedia.org/T381680) [20:51:41] (03CR) 10Scott French: "Happy to go about this in a different way if preferred. Thanks in advance for the review!" [software] - 10https://gerrit.wikimedia.org/r/1104727 (https://phabricator.wikimedia.org/T381680) (owner: 10Scott French) [20:55:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101158 (https://phabricator.wikimedia.org/T224851) (owner: 10Pppery) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241216T2100). [21:00:05] Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:07] here [21:05:32] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS bullseye [21:05:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10407485 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1... [21:09:10] Pppery: do you need a deployer? [21:09:20] yes [21:09:45] alrighty [21:09:54] (03PS2) 10Pppery: Update VisualEditor config to drop exclusions based on Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101158 (https://phabricator.wikimedia.org/T224851) [21:10:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101158 (https://phabricator.wikimedia.org/T224851) (owner: 10Pppery) [21:11:00] (03Merged) 10jenkins-bot: Update VisualEditor config to drop exclusions based on Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101158 (https://phabricator.wikimedia.org/T224851) (owner: 10Pppery) [21:11:16] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1101158|Update VisualEditor config to drop exclusions based on Flow (T224851)]] [21:11:21] T224851: Please centralize enwiki's feedback for VisualEditor - https://phabricator.wikimedia.org/T224851 [21:15:40] !log cjming@deploy2002 cjming, pppery: Backport for [[gerrit:1101158|Update VisualEditor config to drop exclusions based on Flow (T224851)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:15:43] Pppery: on test servers if you'd like to check [21:15:48] on it [21:16:11] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage [21:17:38] Tested it on enwiki, confirmed it worked. 
Assuming the other wikis are similarly situated (and nobody has used the feedback feature there in ages) [21:17:42] TLDR proceed [21:17:48] cool [21:17:51] !log cjming@deploy2002 cjming, pppery: Continuing with sync [21:19:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage [21:23:12] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101158|Update VisualEditor config to drop exclusions based on Flow (T224851)]] (duration: 11m 56s) [21:23:16] T224851: Please centralize enwiki's feedback for VisualEditor - https://phabricator.wikimedia.org/T224851 [21:23:28] Thanks [21:23:37] yw! [21:24:52] !log end of UTC late backport window [21:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:10] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [21:36:39] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [21:36:41] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1012.eqiad.wmnet with OS bullseye [21:36:43] (03PS1) 10Pppery: Configure new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104732 (https://phabricator.wikimedia.org/T381379) [21:36:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10407575 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1012.... 
[21:57:49] (03PS9) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [21:58:09] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [22:00:05] Reedy, sbassett, Maryum, and manfredi: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241216T2200). [22:09:56] (03CR) 10BryanDavis: php: Allow provisioning MediaWiki with PHP 8.1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [22:10:23] (03PS14) 10BryanDavis: php: Allow provisioning MediaWiki with PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) [22:11:10] (03CR) 10Scott French: [C:03+1] Enable canShellboxGetTempUrl everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101239 (https://phabricator.wikimedia.org/T292322) (owner: 10Tim Starling) [22:14:44] (03CR) 10Ryan Kemper: [C:03+2] wdqs-internal-main: add wdqs1025 to LB pool [puppet] - 10https://gerrit.wikimedia.org/r/1104726 (https://phabricator.wikimedia.org/T376150) (owner: 10Bking) [22:16:58] (03PS2) 10Scott French: trafficserver: validate production config in tests [puppet] - 10https://gerrit.wikimedia.org/r/1101104 (https://phabricator.wikimedia.org/T377042) [22:17:18] (03PS10) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [22:22:30] !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: cluster=wdqs-internal-main,service=wdqs-main [22:23:43] (03PS11) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [22:24:03] (03CR) 10CI reject: [V:04-1] P:hardware::check: add 
profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [22:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10407645 (10phaultfinder) [22:34:14] (03PS12) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [22:34:57] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4704/co" [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [22:35:14] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [22:35:15] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [22:36:12] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [22:47:43] (03PS4) 10Cwhite: Profiler: centralize metrics send to a function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 [22:48:49] (03CR) 10Cwhite: Profiler: centralize metrics send to a function (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 (owner: 10Cwhite) [22:53:12] (03PS13) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [22:53:32] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 
10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [22:54:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:57:59] (03PS14) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [22:58:18] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [23:08:55] (03CR) 10BryanDavis: php: Allow provisioning MediaWiki with PHP 8.1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [23:15:12] FIRING: [2x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [23:16:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104398 (https://phabricator.wikimedia.org/T292322) (owner: 10Tim Starling) [23:17:20] (03Merged) 10jenkins-bot: Enable canShellboxGetTempUrl on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104398 (https://phabricator.wikimedia.org/T292322) (owner: 10Tim Starling) [23:17:35] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1104398|Enable canShellboxGetTempUrl on testwiki (T292322)]] [23:17:40] T292322: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 [23:23:11] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1104398|Enable canShellboxGetTempUrl on testwiki (T292322)]] synced to the 
testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:23:15] T292322: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 [23:24:08] !log tstarling@deploy2002 tstarling: Continuing with sync [23:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10407754 (10phaultfinder) [23:29:36] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1104398|Enable canShellboxGetTempUrl on testwiki (T292322)]] (duration: 12m 00s) [23:29:40] T292322: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 [23:40:54] (03CR) 10BryanDavis: [C:04-1] php: Allow provisioning MediaWiki with PHP 8.1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [23:44:31] (03PS1) 10Pppery: Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1104740 [23:48:40] (03PS15) 10BryanDavis: php: Allow provisioning MediaWiki with PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) [23:49:48] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10407786 (10phaultfinder) [23:54:29] (03CR) 10BryanDavis: php: Allow provisioning MediaWiki with PHP 8.1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis)