[00:06:57] (PS2) Scott French: Move low-traffic consumer latency alert to critical [alerts] - https://gerrit.wikimedia.org/r/1092906 (https://phabricator.wikimedia.org/T378609)
[00:14:30] (CR) Scott French: "Thanks in advance for the review, Reuven." [alerts] - https://gerrit.wikimedia.org/r/1092906 (https://phabricator.wikimedia.org/T378609) (owner: Scott French)
[00:15:29] !log [urbanecm@deploy2002 ~]$ mwscript-k8s -- extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php --wiki=azwiki --all --verbose # T380329
[00:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:33] T380329: az.wikipedia: Add Link tasks are completed while violating excludedSections configuration - https://phabricator.wikimedia.org/T380329
[00:16:19] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:19:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:19:43] (CR) Cwhite: [C:+1] "Given that we already filter for exported_cluster here, it makes sense to apply the same filters to the role_owner metric." [alerts] - https://gerrit.wikimedia.org/r/1093302 (https://phabricator.wikimedia.org/T374178) (owner: Tiziano Fogli)
[00:38:27] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1093473
[00:38:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1093473 (owner: TrainBranchBot)
[00:40:30] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for restbase2036.codfw.wmnet
[00:40:31] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2036.codfw.wmnet
[00:40:32] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for restbase2037.codfw.wmnet
[00:40:33] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2037.codfw.wmnet
[00:40:34] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for restbase2038.codfw.wmnet
[00:40:35] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2038.codfw.wmnet
[00:42:12] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2021.codfw.wmnet with reason: Decommissioning — T380236
[00:42:16] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2021.codfw.wmnet with reason: Decommissioning — T380236
[00:42:16] T380236: Refresh restbase202[1-3] w/ restbase203[6-8] - https://phabricator.wikimedia.org/T380236
[00:42:17] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2022.codfw.wmnet with reason: Decommissioning — T380236
[00:42:31] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2022.codfw.wmnet with reason: Decommissioning — T380236
[00:42:33] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2023.codfw.wmnet with reason: Decommissioning — T380236
[00:42:46] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2023.codfw.wmnet with reason: Decommissioning — T380236
[00:45:56] !log decommissioning Cassandra/restbase2021-{a,b,c} — T380236
[00:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:43] (CR) RLazarus: [C:+1] Move low-traffic consumer latency alert to critical (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1092906 (https://phabricator.wikimedia.org/T378609) (owner: Scott French)
[01:08:25] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1093481
[01:08:25] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1093481 (owner: TrainBranchBot)
[01:09:39] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1093473 (owner: TrainBranchBot)
[01:43:37] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1093481 (owner: TrainBranchBot)
[01:46:30] PROBLEM - Disk space on Hadoop worker on an-worker1090 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/m 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[02:16:54] (PS1) Bartosz Dziewoński: Remove temporary fix for badly set cookies [mediawiki-config] - https://gerrit.wikimedia.org/r/1093497
[02:17:11] (PS2) Bartosz Dziewoński: Remove temporary fix for badly set CentralAuth cookies [mediawiki-config] - https://gerrit.wikimedia.org/r/1093497
[02:30:49] !log Import libvmod-re2_2.0.0-2~bpo11u1 into varnish-staging apt component
[02:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:39:30] ops-codfw, SRE, collaboration-services, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T380341#10342669 (phaultfinder)
[03:44:25] ops-codfw, SRE, collaboration-services, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T380341#10342672 (phaultfinder)
[04:37:05] FIRING: [2x] ProbeDown: Service restbase2021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:00:31] RECOVERY - Host lsw1-c4-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms
[05:01:15] PROBLEM - BGP status on lsw1-c4-codfw.mgmt is CRITICAL: BGP CRITICAL - No response from remote host 10.193.1.233 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:01:25] PROBLEM - Juniper alarms on lsw1-c4-codfw.mgmt is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.193.1.233 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[05:06:55] PROBLEM - Host lsw1-c4-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[05:27:03] PROBLEM - Disk space on Hadoop worker on an-worker1112 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[05:28:47] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:28:55] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:28:55] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:36:39] PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100%
[05:39:37] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:39:45] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 08 Feb 2025 11:19:52 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:39:49] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:40:32] (CR) AikoChou: [C:+2] ml-services: update articlequality in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1092994 (https://phabricator.wikimedia.org/T374034) (owner: AikoChou)
[05:41:21] RECOVERY - Host ganeti2042 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms
[05:42:05] (Merged) jenkins-bot: ml-services: update articlequality in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1092994 (https://phabricator.wikimedia.org/T374034) (owner: AikoChou)
[05:51:06] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[06:24:02] (PS1) KartikMistry: Enable the Contribute menu in 4th group of Wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1093733 (https://phabricator.wikimedia.org/T375303)
[06:44:54] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 21 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1093733 (https://phabricator.wikimedia.org/T375303) (owner: KartikMistry)
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T0700)
[07:00:05] marostegui, Amir1, and arnaudb: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T0700).
[07:48:31] !log removing ganeti1017 from active Ganeti nodes T378921
[07:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:36] T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921
[07:49:14] SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10342853 (MoritzMuehlenhoff)
[07:50:55] PROBLEM - ganeti-noded running on ganeti1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[07:50:59] (PS1) Muehlenhoff: ganeti1017: Update site.pp [puppet] - https://gerrit.wikimedia.org/r/1093819
[07:51:41] PROBLEM - ganeti-confd running on ganeti1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[07:52:05] FIRING: [3x] ProbeDown: Service ganeti1017:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:58:49] (PS1) Slyngshede: API: Disable authentication for username API [software/bitu] - https://gerrit.wikimedia.org/r/1093822 (https://phabricator.wikimedia.org/T364605)
[07:59:55] again failed probes
[08:00:04] Amir1, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T0800).
[08:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:20] (CR) Muehlenhoff: [C:+2] ganeti1017: Update site.pp [puppet] - https://gerrit.wikimedia.org/r/1093819 (owner: Muehlenhoff)
[08:04:10] ah. I'm here.
[08:04:51] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1093733 (https://phabricator.wikimedia.org/T375303) (owner: KartikMistry)
[08:05:34] (Merged) jenkins-bot: Enable the Contribute menu in 4th group of Wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1093733 (https://phabricator.wikimedia.org/T375303) (owner: KartikMistry)
[08:05:54] (CR) Arnaudb: [C:+2] mariadb: db1246 temporary insetup [puppet] - https://gerrit.wikimedia.org/r/1093372 (https://phabricator.wikimedia.org/T374215) (owner: Arnaudb)
[08:06:55] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1093733|Enable the Contribute menu in 4th group of Wikis (T375303)]]
[08:06:59] T375303: Enable the Contribute menu in 4th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T375303
[08:09:59] !log kartik@deploy2002 kartik: Backport for [[gerrit:1093733|Enable the Contribute menu in 4th group of Wikis (T375303)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:14:00] !log kartik@deploy2002 kartik: Continuing with sync
[08:21:00] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093733|Enable the Contribute menu in 4th group of Wikis (T375303)]] (duration: 14m 05s)
[08:21:04] T375303: Enable the Contribute menu in 4th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T375303
[08:22:05] FIRING: [3x] ProbeDown: Service ganeti1017:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:29:10] RESOLVED: ProbeDown: Service ganeti1017:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:36:24] (CR) Muehlenhoff: [C:+1] "Agreed, that seems fine to expose w/o authentication" [software/bitu] - https://gerrit.wikimedia.org/r/1093822 (https://phabricator.wikimedia.org/T364605) (owner: Slyngshede)
[08:36:53] (CR) Slyngshede: [C:+2] API: Disable authentication for username API [software/bitu] - https://gerrit.wikimedia.org/r/1093822 (https://phabricator.wikimedia.org/T364605) (owner: Slyngshede)
[08:37:56] (CR) Muehlenhoff: [C:+2] idp-test: Set Envoy firewall config in nftables-compatible syntax [puppet] - https://gerrit.wikimedia.org/r/1093332 (owner: Muehlenhoff)
[08:39:24] (Merged) jenkins-bot: API: Disable authentication for username API [software/bitu] - https://gerrit.wikimedia.org/r/1093822 (https://phabricator.wikimedia.org/T364605) (owner: Slyngshede)
[08:45:19] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bookworm
[08:45:28] (CR) Muehlenhoff: [C:+1] peopleweb: limit envoy srange to CACHES and DEPLOYMENT_SERVERS [puppet] - https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[08:57:39] (PS1) Muehlenhoff: Rebuild against latest package versions in bookworm: [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1093852
[09:00:04] andre and brennen: That opportune time for a MediaWiki train - Utc-0+Utc-7 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T0900).
[09:00:38] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage
[09:03:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage
[09:04:08] (PS1) TrainBranchBot: group2 to 1.44.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1093854 (https://phabricator.wikimedia.org/T375663)
[09:04:10] (CR) TrainBranchBot: [C:+2] group2 to 1.44.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1093854 (https://phabricator.wikimedia.org/T375663) (owner: TrainBranchBot)
[09:05:05] (Merged) jenkins-bot: group2 to 1.44.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1093854 (https://phabricator.wikimedia.org/T375663) (owner: TrainBranchBot)
[09:07:43] !log installing exim4 security updates
[09:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:25] (CR) Elukey: [C:+1] Rebuild against latest package versions in bookworm: [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1093852 (owner: Muehlenhoff)
[09:16:16] (CR) Ayounsi: [C:+1] Netbox alerting: Add remaining netbox alert. [alerts] - https://gerrit.wikimedia.org/r/1092779 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:17:54] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.4 refs T375663
[09:17:59] T375663: 1.44.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T375663
[09:18:34] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1246.eqiad.wmnet with OS bookworm
[09:24:17] (CR) Muehlenhoff: [C:+2] Rebuild against latest package versions in bookworm: [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1093852 (owner: Muehlenhoff)
[09:30:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:35:52] !log installing nghttp2 security updates
[09:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:44] (PS6) Elukey: sre.hosts.{dhcp,reimage}: force tftp as default option [cookbooks] - https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576)
[09:47:30] (PS1) Raymond Ndibe: profile::manifests::toolforge::harbor: add s3 auth to harbor config [puppet] - https://gerrit.wikimedia.org/r/1093856 (https://phabricator.wikimedia.org/T350687)
[09:49:56] (CR) CI reject: [V:-1] profile::manifests::toolforge::harbor: add s3 auth to harbor config [puppet] - https://gerrit.wikimedia.org/r/1093856 (https://phabricator.wikimedia.org/T350687) (owner: Raymond Ndibe)
[09:50:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance
[09:50:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance
[09:50:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance
[09:50:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance
[09:51:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P71109 and previous config saved to /var/cache/conftool/dbconfig/20241121-095102-arnaudb.json
[09:51:06] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[09:53:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P71110 and previous config saved to /var/cache/conftool/dbconfig/20241121-095313-arnaudb.json
[09:54:19] jouncebot: nowandnext
[09:54:19] For the next 1 hour(s) and 5 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T0900)
[09:54:19] In 1 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T1100)
[09:59:12] !log restarting eventgate-main@codfw
[09:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:12] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync
[10:01:27] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync
[10:05:56] (CR) David Caro: profile::manifests::toolforge::harbor: add s3 auth to harbor config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1093856 (https://phabricator.wikimedia.org/T350687) (owner: Raymond Ndibe)
[10:06:47] (PS2) Raymond Ndibe: profile::manifests::toolforge::harbor: add s3 auth to harbor config [puppet] - https://gerrit.wikimedia.org/r/1093856 (https://phabricator.wikimedia.org/T350687)
[10:07:11] (PS3) Raymond Ndibe: profile::manifests::toolforge::harbor: add s3 auth to harbor config [puppet] - https://gerrit.wikimedia.org/r/1093856 (https://phabricator.wikimedia.org/T350687)
[10:08:12] (PS1) Muehlenhoff: thumbor: update service image to latest rebuild [deployment-charts] - https://gerrit.wikimedia.org/r/1093862
[10:08:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P71111 and previous config saved to /var/cache/conftool/dbconfig/20241121-100821-arnaudb.json
[10:12:55] (PS4) Raymond Ndibe: profile::manifests::toolforge::harbor: add s3 auth to harbor config [puppet] - https://gerrit.wikimedia.org/r/1093856 (https://phabricator.wikimedia.org/T350687)
[10:13:29] (CR) Btullis: wdqs: create wdqs-internal-[main,scholarly] roles (5 comments) [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[10:15:08] (CR) CI reject: [V:-1] profile::manifests::toolforge::harbor: add s3 auth to harbor config [puppet] - https://gerrit.wikimedia.org/r/1093856 (https://phabricator.wikimedia.org/T350687) (owner: Raymond Ndibe)
[10:15:27] SRE, Data-Persistence-Backup, media-backups, Patch-For-Review: Expand media backup storage available space to 960 TB per datacenter - https://phabricator.wikimedia.org/T376892#10343078 (jcrespo) 14 hours more for transfers to complete.
[10:17:02] (CR) Muehlenhoff: sre.hosts.{dhcp,reimage}: force tftp as default option (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) (owner: Elukey)
[10:18:19] (PS1) JMeybohm: Reclaim kubernetes100[78] as kubestage100[56] [puppet] - https://gerrit.wikimedia.org/r/1093864 (https://phabricator.wikimedia.org/T380043)
[10:18:38] SRE, Data-Persistence-Backup, media-backups, Patch-For-Review: Expand media backup storage available space to 960 TB per datacenter - https://phabricator.wikimedia.org/T376892#10343095 (jcrespo)
[10:18:58] (PS5) Raymond Ndibe: profile::manifests::toolforge::harbor: add s3 auth to harbor config [puppet] - https://gerrit.wikimedia.org/r/1093856 (https://phabricator.wikimedia.org/T350687)
[10:19:19] !log ayounsi@cumin1002 START - Cookbook sre.network.debug for Netbox circuit ID 102
[10:19:28] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 102
[10:23:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P71112 and previous config saved to /var/cache/conftool/dbconfig/20241121-102328-arnaudb.json
[10:24:37] (CR) Hnowlan: [C:+1] thumbor: update service image to latest rebuild [deployment-charts] - https://gerrit.wikimedia.org/r/1093862 (owner: Muehlenhoff)
[10:24:40] (CR) Raymond Ndibe: profile::manifests::toolforge::harbor: add s3 auth to harbor config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1093856 (https://phabricator.wikimedia.org/T350687) (owner: Raymond Ndibe)
[10:25:57] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be1005.eqiad.wmnet with OS bullseye
[10:28:48] (PS1) Urbanecm: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1093866 (https://phabricator.wikimedia.org/T380329)
[10:31:42] (CR) Urbanecm: [C:+2] linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1093866 (https://phabricator.wikimedia.org/T380329) (owner: Urbanecm)
[10:32:46] (Merged) jenkins-bot: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1093866 (https://phabricator.wikimedia.org/T380329) (owner: Urbanecm)
[10:32:53] (PS1) Arturo Borrero Gonzalez: openstack: designate: fix installation path for designate-sink plugin [puppet] - https://gerrit.wikimedia.org/r/1093867 (https://phabricator.wikimedia.org/T380208)
[10:33:14] jynus: fyi, https://phabricator.wikimedia.org/T380451
[10:33:27] !log urbanecm@deploy2002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply
[10:33:54] (CR) Clément Goubert: [C:+1] Reclaim kubernetes100[78] as kubestage100[56] [puppet] - https://gerrit.wikimedia.org/r/1093864 (https://phabricator.wikimedia.org/T380043) (owner: JMeybohm)
[10:33:58] PROBLEM - Kafka broker TLS certificate validity on kafka-main1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[10:34:06] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[10:34:07] PROBLEM - Kafka Broker Server #page on kafka-main1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[10:34:27] acked
[10:34:29] !log urbanecm@deploy2002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply
[10:34:38] ah sorry
[10:34:47] my bad, downtime expires
[10:34:48] expected maintenance, effie?
[10:34:52] effie: you broke the broker?
[10:34:55] ah, then all is good
[10:35:02] yes,
[10:35:11] :)
[10:35:20] XioNoX wins
[10:35:22] can you renew it to your liking ?
[10:35:28] XioNoX: I broke the broker lol yes
[10:35:41] jynus: yes, let me get to a pc
[10:35:54] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1093867 (https://phabricator.wikimedia.org/T380208) (owner: Arturo Borrero Gonzalez)
[10:36:43] !log urbanecm@deploy2002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply
[10:37:27] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1005.eqiad.wmnet with reason: host reimage
[10:38:11] !log urbanecm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply
[10:38:14] !log urbanecm@deploy2002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply
[10:38:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P71113 and previous config saved to /var/cache/conftool/dbconfig/20241121-103834-arnaudb.json
[10:38:39] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[10:38:59] broke and brokerer
[10:39:14] !log urbanecm@deploy2002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply
[10:39:30] (CR) Arturo Borrero Gonzalez: [C:+2] openstack: designate: fix installation path for designate-sink plugin [puppet] - https://gerrit.wikimedia.org/r/1093867 (https://phabricator.wikimedia.org/T380208) (owner: Arturo Borrero Gonzalez)
[10:40:41] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1007-1008].eqiad.wmnet
[10:41:03] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1005.eqiad.wmnet with reason: host reimage
[10:41:51] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1007-1008].eqiad.wmnet
[10:53:45] (PS2) JMeybohm: Reclaim kubernetes100[78] as kubestage100[56] 1/2 [puppet] - https://gerrit.wikimedia.org/r/1093864 (https://phabricator.wikimedia.org/T380043)
[10:53:45] (PS1) JMeybohm: Reclaim kubernetes100[78] as kubestage100[56] 2/2 [puppet] - https://gerrit.wikimedia.org/r/1093871 (https://phabricator.wikimedia.org/T380043)
[10:59:08] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002"
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T1100)
[11:00:51] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002"
[11:00:51] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1005.eqiad.wmnet with OS bullseye
[11:06:08] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453#10343273 (elukey) @Jclark-ctr all configured, the host has been reimaged and all the disks are shows up. @jhathaway Had to reimage only once, but the h...
[11:07:10] SRE, MediaWiki-Core-HTTP-Cache, MediaWiki-REST-API, MW-Interfaces-Team, and 2 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424#10343277 (MSantos)
[11:07:38] (PS1) Muehlenhoff: puppetboard: Restrict access to Envoy port [puppet] - https://gerrit.wikimedia.org/r/1093873
[11:08:56] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1093873 (owner: Muehlenhoff)
[11:18:20] (CR) Clément Goubert: [C:+1] Reclaim kubernetes100[78] as kubestage100[56] 1/2 [puppet] - https://gerrit.wikimedia.org/r/1093864 (https://phabricator.wikimedia.org/T380043) (owner: JMeybohm)
[11:18:25] (CR) Clément Goubert: [C:+1] Reclaim kubernetes100[78] as kubestage100[56] 2/2 [puppet] - https://gerrit.wikimedia.org/r/1093871 (https://phabricator.wikimedia.org/T380043) (owner: JMeybohm)
[11:21:13] SRE, iPoid-Service, Patch-For-Review: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10343339 (kostajh) >>! In T375006#10317118, @akosiaris wrote: > Finally, let me say that for the last week, the logstash dashboard says 133 errors. The service ser...
[11:26:28] (PS2) Clément Goubert: wikikube: Add wikikube-worker21[56-70] [puppet] - https://gerrit.wikimedia.org/r/1092816 (https://phabricator.wikimedia.org/T376966)
[11:32:40] (PS1) Giuseppe Lavagetto: aptrepo: add import for vopsbot [puppet] - https://gerrit.wikimedia.org/r/1093875
[11:32:40] jouncebot: nowandnext
[11:32:40] For the next 0 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T1100)
[11:32:40] In 1 hour(s) and 27 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T1300)
[11:36:21] (CR) Krinkle: [C:+1] Disable various extensions when using the shared login domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1092334 (https://phabricator.wikimedia.org/T373737) (owner: Gergő Tisza)
[11:39:58] ops-codfw, SRE, DC-Ops, serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10343404 (Clement_Goubert) I have the same issue on `wikikube-worker2157.codfw.wmnet`, the interface in netbox is `eno12409np1` but it has no link, whe...
[11:41:13] (03CR) 10Clément Goubert: [C:03+2] wikikube: Add wikikube-worker21[56-70] [puppet] - 10https://gerrit.wikimedia.org/r/1092816 (https://phabricator.wikimedia.org/T376966) (owner: 10Clément Goubert) [11:41:25] 06SRE, 10Ceph, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10343409 (10BTullis) [11:44:50] 06SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 06MW-Interfaces-Team, and 2 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424#10343419 (10MSantos) a:05DAlangi_WMF→03None [11:50:14] (03CR) 10Muehlenhoff: [C:03+2] thumbor: update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093862 (owner: 10Muehlenhoff) [11:56:01] !log jmm@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [11:56:12] !log jmm@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [11:58:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:00:17] <_joe_> uhm let me see [12:01:09] (03PS1) 10MVernon: thanos: add new backends to profile::thanos::swift::backends [puppet] - 10https://gerrit.wikimedia.org/r/1093884 (https://phabricator.wikimedia.org/T370452) [12:01:12] (03PS1) 10MVernon: thanos: storage schema for larger disks_by_path backends, add 2 [puppet] - 10https://gerrit.wikimedia.org/r/1093885 (https://phabricator.wikimedia.org/T370452) [12:02:03] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:02:11] 
PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:02:13] !log jmm@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [12:03:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:03:53] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:04:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:08:35] (03PS12) 10Hnowlan: mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) [12:09:14] !log jmm@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [12:09:28] (03CR) 10Hnowlan: "Adding rzl to CC because this comes close to existing mwscript stuff just as a heads-up. Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [12:09:46] !log jmm@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [12:12:33] (03PS1) 10Sergio Gimeno: ExperimentUserDefaultsManager: use read latest when retrieving central id [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093889 (https://phabricator.wikimedia.org/T379682) [12:12:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093889 (https://phabricator.wikimedia.org/T379682) (owner: 10Sergio Gimeno) [12:13:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2156.codfw.wmnet with OS bookworm [12:13:50] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2158.codfw.wmnet with OS bookworm [12:16:22] !log jmm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [12:16:30] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1093890 (owner: 10L10n-bot) [12:16:37] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2161.codfw.wmnet with OS bookworm [12:17:07] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2160.codfw.wmnet with OS bookworm [12:17:43] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2162.codfw.wmnet with OS bookworm [12:18:23] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2163.codfw.wmnet with OS bookworm [12:18:55] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2164.codfw.wmnet with OS bookworm [12:19:31] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2165.codfw.wmnet with OS bookworm [12:32:46] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2156.codfw.wmnet with reason: host reimage [12:32:49] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2158.codfw.wmnet with reason: host reimage [12:35:49] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2161.codfw.wmnet with reason: host reimage [12:36:01] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2160.codfw.wmnet with reason: host reimage [12:36:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2156.codfw.wmnet with reason: host reimage [12:36:47] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2162.codfw.wmnet with reason: host reimage [12:37:31] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2163.codfw.wmnet with reason: host reimage [12:37:50] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
wikikube-worker2164.codfw.wmnet with reason: host reimage [12:38:40] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2165.codfw.wmnet with reason: host reimage [12:39:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2161.codfw.wmnet with reason: host reimage [12:42:50] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2158.codfw.wmnet with reason: host reimage [12:45:35] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4562/co" [puppet] - 10https://gerrit.wikimedia.org/r/1093856 (https://phabricator.wikimedia.org/T350687) (owner: 10Raymond Ndibe) [12:46:26] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2163.codfw.wmnet with reason: host reimage [12:47:24] (03CR) 10David Caro: [V:03+1] "LGTM, you can try cherry-picking this manually in toolsbeta, and setting the config on hiera (horizon)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1093856 (https://phabricator.wikimedia.org/T350687) (owner: 10Raymond Ndibe) [12:49:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2165.codfw.wmnet with reason: host reimage [12:51:35] (03PS1) 10EoghanGaffney: vrts: Block bondedsender RBL check from spamassassin on vrts [puppet] - 10https://gerrit.wikimedia.org/r/1093905 (https://phabricator.wikimedia.org/T380396) [12:52:11] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2162.codfw.wmnet with reason: host reimage [12:55:11] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2164.codfw.wmnet with reason: host reimage [12:55:27] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1093905 (https://phabricator.wikimedia.org/T380396) (owner: 10EoghanGaffney) [12:55:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2156.codfw.wmnet with OS bookworm [12:58:37] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2160.codfw.wmnet with reason: host reimage [12:58:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2161.codfw.wmnet with OS bookworm [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T1300) [13:02:28] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2158.codfw.wmnet with OS bookworm [13:05:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2163.codfw.wmnet with OS bookworm 
[13:10:27] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2165.codfw.wmnet with OS bookworm [13:10:50] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp5018*} and A:cp for 9.2.6-1wm2 [13:11:53] (03CR) 10JMeybohm: [C:03+2] Reclaim kubernetes100[78] as kubestage100[56] 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/1093864 (https://phabricator.wikimedia.org/T380043) (owner: 10JMeybohm) [13:11:55] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2162.codfw.wmnet with OS bookworm [13:14:17] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp5018*} and A:cp for 9.2.6-1wm2 [13:14:28] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp5026*} and A:cp for 9.2.6-1wm2 [13:14:29] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2164.codfw.wmnet with OS bookworm [13:16:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:17:05] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:17:25] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from kubernetes1007 to kubestage1005 [13:17:48] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [13:17:51] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:18:13] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on 
P{cp5026*} and A:cp for 9.2.6-1wm2 [13:18:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2160.codfw.wmnet with OS bookworm [13:18:51] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.82 ms [13:18:57] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.67 ms [13:21:18] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1007 to kubestage1005 - jayme@cumin2002" [13:22:51] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1007 to kubestage1005 - jayme@cumin2002" [13:22:51] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:22:52] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host kubestage1005 [13:24:15] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubestage1005 [13:24:55] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1007 to kubestage1005 [13:25:46] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from kubernetes1008 to kubestage1006 [13:25:57] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [13:27:57] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host kubestage1005.eqiad.wmnet with OS bookworm [13:30:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:30:57] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1008 to kubestage1006 - 
jayme@cumin2002" [13:31:15] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1008 to kubestage1006 - jayme@cumin2002" [13:31:15] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:31:16] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host kubestage1006 [13:31:27] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:31:41] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:32:29] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubestage1006 [13:33:10] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1008 to kubestage1006 [13:34:30] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host kubestage1006.eqiad.wmnet with OS bookworm [13:34:51] (03CR) 10JMeybohm: [C:03+2] Reclaim kubernetes100[78] as kubestage100[56] 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/1093871 (https://phabricator.wikimedia.org/T380043) (owner: 10JMeybohm) [13:39:53] cr*-eqiad is me [13:40:55] FIRING: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [13:42:07] PROBLEM - Check size of conntrack table on krb1001 is CRITICAL: CRITICAL: nf_conntrack is 100 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [13:44:16] (03CR) 10Brouberol: [C:03+1] Update spark shufflers on 
the test cluster to deploy version 3.5 [puppet] - 10https://gerrit.wikimedia.org/r/1093394 (https://phabricator.wikimedia.org/T380040) (owner: 10Btullis) [13:44:31] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1005.eqiad.wmnet with reason: host reimage [13:45:45] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 666, down: 11, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:46:03] RECOVERY - Check size of conntrack table on krb1001 is OK: OK: nf_conntrack is 44 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [13:47:31] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1005.eqiad.wmnet with reason: host reimage [13:47:47] and there was increased krb1001 activity in the last 15 minutes [13:50:55] RESOLVED: MaxConntrack: Max conntrack at 91.16% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [13:51:05] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1006.eqiad.wmnet with reason: host reimage [13:51:18] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 709, down: 17, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:51:58] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:51:58] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:53:05] (03PS6) 10Arnaudb: mariadb: add instance metric polling [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091190 
(https://phabricator.wikimedia.org/T376596) [13:54:21] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1006.eqiad.wmnet with reason: host reimage [13:57:01] (03CR) 10CI reject: [V:04-1] mariadb: add instance metric polling [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091190 (https://phabricator.wikimedia.org/T376596) (owner: 10Arnaudb) [13:57:02] (03PS7) 10Arnaudb: mariadb: add instance metric polling [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091190 (https://phabricator.wikimedia.org/T376596) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T1400). [14:00:05] EggRoll97 and sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:16] Here. [14:00:19] hi [14:02:14] (03PS1) 10Isabelle Hurbain-Palatin: push-notifications: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093919 (https://phabricator.wikimedia.org/T379647) [14:03:24] (03PS8) 10Arnaudb: mariadb: add instance metric polling [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091190 (https://phabricator.wikimedia.org/T376596) [14:03:34] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2166.codfw.wmnet with OS bookworm [14:03:56] @EggRoll97 do you want to self-deploy or should I start with your change? [14:04:06] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2167.codfw.wmnet with OS bookworm [14:04:08] Could you? I'm fairly new at this whole thing. 
[14:04:29] sure, I can deploy [14:04:34] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2168.codfw.wmnet with OS bookworm [14:05:04] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2169.codfw.wmnet with OS bookworm [14:05:35] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2170.codfw.wmnet with OS bookworm [14:06:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092956 (https://phabricator.wikimedia.org/T380332) (owner: 10EggRoll97) [14:06:43] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage1005.eqiad.wmnet with OS bookworm [14:07:05] (03Merged) 10jenkins-bot: enwiki: Add abusefilter-access-protected-vars to EFH/EFM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092956 (https://phabricator.wikimedia.org/T380332) (owner: 10EggRoll97) [14:07:22] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1092956|enwiki: Add abusefilter-access-protected-vars to EFH/EFM (T380332)]] [14:07:26] T380332: Add abusefilter-access-protected-vars to enwiki EFM/EFH - https://phabricator.wikimedia.org/T380332 [14:11:15] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage1006.eqiad.wmnet with OS bookworm [14:11:37] !log sgimeno@deploy2002 eggroll97, sgimeno: Backport for [[gerrit:1092956|enwiki: Add abusefilter-access-protected-vars to EFH/EFM (T380332)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:11:55] EggRoll97: can you test please [14:12:14] yep [14:13:40] sergi0 lgtm [14:14:10] !log sgimeno@deploy2002 eggroll97, sgimeno: Continuing with sync [14:14:43] (03CR) 10Sergio Gimeno: [C:03+2] ExperimentUserDefaultsManager: use read latest when retrieving central id [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 
10https://gerrit.wikimedia.org/r/1093889 (https://phabricator.wikimedia.org/T379682) (owner: 10Sergio Gimeno) [14:16:59] (03CR) 10Ayounsi: Potential script to assign fr-tech server IPs and switch ports (0311 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) (owner: 10Cathal Mooney) [14:18:02] o/ [14:18:04] sorry, I was in a meeting [14:19:27] (03CR) 10Dbrant: [C:03+2] push-notifications: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093919 (https://phabricator.wikimedia.org/T379647) (owner: 10Isabelle Hurbain-Palatin) [14:20:32] (03Merged) 10jenkins-bot: push-notifications: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093919 (https://phabricator.wikimedia.org/T379647) (owner: 10Isabelle Hurbain-Palatin) [14:21:12] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1092956|enwiki: Add abusefilter-access-protected-vars to EFH/EFM (T380332)]] (duration: 13m 50s) [14:21:17] RECOVERY - Disk space on Hadoop worker on an-worker1109 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:21:23] T380332: Add abusefilter-access-protected-vars to enwiki EFM/EFH - https://phabricator.wikimedia.org/T380332 [14:21:31] EggRoll97: your change is synced [14:21:38] much appreciated [14:22:00] Lucas_WMDE: no worries, I'm waiting CI for my change [14:22:04] ok :) [14:22:15] I assume you’ll self-service ^^ [14:22:29] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2166.codfw.wmnet with reason: host reimage [14:23:19] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2167.codfw.wmnet with reason: host reimage [14:23:39] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2168.codfw.wmnet with reason: host reimage [14:24:12] !log cgoubert@cumin1002 START 
- Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2169.codfw.wmnet with reason: host reimage [14:24:49] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2170.codfw.wmnet with reason: host reimage [14:25:11] !log ihurbain@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [14:25:27] (03CR) 10AOkoth: [C:03+1] vrts: Block bondedsender RBL check from spamassassin on vrts [puppet] - 10https://gerrit.wikimedia.org/r/1093905 (https://phabricator.wikimedia.org/T380396) (owner: 10EoghanGaffney) [14:25:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2166.codfw.wmnet with reason: host reimage [14:25:50] !log ihurbain@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [14:26:51] (03CR) 10Stevemunene: [C:03+1] airflow-analytics-test: deploy the scheduler and kerberos components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093177 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol) [14:28:18] Lucas_WMDE: correct [14:28:29] 👍 [14:28:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2167.codfw.wmnet with reason: host reimage [14:31:07] (03CR) 10Andrew Bogott: "I am scared to merge this :)" [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [14:31:16] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2170.codfw.wmnet with reason: host reimage [14:33:43] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2168.codfw.wmnet with reason: host reimage [14:34:18] PROBLEM - Kafka Broker Server #page on kafka-main1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties 
https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:34:31] (03Merged) 10jenkins-bot: ExperimentUserDefaultsManager: use read latest when retrieving central id [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093889 (https://phabricator.wikimedia.org/T379682) (owner: 10Sergio Gimeno) [14:34:41] effie ? [14:35:18] (03CR) 10Ssingh: "Please don't merge this yet :)" [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [14:35:20] Backporting my change now [14:35:22] I hadn't resolved it because I was waiting on downtime to kick in [14:35:39] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1093889|ExperimentUserDefaultsManager: use read latest when retrieving central id (T379682)]] [14:35:43] T379682: Growth KPI Grafana dashboard claims no A/B testing happens at pilot wikis - https://phabricator.wikimedia.org/T379682 [14:36:39] (03CR) 10CDanis: [C:03+2] haproxy+requestctl: enable in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1092914 (https://phabricator.wikimedia.org/T370745) (owner: 10CDanis) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:48] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10344103 (10Papaul) @Clement_Goubert for 2140 you can re-image nothing else needs to be done on our end. I also update 2157. Let me know if you have any... 
[14:36:50] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2169.codfw.wmnet with reason: host reimage [14:38:19] (03CR) 10Ssingh: "To be clear, I don't mean all 2360 of course but some subset of those but most certainly within that subset." [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [14:38:37] claime: can we downtime the kafka-main1001 hosts, effie is not around. For how long? [14:39:07] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2140.codfw.wmnet with OS bookworm [14:39:10] FIRING: ProbeDown: Service restbase2021-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase2021-b:7000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:39:26] jynus: yeah go ahead and give it a week. 
[14:39:32] it'll be decom'd afaiu [14:39:43] ok, XioNoX giving it a week and resolving it manually [14:40:24] it is not a biggie but I didn't want to touch without at least a team member oking it [14:41:15] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1093889|ExperimentUserDefaultsManager: use read latest when retrieving central id (T379682)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:41:25] T379682: Growth KPI Grafana dashboard claims no A/B testing happens at pilot wikis - https://phabricator.wikimedia.org/T379682 [14:42:05] FIRING: [2x] ProbeDown: Service restbase2021-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:43:18] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2157.codfw.wmnet with OS bookworm [14:43:28] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on kafka-main1001.eqiad.wmnet with reason: Per claime's recommendation [14:43:42] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on kafka-main1001.eqiad.wmnet with reason: Per claime's recommendation [14:43:44] x) [14:46:51] (03PS1) 10Abijeet Patro: Fix layout broken by display:flex on HorizontalLayout [extensions/ContentTranslation] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093927 (https://phabricator.wikimedia.org/T380471) [14:47:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2166.codfw.wmnet with OS bookworm [14:47:48] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2140.codfw.wmnet with OS bookworm [14:48:46] !log sgimeno@deploy2002 Sync cancelled. 
[14:49:05] Seeing a bunch of not nice warnings from rdbms, aborting [14:49:09] ouch :( [14:49:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2167.codfw.wmnet with OS bookworm [14:49:21] log UTC afternoon deploys done [14:49:55] sergi0: I believe that’s missing a ! ^^ [14:50:18] !log UTC afternoon deploys done [14:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:32] Lucas_WMDE: my clipboard ate it :) [14:50:35] hehe [14:50:48] -bash: !log: event not found [14:51:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2170.codfw.wmnet with OS bookworm [14:52:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090968 (https://phabricator.wikimedia.org/T376597) (owner: 10Kgraessle) [14:53:03] 07sre-alert-triage, 10SRE Observability (FY2024/2025-Q2): Alert in need of triage: JobUnavailable - https://phabricator.wikimedia.org/T380022#10344151 (10tappof) The thumbor nodes have been deleted, but they are still listed in the Prometheus configuration, which is why they are triggering the alerts. The del... [14:53:19] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2168.codfw.wmnet with OS bookworm [14:53:48] sergi0: I think you also need to upload+merge a revert of that commit on wmf.4, otherwise it’ll be included in future deploys [14:53:53] (apologies if you’re already doing it) [14:54:36] !log gmodena@deploy2002 Started deploy [analytics/refinery@358ccf5]: Ad-hoc deployment [analytics/refinery@358ccf55] [14:55:00] oh no, I was missing that, is that what scap backport --revert does? [14:55:16] maybe? 
[14:55:25] but I’m not sure if that’s necessary if the original backport didn’t complete [14:55:34] jouncebot: next [14:55:34] In 1 hour(s) and 4 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T1600) [14:55:42] I guess you might as well do it, there’s some time anyway [14:55:42] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2140.codfw.wmnet with OS bookworm [14:55:43] oh, gotya, I'll do it manually, thanks for pointing [14:56:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2169.codfw.wmnet with OS bookworm [14:56:02] do you know if scap reverted the change on the test / debug servers or if it’s still there? [14:56:10] (I’m hoping it said something about it ^^) [14:56:39] No, just Sync cancelled [14:56:39] We've UBN change 1093927 to deploy.. [14:57:01] hm [14:57:13] I would say, upload and merge the revert first [14:57:15] directly on Gerrit [14:57:22] (03PS1) 10Sergio Gimeno: Revert "ExperimentUserDefaultsManager: use read latest when retrieving central id" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093928 [14:57:23] and then kart_’s deploy will take care of resetting the mwdebug servers [14:57:33] (03PS2) 10Sergio Gimeno: Revert "ExperimentUserDefaultsManager: use read latest when retrieving central id" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093928 [14:57:38] open question: should we force-merge the revert to speed up that UBN change [14:57:41] or let it go through gate-and-submit [14:58:00] I’m tempted to force-merge it but that should not be done lightly [14:58:19] RECOVERY - Disk space on Hadoop worker on an-worker1143 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:59:08] (03CR) 10Sergio Gimeno: [C:03+2] Revert "ExperimentUserDefaultsManager: use read latest when 
retrieving central id" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093928 (owner: 10Sergio Gimeno) [15:00:02] Lucas_WMDE: I don't know the answer, not aware of all the implications of force-merge here, neither how to do it :/ [15:00:04] PROBLEM - Host wikikube-worker2140 is DOWN: PING CRITICAL - Packet loss = 100% [15:00:15] let’s be on the safe side and not do it then [15:00:41] sergi0: Let me know if you're deploying. I'll start my scap in sometime (ie 15-20 minutes) [15:00:43] (03PS2) 10Lucas Werkmeister (WMDE): Fix layout broken by display:flex on HorizontalLayout [extensions/ContentTranslation] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093927 (https://phabricator.wikimedia.org/T380471) (owner: 10Abijeet Patro) [15:00:46] RECOVERY - Disk space on Hadoop worker on analytics1076 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [15:01:30] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "Should be good to deploy, but only after I1d56e03c29 (we don’t want the code being reverted there, with the warnings, to roll out). I adde" [extensions/ContentTranslation] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093927 (https://phabricator.wikimedia.org/T380471) (owner: 10Abijeet Patro) [15:01:32] kart_: yes I am, I will ping you when done [15:01:45] kart_: ^ I kicked off the gate-and-submit already [15:01:53] but with Depends-On to make sure the revert goes in first [15:01:59] I think you could already start running the scap backport now [15:02:04] Lucas_WMDE: nice. Thanks! [15:02:23] but now sergi0 is deploying it - can I run scap? [15:02:47] I think sergi0 isn’t deploying anything, right? 
only merging, in preparation for your deployment [15:03:04] oh sorry, I meant I will do it, now, still 15min ETA in CI [15:03:13] *no [15:03:29] now = no [15:03:35] jynus: I promise you, I believed I downtimed the thing for 10 days, I have no idea why it is back [15:04:23] no worries, maybe it failed [15:04:28] I am not near a laptop, if anyone who speaks alertmanager could have a go, I would appreciate it a lot [15:04:38] I do it [15:04:51] I did it already [15:04:55] ah thanks! [15:05:06] we are oncall precisely for that - I prefer false alarms to real alarms :-D [15:05:37] I only asked your team so I didn't do something wrong [15:05:37] Lucas_WMDE: OK. Then, I'll go ahead. [15:05:49] (03CR) 10Gergő Tisza: [C:03+1] Remove temporary fix for badly set CentralAuth cookies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093497 (owner: 10Bartosz Dziewoński) [15:06:20] !log gmodena@deploy2002 Finished deploy [analytics/refinery@358ccf5]: Ad-hoc deployment [analytics/refinery@358ccf55] (duration: 11m 44s) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:18] RECOVERY - Disk space on Hadoop worker on an-worker1112 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [15:07:20] sergi0: I guess you also need to revert the change on master btw? [15:07:23] Lucas_WMDE: `Change '1093927' has dependencies '[1093928]', which are not merged or scheduled for backport` [15:07:25] (or follow up in a different way before the branch cut) [15:07:40] kart_: try listing both changes in the scap backport command?
[15:07:49] then we also have a record in SAL that the revert is getting deployed [15:08:48] cool [15:08:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093927 (https://phabricator.wikimedia.org/T380471) (owner: 10Abijeet Patro) [15:08:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093928 (owner: 10Sergio Gimeno) [15:08:55] yay [15:09:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 4 others: Reimage one of the wikikube-worker1240 to wikikube-worker1304 node in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10344210 (10VRiley-WMF) A 10G connection has been placed into port 25 and... [15:09:26] Lucas_WMDE: Indeed, a revert for now [15:09:31] ok [15:10:40] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2021.codfw.wmnet with reason: Decommissioning — T380236 [15:10:43] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2021.codfw.wmnet with reason: Decommissioning — T380236 [15:11:19] T380236: Refresh restbase202[1-3] w/ restbase203[6-8] - https://phabricator.wikimedia.org/T380236 [15:11:31] thanks urandom [15:12:46] RECOVERY - Disk space on Hadoop worker on an-worker1090 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [15:15:02] RECOVERY - Host wikikube-worker2140 is UP: PING OK - Packet loss = 0%, RTA = 30.30 ms [15:15:53] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2140.codfw.wmnet with OS bookworm [15:16:15] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2140.codfw.wmnet with OS bookworm [15:16:51] 
07sre-alert-triage, 10SRE Observability (FY2024/2025-Q2): Alert in need of triage: JobUnavailable - https://phabricator.wikimedia.org/T380022#10344256 (10tappof) 05Open→03Resolved [15:16:57] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T380478 (10phaultfinder) 03NEW [15:17:05] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T380480 (10phaultfinder) 03NEW [15:17:13] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T380482 (10phaultfinder) 03NEW [15:17:17] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T380481 (10phaultfinder) 03NEW [15:17:46] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T380482#10344297 (10phaultfinder) [15:18:09] (03PS2) 10Ssingh: hiera: set do_ipv6_primary_ra for all LVS in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1091243 (https://phabricator.wikimedia.org/T358260) [15:18:34] PROBLEM - Host wikikube-worker2140 is DOWN: PING CRITICAL - Packet loss = 100% [15:19:45] (03CR) 10Ssingh: [C:03+2] hiera: set do_ipv6_primary_ra for all LVS in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1091243 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [15:20:20] RECOVERY - Host wikikube-worker2140 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms [15:23:22] Lucas_WMDE: bad. some CI failure with GrowthExperiemt patch! 
[15:23:23] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2140.codfw.wmnet with OS bookworm [15:23:36] nooo :( [15:23:42] (oh that's on master) [15:23:46] (03Merged) 10jenkins-bot: Revert "ExperimentUserDefaultsManager: use read latest when retrieving central id" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093928 (owner: 10Sergio Gimeno) [15:23:47] (03Merged) 10jenkins-bot: Fix layout broken by display:flex on HorizontalLayout [extensions/ContentTranslation] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093927 (https://phabricator.wikimedia.org/T380471) (owner: 10Abijeet Patro) [15:23:48] ah phew ^^ [15:23:53] yay it just merged! [15:23:59] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2140.codfw.wmnet with OS bookworm [15:24:06] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1093927|Fix layout broken by display:flex on HorizontalLayout (T380471)]], [[gerrit:1093928|Revert "ExperimentUserDefaultsManager: use read latest when retrieving central id"]] [15:24:10] T380471: Content Translation looks broken on desktop in multiple languages (November 21, 2024) - https://phabricator.wikimedia.org/T380471 [15:24:36] !log gmodena@deploy2002 Started deploy [analytics/refinery@358ccf5] (thin): Ad-hoc deployment THIN [analytics/refinery@358ccf55] [15:24:59] !log stop pybal on lvs2013 to confirm changes in CR 1091243 [15:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:45] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@6183645]: increase driver memory for mjolnir feature selection [15:25:57] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2013.codfw.wmnet with reason: rebooting [15:26:10] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2013.codfw.wmnet with reason: rebooting [15:26:16] 
!log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@6183645]: increase driver memory for mjolnir feature selection (duration: 00m 31s) [15:27:36] !log ihurbain@deploy2002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [15:28:14] !log ihurbain@deploy2002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [15:28:54] !log ihurbain@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [15:29:10] !log kartik@deploy2002 abi, sgimeno, kartik: Backport for [[gerrit:1093927|Fix layout broken by display:flex on HorizontalLayout (T380471)]], [[gerrit:1093928|Revert "ExperimentUserDefaultsManager: use read latest when retrieving central id"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:29:19] PROBLEM - BGP status on lsw1-c2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:29:24] T380471: Content Translation looks broken on desktop in multiple languages (November 21, 2024) - https://phabricator.wikimedia.org/T380471 [15:29:25] sergi0: you want to test your patch on mwdebug? 
:) [15:29:40] doing [15:29:41] !log ihurbain@deploy2002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [15:29:53] !log gmodena@deploy2002 Finished deploy [analytics/refinery@358ccf5] (thin): Ad-hoc deployment THIN [analytics/refinery@358ccf55] (duration: 05m 16s) [15:30:13] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487 (10SuzanneWood-WMDE) 03NEW [15:30:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:31:14] !log gmodena@deploy2002 Started deploy [analytics/refinery@358ccf5] (hadoop-test): Ad-hoc deployment TEST [analytics/refinery@358ccf55] [15:31:15] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487#10344403 (10SuzanneWood-WMDE) @WMDECyn could you please approve? [15:31:45] kart_: lgtm, maxConn warnings are gone [15:32:04] Nice! [15:32:25] I'm testing my patch.. few minutes.. [15:33:01] !log kartik@deploy2002 abi, sgimeno, kartik: Continuing with sync [15:33:31] RECOVERY - Disk space on Hadoop worker on an-worker1087 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [15:34:44] !log gmodena@deploy2002 Finished deploy [analytics/refinery@358ccf5] (hadoop-test): Ad-hoc deployment TEST [analytics/refinery@358ccf55] (duration: 03m 30s) [15:37:33] I need to leave in 5', kart_ are you ok handling the GE patch from here? [15:38:11] Yes. Both patches are being deployed together.. [15:39:09] Great, ty for the assistance! [15:39:37] Lucas_WMDE: thank you for the insights! 
[15:39:50] (03PS1) 10Elukey: Add the mapnik image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1093935 (https://phabricator.wikimedia.org/T327396) [15:39:51] np :) [15:39:58] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093927|Fix layout broken by display:flex on HorizontalLayout (T380471)]], [[gerrit:1093928|Revert "ExperimentUserDefaultsManager: use read latest when retrieving central id"]] (duration: 15m 51s) [15:40:24] sergi0: done. You can verify if you're around! [15:40:26] T380471: Content Translation looks broken on desktop in multiple languages (November 21, 2024) - https://phabricator.wikimedia.org/T380471 [15:40:34] (03CR) 10Elukey: "Locally built with docker-pkg, all good." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1093935 (https://phabricator.wikimedia.org/T327396) (owner: 10Elukey) [15:40:38] and Thanks Lucas_WMDE for all help. [15:42:21] RECOVERY - BGP status on lsw1-c2-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:47:07] RECOVERY - Host ps1-c4-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms [15:47:11] RECOVERY - Host lsw1-c4-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.60 ms [15:47:35] RECOVERY - BGP status on lsw1-c4-codfw.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:47:35] RECOVERY - Juniper alarms on lsw1-c4-codfw.mgmt is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:48:25] PROBLEM - Host lsw1-c3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:48:41] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:49:39] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T380341#10344558 (10phaultfinder) [15:51:04] (03CR) 10Scott French: 
"Thanks, Reuven!" [alerts] - 10https://gerrit.wikimedia.org/r/1092906 (https://phabricator.wikimedia.org/T378609) (owner: 10Scott French) [15:51:08] (03CR) 10Scott French: [C:03+2] Move low-traffic consumer latency alert to critical [alerts] - 10https://gerrit.wikimedia.org/r/1092906 (https://phabricator.wikimedia.org/T378609) (owner: 10Scott French) [15:51:23] PROBLEM - BGP status on lsw1-c2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:51:25] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [15:52:11] (03PS2) 10Elukey: Add the mapnik image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1093935 (https://phabricator.wikimedia.org/T327396) [15:52:55] (03Merged) 10jenkins-bot: Move low-traffic consumer latency alert to critical [alerts] - 10https://gerrit.wikimedia.org/r/1092906 (https://phabricator.wikimedia.org/T378609) (owner: 10Scott French) [15:53:53] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:59:02] RECOVERY - Host ps1-c3-codfw is UP: PING WARNING - Packet loss = 77%, RTA = 31.12 ms [15:59:08] RECOVERY - Host lsw1-c3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.63 ms [16:00:04] PROBLEM - Host wikikube-worker2140 is DOWN: PING CRITICAL - Packet loss = 100% [16:00:04] andre and brennen: May I have your attention please! Train log triage. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T1600) [16:00:50] !log dancy@deploy2002 Installing scap version "4.127.0" for 209 hosts [16:03:08] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2140.codfw.wmnet with OS bookworm [16:03:11] (03PS1) 10CDanis: Move x-requestctl ingress scrub from Varnish to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1093940 (https://phabricator.wikimedia.org/T370745) [16:03:55] (03PS2) 10CDanis: Move x-requestctl ingress scrub from Varnish to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1093940 (https://phabricator.wikimedia.org/T370745) [16:03:58] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093940 (https://phabricator.wikimedia.org/T370745) (owner: 10CDanis) [16:04:04] RECOVERY - Host wikikube-worker2140 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms [16:04:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2140.codfw.wmnet with OS bookworm [16:05:52] !log dancy@deploy2002 Started scap sync-world: testing [16:08:54] !log dancy@deploy2002 Finished scap sync-world: testing (duration: 03m 01s) [16:11:55] btw, I made myself a cute lil command to display the currently held scap locks (without actually locking them like `scap lock --all` does), if anyone’s interested :3 [16:11:57] `jq --exit-status 'select(keys | length > 0) | ({"lock": input_filename} + .)' /var/lock/scap*; if [[ $? == 4 ]]; then printf '%s\n' 'No locks currently held.'; fi` [16:12:54] Feel free to make a modification to scap to make this a built-in behavior. 
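(Editor's sketch of the jq one-liner above, in Python. It assumes, as the jq filter implies, that each file matching /var/lock/scap* holds a JSON object and that an empty object means the lock is not held. The function name `held_locks` is hypothetical and not part of scap.)

```python
import glob
import json


def held_locks(pattern="/var/lock/scap*"):
    """Return one dict per lock file that contains a non-empty JSON object.

    Mirrors the jq filter: keep objects with at least one key and
    add the file name under a "lock" key.
    """
    held = []
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path, encoding="utf-8") as fh:
                data = json.load(fh)
        except (OSError, ValueError):
            # Unreadable or not valid JSON: skip it (jq would report an error here).
            continue
        if isinstance(data, dict) and data:
            # Keys from the file win on conflict, like `{"lock": ...} + .` in jq.
            held.append({"lock": path, **data})
    return held


if __name__ == "__main__":
    locks = held_locks()
    for lock in locks:
        print(json.dumps(lock, indent=2))
    if not locks:
        print("No locks currently held.")
```

(Like the one-liner, this only reads the lock files and does not take any lock itself.)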
[16:13:07] heh, maybe I will [16:13:10] `scap lock --list` or something [16:13:21] it was easier to build it up over time and just pull it from my shell history ^^ [16:13:35] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cluster=dnsbox,dc=magru [reason: testing] [16:14:57] ^ this was a wrong command, intentionally :P [16:16:41] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499 (10RobH) 03NEW [16:17:30] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10344684 (10RobH) [16:18:58] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10344687 (10RobH) a:03Andrew Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving... [16:20:05] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2157.codfw.wmnet with OS bookworm [16:21:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2157.codfw.wmnet with OS bookworm [16:22:13] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T380341#10344715 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm might have been caused by fixing a similar issue in C3. reterminated both ends of the patch for C4. Caused C3 to go down... 
[16:22:23] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10344754 (10Jhancock.wm) [16:23:05] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2140.codfw.wmnet with reason: host reimage [16:23:49] (03CR) 10Máté Szabó: "Removing -2 as we've introduced a config variable to control instrumentation in the ReportIncident extension." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093389 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó) [16:26:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2140.codfw.wmnet with reason: host reimage [16:29:10] FIRING: [2x] ProbeDown: Service restbase2021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:32:30] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#10344803 (10thcipriani) [16:32:35] 10ops-eqiad, 10Cloud-Services, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503 (10cmooney) 03NEW p:05Triage→03High The #Cloud-Services project tag is not intended to have any... 
[16:35:42] (03PS1) 10Jelto: gitlab: refactor check for ssh-gitlab in restore script [puppet] - 10https://gerrit.wikimedia.org/r/1093948 (https://phabricator.wikimedia.org/T380476) [16:35:55] 10ops-eqiad, 10Cloud-Services, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10344824 (10JJMC89) [16:39:23] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2157.codfw.wmnet with reason: host reimage [16:42:05] 10ops-eqiad, 10Cloud-Services, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10344918 (10dcaro) [16:43:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2157.codfw.wmnet with reason: host reimage [16:43:45] !log rebooting drained lvs2013 [16:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:54] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs2013.codfw.wmnet [16:46:22] (03CR) 10Fabfur: [C:03+1] Move x-requestctl ingress scrub from Varnish to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1093940 (https://phabricator.wikimedia.org/T370745) (owner: 10CDanis) [16:46:24] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cgoubert@cumin1002" [16:46:38] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2013.codfw.wmnet [16:46:38] PROBLEM - Host lvs2013 is DOWN: PING CRITICAL - Packet loss = 100% [16:46:40] RECOVERY - Host lvs2013 is UP: PING OK - Packet loss = 0%, RTA = 30.38 ms [16:46:45] this is ok ^ [16:46:49] not the meme one, but actually ok [16:46:59] 06SRE: Wikitech and Mediawiki sign in errors - 
https://phabricator.wikimedia.org/T380506 (10JLam-WMF) 03NEW [16:47:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cgoubert@cumin1002" [16:47:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2140.codfw.wmnet with OS bookworm [16:47:30] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [16:47:32] PROBLEM - pybal on lvs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:47:52] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2013.codfw.wmnet with reason: rebooting [16:48:06] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2013.codfw.wmnet with reason: rebooting [16:51:30] 06SRE: Wikitech and Mediawiki sign in errors - https://phabricator.wikimedia.org/T380506#10345009 (10JLam-WMF) [16:52:38] 06SRE: Wikitech and Mediawiki sign in errors - https://phabricator.wikimedia.org/T380506#10345010 (10JLam-WMF) [16:53:02] 06SRE: Wikitech and Mediawiki sign in errors - https://phabricator.wikimedia.org/T380506#10345014 (10JLam-WMF) [16:54:02] !log enable puppet on lvs2013 and start pybal [16:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:43] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs2013.codfw.wmnet [16:54:43] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs2013.codfw.wmnet [16:55:28] RECOVERY - BGP status on lsw1-c2-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:55:30] RECOVERY - PyBal backends health 
check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:55:32] RECOVERY - pybal on lvs2013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:59:32] 06SRE, 06DBA, 07Wikimedia-production-error: Parsercache issues in codfw causing large-scale outage - https://phabricator.wikimedia.org/T378076#10345067 (10Ladsgroup) 05Open→03Resolved The issue is over (and has been over for weeks now), we are changing how parsercache works drastically which would re... [17:00:05] jhathaway and rzl: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:01:03] (03PS1) 10Jelto: profile::auto_restarts::service: make restart time configurable [puppet] - 10https://gerrit.wikimedia.org/r/1093953 (https://phabricator.wikimedia.org/T380476) [17:02:34] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2157.codfw.wmnet with OS bookworm [17:04:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 4 others: Reimage one of the wikikube-worker1240 to wikikube-worker1304 node in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10345111 (10cmooney) >>! In T379790#10344210, @VRiley-WMF wrote: > A 10G c... 
[17:05:51] (03PS1) 10Ladsgroup: Bump ratio of new parsercache key spec to 6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093956 (https://phabricator.wikimedia.org/T373037) [17:06:51] (03PS5) 10Bvibber: Enabling Charts on commons+test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091328 (https://phabricator.wikimedia.org/T379689) [17:08:21] (03PS1) 10Ssingh: hiera: set do_ipv6_primary_ra for all LVS in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1093957 (https://phabricator.wikimedia.org/T358260) [17:08:23] (03PS1) 10Ssingh: LVS: enable do_ipv6_ra_primary in all sites [puppet] - 10https://gerrit.wikimedia.org/r/1093958 (https://phabricator.wikimedia.org/T358260) [17:08:58] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4566/console" [puppet] - 10https://gerrit.wikimedia.org/r/1093958 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [17:08:59] (03CR) 10CI reject: [V:04-1] hiera: set do_ipv6_primary_ra for all LVS in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1093957 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [17:09:06] (03CR) 10CI reject: [V:04-1] LVS: enable do_ipv6_ra_primary in all sites [puppet] - 10https://gerrit.wikimedia.org/r/1093958 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [17:09:31] (03PS2) 10Ssingh: hiera: set do_ipv6_primary_ra for all LVS in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1093957 (https://phabricator.wikimedia.org/T358260) [17:09:52] (03PS2) 10Ssingh: LVS: enable do_ipv6_ra_primary in all sites [puppet] - 10https://gerrit.wikimedia.org/r/1093958 (https://phabricator.wikimedia.org/T358260) [17:10:59] (03CR) 10JMeybohm: [C:03+2] wikikube-staging: put kubestage2003 and 2004 into production [puppet] - 10https://gerrit.wikimedia.org/r/1091783 (https://phabricator.wikimedia.org/T377011) (owner: 10Jasmine) [17:16:54] FIRING: KubernetesAPILatency: High 
Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:16:57] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T380478#10345261 (10Jhancock.wm) had to reseat power cable to get alert to clear. [17:21:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10345291 (10phaultfinder) [17:22:14] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854#10345324 (10Andrew) [17:22:30] (03PS1) 10Jdlrobson: Enable Skin-Codex logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093960 (https://phabricator.wikimedia.org/T375287) [17:23:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093960 (https://phabricator.wikimedia.org/T375287) (owner: 10Jdlrobson) [17:23:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079640 (https://phabricator.wikimedia.org/T372165) (owner: 10Simon04) [17:25:21] (03PS8) 10Ebernhardson: Migrate package to opensearch [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1080749 (https://phabricator.wikimedia.org/T372769) [17:25:52] (03Abandoned) 10Ebernhardson: Update README and gitreview [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1093270 (owner: 10DCausse) [17:26:44] 
10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T380480#10345356 (10Jhancock.wm) tried reseating power cable. no change. reseated psu. no change. changed power cable. error specifically on psu2 cleared, but misc error has not. might require idrac reboot or psu replacemen... [17:26:59] (03CR) 10Ebernhardson: [V:03+2 C:03+2] "ps8 squashes the pre-patch to update .gitreview and README to match the new location. The package itself has now been released and is ava" [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1080749 (https://phabricator.wikimedia.org/T372769) (owner: 10Ebernhardson) [17:27:14] (03PS1) 10Gergő Tisza: Set 'remember' central session object field when recreating [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093961 (https://phabricator.wikimedia.org/T379254) [17:27:31] (03PS1) 10Gergő Tisza: Use cookie to access central session when local session expired [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093962 [17:28:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093961 (https://phabricator.wikimedia.org/T379254) (owner: 10Gergő Tisza) [17:29:15] (03PS1) 10Andrew Bogott: Remove clouddb2002-dev from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1093963 (https://phabricator.wikimedia.org/T369308) [17:29:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093962 (owner: 10Gergő Tisza) [17:30:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 21 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092334 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [17:30:38] (03PS2) 10Jdlrobson: Enable Skin-Codex logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093960 (https://phabricator.wikimedia.org/T375287) [17:31:14] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts clouddb2002-dev.codfw.wmnet [17:33:09] (03PS12) 10Bking: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [17:33:18] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1093958/4570/" [puppet] - 10https://gerrit.wikimedia.org/r/1093958 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [17:33:23] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [17:35:39] (03CR) 10Vgutierrez: [C:04-2] "This will prevent the X-Requestctl header set by HAProxy from reaching Varnish" [puppet] - 10https://gerrit.wikimedia.org/r/1093940 (https://phabricator.wikimedia.org/T370745) (owner: 10CDanis) [17:36:10] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [17:37:57] (03PS13) 10Bking: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [17:38:12] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [17:39:24] !log adding acls to kafka-jumbo cluster (T380373) [17:39:25] FIRING: [2x] SystemdUnitFailed: dragonfly-dfdaemon.service on kubestage2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:28] T380373: Allow TLS authenticated client to write on new topics - https://phabricator.wikimedia.org/T380373 [17:39:54] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: clouddb2002-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [17:39:55] (03PS1) 10David Caro: horizon: bumping horizon to 2024-11-16-132956 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1093967 (https://phabricator.wikimedia.org/T380511) [17:40:04] (03CR) 10Andrew Bogott: [C:03+2] Remove clouddb2002-dev from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1093963 (https://phabricator.wikimedia.org/T369308) (owner: 10Andrew Bogott) [17:40:49] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: clouddb2002-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [17:40:49] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:40:50] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts clouddb2002-dev.codfw.wmnet [17:41:42] (03PS1) 10David Caro: horizon: bumping horizon to 2024-11-16-132956 globally [puppet] - 10https://gerrit.wikimedia.org/r/1093968 (https://phabricator.wikimedia.org/T380511) [17:42:24] (03PS3) 10CDanis: Move x-requestctl ingress scrub from Varnish to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1093940 (https://phabricator.wikimedia.org/T370745) [17:42:29] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093940 (https://phabricator.wikimedia.org/T370745) (owner: 10CDanis) [17:43:02] 10ops-codfw, 06cloud-services-team, 06Data-Persistence, 06DC-Ops, and 2 
others: Decommission clouddb2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T369308#10345520 (10Andrew) [17:44:25] RESOLVED: [2x] SystemdUnitFailed: dragonfly-dfdaemon.service on kubestage2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:44:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10345542 (10phaultfinder) [17:44:58] (03CR) 10Andrew Bogott: [C:03+1] horizon: bumping horizon to 2024-11-16-132956 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1093967 (https://phabricator.wikimedia.org/T380511) (owner: 10David Caro) [17:45:37] (03CR) 10Vgutierrez: Move x-requestctl ingress scrub from Varnish to haproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093940 (https://phabricator.wikimedia.org/T370745) (owner: 10CDanis) [17:45:41] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node check for host kubestage2003.codfw.wmnet [17:45:42] !log jayme@cumin2002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) check for host kubestage2003.codfw.wmnet [17:46:05] (03CR) 10Fabfur: haproxykafka: fix permissions on ssl files [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [17:46:09] 10ops-codfw, 06cloud-services-team, 06Data-Persistence, 06DC-Ops, and 2 others: Decommission clouddb2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T369308#10345516 (10Andrew) a:05Andrew→03None This host is shut down and removed from puppet. It's only a couple of years old so should probably n... 
[17:46:24] (03CR) 10Fabfur: [C:03+2] haproxykafka: fix permissions on ssl files [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [17:48:00] (03PS4) 10CDanis: Move x-requestctl ingress scrub from Varnish to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1093940 (https://phabricator.wikimedia.org/T370745) [17:48:02] 10ops-eqiad, 06SRE, 10Cloud-Services, 06DC-Ops, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10345576 (10dcaro) [17:48:25] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093940 (https://phabricator.wikimedia.org/T370745) (owner: 10CDanis) [17:49:30] (03PS5) 10CDanis: Move x-requestctl ingress scrub from Varnish to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1093940 (https://phabricator.wikimedia.org/T370745) [17:49:36] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093940 (https://phabricator.wikimedia.org/T370745) (owner: 10CDanis) [17:50:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:50:53] huh [17:51:32] fabfur: ^ [17:51:57] Could not set 'file' on ensure: No such file or directory @ rb_sysopen - [17:54:08] (03PS1) 10Ssingh: Revert "haproxykafka: fix permissions on ssl files" [puppet] - 10https://gerrit.wikimedia.org/r/1093973 [17:54:41] (03PS1) 10JMeybohm: wikikube-staging: add kubestage2003 and 2004 to confctl [puppet] - 10https://gerrit.wikimedia.org/r/1093974 (https://phabricator.wikimedia.org/T377011) [17:54:52] (03CR) 10Vgutierrez: [C:03+1] Move x-requestctl ingress scrub from Varnish to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1093940 (https://phabricator.wikimedia.org/T370745) 
(owner: 10CDanis) [17:54:55] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459#10345600 (10Papaul) 05Resolved→03Open a:05Papaul→03Jhancock.wm Reopen this task since we are now seeing the error on DIMM B3. @Jhancock.wm since this server is out... [17:55:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:55:47] yes [17:56:20] (03CR) 10Andrew Bogott: [C:03+2] horizon: bumping horizon to 2024-11-16-132956 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1093967 (https://phabricator.wikimedia.org/T380511) (owner: 10David Caro) [17:57:11] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459#10345617 (10Jhancock.wm) We have that on hand. @Vgutierrez (or anyone else in traffic) when is a good time to do this swap? [17:57:19] (03CR) 10CDanis: [C:03+2] Move x-requestctl ingress scrub from Varnish to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1093940 (https://phabricator.wikimedia.org/T370745) (owner: 10CDanis) [17:57:29] (03CR) 10CDanis: [C:03+2] Move x-requestctl ingress scrub from Varnish to haproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093940 (https://phabricator.wikimedia.org/T370745) (owner: 10CDanis) [17:58:22] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp2038.codfw.wmnet [reason: DIMM failure T308459] [17:58:26] T308459: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459 [17:58:38] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459#10345625 (10ssingh) Hi Jenn. 
The host has been depooled so you can do it whenever you want. Thanks! [17:58:55] (03PS1) 10Fabfur: haproxykafka: missing variable in merge [puppet] - 10https://gerrit.wikimedia.org/r/1093975 (https://phabricator.wikimedia.org/T379776) [17:59:22] (03PS14) 10Bking: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [17:59:33] (03CR) 10CI reject: [V:04-1] wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [18:00:05] bd808: How many deployers does it take to do Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T1800). [18:00:05] cdanis and bvibber: #bothumor I � Unicode. All rise for Extension:Charts to commons + test2wiki deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T1800). [18:00:14] hiii bvibber [18:00:14] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T380478#10345627 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm T380480 [18:00:43] (03PS1) 10Varnent: Setting wmgUseTranslationMemory to false for Office Wiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093976 (https://phabricator.wikimedia.org/T380414) [18:00:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:01:10] (03CR) 10CI reject: [V:04-1] haproxykafka: missing variable in merge [puppet] - 10https://gerrit.wikimedia.org/r/1093975 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [18:01:20] * bd808 looks to see if he has things that can ship today [18:01:55] bd808: sorry, not sure I understand? [18:02:47] cdanis: oh, I have a standing window/reminder for this time on Thursdays to deploy updates to things like Toolhub and Developer portal. [18:03:12] oh lol I missed the first jouncebot message [18:03:24] (03PS2) 10Fabfur: haproxykafka: missing variable in merge [puppet] - 10https://gerrit.wikimedia.org/r/1093975 (https://phabricator.wikimedia.org/T379776) [18:03:29] PROBLEM - Host cp2038 is DOWN: PING CRITICAL - Packet loss = 100% [18:03:35] ^ expected [18:03:38] depooled [18:04:08] cdanis: the bot spam all blends together :) [18:04:25] indeed :) [18:04:33] * bd808 has nothing to ship according to gerrit [18:04:40] \o/ [18:04:53] bvibber: we're just shipping the config change, right? no code backports or anything needed? [18:05:00] yeah just the config change [18:05:12] ok cool [18:05:14] do you want to run `scap backport` or shall I? [18:05:19] go ahead :D [18:05:39] (03CR) 10CI reject: [V:04-1] haproxykafka: missing variable in merge [puppet] - 10https://gerrit.wikimedia.org/r/1093975 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [18:05:43] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459#10345652 (10Jhancock.wm) replaced with a new DIMM. 
coming up now [18:05:52] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1091328 woot woot [18:06:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091328 (https://phabricator.wikimedia.org/T379689) (owner: 10Bvibber) [18:06:47] (03Merged) 10jenkins-bot: Enabling Charts on commons+test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091328 (https://phabricator.wikimedia.org/T379689) (owner: 10Bvibber) [18:06:52] (03CR) 10Fabfur: [C:03+2] Revert "haproxykafka: fix permissions on ssl files" [puppet] - 10https://gerrit.wikimedia.org/r/1093973 (owner: 10Ssingh) [18:07:02] !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1091328|Enabling Charts on commons+test2 (T379689)]] [18:07:07] (03CR) 10JMeybohm: [C:03+2] wikikube-staging: add kubestage2003 and 2004 to confctl [puppet] - 10https://gerrit.wikimedia.org/r/1093974 (https://phabricator.wikimedia.org/T377011) (owner: 10JMeybohm) [18:07:13] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459#10345658 (10Jhancock.wm) a:05Jhancock.wm→03Papaul [18:07:17] T379689: Deploy Charts to test2wiki + Commons - https://phabricator.wikimedia.org/T379689 [18:09:12] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on cp2038.codfw.wmnet with reason: DIMM replacement in progress [18:09:25] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp2038.codfw.wmnet with reason: DIMM replacement in progress [18:09:47] (03CR) 10Andrew Bogott: [C:03+2] horizon: bumping horizon to 2024-11-16-132956 globally [puppet] - 10https://gerrit.wikimedia.org/r/1093968 (https://phabricator.wikimedia.org/T380511) (owner: 10David Caro) [18:10:26] !log sudo cumin -b11 'A:cp' 'run-puppet-agent [18:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[18:10:45] !log running puppet on A:cp to resolve failed puppet run [18:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:33] bvibber: I suspect it's already visible on testservers [18:11:42] ok lemme poke it [18:11:51] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10345667 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:12:35] !log cdanis@deploy2002 cdanis, bvibber: Backport for [[gerrit:1091328|Enabling Charts on commons+test2 (T379689)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:12:39] T379689: Deploy Charts to test2wiki + Commons - https://phabricator.wikimedia.org/T379689 [18:13:27] looks good on commons [18:13:48] cdanis: do it! [18:13:52] !log cdanis@deploy2002 cdanis, bvibber: Continuing with sync [18:15:49] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10345674 (10Jhancock.wm) @Papaul this one server fails after the puppet certificate is generated. Can you take a look. I'll check out the other servers... [18:15:53] once it's fully lice i can test from test2 [18:15:56] !log jayme@cumin2002 conftool action : set/weight=10; selector: name=kubestage200[34].codfw.wmnet [18:16:01] (since it bounces internally you can't test the remote on mwdebug) [18:16:10] *live not lice [18:16:43] !log jayme@cumin2002 conftool action : set/pooled=yes; selector: name=kubestage200[34].codfw.wmnet [18:17:29] bvibber: huh, you know, maybe the debug hosts should always point to themselves for that kind of thing [18:17:45] hmmm [18:17:55] that's a normative should, an aspirational should [18:17:55] we do have some config way i think we can make that happen [18:17:58] hehe [18:18:01] almost! 
[18:18:05] we're lacking it for some use cases [18:18:12] like for hitting commons API we actually don't use the internal endpoint [18:18:35] https://phabricator.wikimedia.org/T368064 [18:20:01] this has been a long deploy [18:21:08] !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1091328|Enabling Charts on commons+test2 (T379689)]] (duration: 14m 05s) [18:21:16] T379689: Deploy Charts to test2wiki + Commons - https://phabricator.wikimedia.org/T379689 [18:21:32] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for JLy-WMF - https://phabricator.wikimedia.org/T380523 (10mmartorana) 03NEW [18:21:53] (03PS3) 10Scott French: mediawiki: support for service.deployment: none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081449 (https://phabricator.wikimedia.org/T377040) [18:21:55] (03PS3) 10Scott French: mw-api-int: add migration release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081450 (https://phabricator.wikimedia.org/T377040) [18:21:56] (03PS3) 10Scott French: mw-api-int: remove "migration" release values overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081452 (https://phabricator.wikimedia.org/T377040) [18:21:57] (03PS3) 10Scott French: mediawiki: add remaining migration releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082863 (https://phabricator.wikimedia.org/T377040) [18:21:58] (03PS3) 10Scott French: mediawiki: remove migration release overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082864 (https://phabricator.wikimedia.org/T377040) [18:22:01] ok lemme test on test2 [18:22:32] https://test2.wikipedia.org/wiki/User:Brooke_Vibber_(WMF)/chart_test \o/ [18:22:49] nice [18:24:33] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T380480#10345761 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted idrac. 
up [18:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10345771 (10phaultfinder) [18:25:25] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T380482#10345768 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated power cable and rebooted idrac. all alerts cleared. [18:27:15] (03PS3) 10Scott French: hieradata: add "migration" release of mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1081451 (https://phabricator.wikimedia.org/T377040) [18:27:16] (03PS2) 10Scott French: hieradata: add remaining "migration" releases [puppet] - 10https://gerrit.wikimedia.org/r/1082865 (https://phabricator.wikimedia.org/T377040) [18:27:27] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T380481#10345774 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm replaced. errors on bmc cleared. [18:28:39] 06SRE, 10SRE-Access-Requests: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525 (10Jly) 03NEW [18:29:40] (03CR) 10CDanis: [C:03+1] haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [18:30:15] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10345791 (10Jhancock.wm) 05Open→03Resolved [18:30:18] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10345820 (10Jhancock.wm) [18:32:36] 06SRE, 10SRE-Access-Requests: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10345832 (10Jly) [18:40:45] RESOLVED: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:43:05] (03PS1) 10Bvibber: Follow-up fix for Charts enable on commons/test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093983 (https://phabricator.wikimedia.org/T379689) [18:43:15] !log gmodena@deploy2002 Started deploy [analytics/refinery@199401a]: Ad-hoc deployment [analytics/refinery@199401a6] [18:43:50] cdanis: ok if we deploy that next we should be good :D [18:43:52] (03CR) 10SBassett: [C:03+1] "There's precedent for this, so from that standpoint I can +1. But someone from Privacy Engineer should also review this patch and perform" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093401 (https://phabricator.wikimedia.org/T380232) (owner: 10Greg Grossmeier) [18:44:27] (03CR) 10CDanis: [C:03+2] Follow-up fix for Charts enable on commons/test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093983 (https://phabricator.wikimedia.org/T379689) (owner: 10Bvibber) [18:44:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093983 (https://phabricator.wikimedia.org/T379689) (owner: 10Bvibber) [18:44:53] \o/ [18:45:12] (03Merged) 10jenkins-bot: Follow-up fix for Charts enable on commons/test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093983 (https://phabricator.wikimedia.org/T379689) (owner: 10Bvibber) [18:45:29] !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1093983|Follow-up fix for Charts enable on commons/test2 (T379689)]] [18:45:33] T379689: Deploy Charts to test2wiki + Commons - https://phabricator.wikimedia.org/T379689 [18:46:14] (03PS1) 10DDesouza: Deploy Reader Survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093987 [18:46:40] (03Abandoned) 10Gergő Tisza: Add 'auth' wiki tag when using the shared login domain [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1091922 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [18:47:29] bvibber: cool, it's fixed on mwdebug now :D [18:47:30] (03CR) 10CI reject: [V:04-1] Deploy Reader Survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093987 (owner: 10DDesouza) [18:47:34] \o/ [18:47:41] awesome sauce [18:47:42] I had to purge cache I think [18:47:50] ?action=purge that is [18:47:51] yeah it has to re-render the page once [18:47:58] so an edit or purge will fix it [18:49:02] yep, purge fixed it on commons too [18:49:06] I'll proceed as soon as scap prompts [18:49:25] !log cdanis@deploy2002 cdanis, bvibber: Backport for [[gerrit:1093983|Follow-up fix for Charts enable on commons/test2 (T379689)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:49:27] !log cdanis@deploy2002 cdanis, bvibber: Continuing with sync [18:49:54] (03Abandoned) 10DDesouza: Undeploy Annual Plan Core Metrics beta survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984287 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza) [18:50:25] (03Abandoned) 10DDesouza: Reorganize QuickSurveys config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984286 (owner: 10DDesouza) [18:56:58] !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093983|Follow-up fix for Charts enable on commons/test2 (T379689)]] (duration: 11m 29s) [18:57:05] T379689: Deploy Charts to test2wiki + Commons - https://phabricator.wikimedia.org/T379689 [18:57:14] (03CR) 10EoghanGaffney: [V:03+1 C:03+2] vrts: Block bondedsender RBL check from spamassassin on vrts [puppet] - 10https://gerrit.wikimedia.org/r/1093905 (https://phabricator.wikimedia.org/T380396) (owner: 10EoghanGaffney) [18:57:24] !log gmodena@deploy2002 Finished deploy [analytics/refinery@199401a]: Ad-hoc deployment [analytics/refinery@199401a6] (duration: 14m 08s) [19:00:05] andre and brennen: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T1900). [19:00:17] nah. [19:01:27] (03PS1) 10Fabfur: haproxykafka: enable ssl authentication [puppet] - 10https://gerrit.wikimedia.org/r/1093990 (https://phabricator.wikimedia.org/T379776) [19:01:33] !log gmodena@deploy2002 Started deploy [analytics/refinery@199401a] (thin): Ad-hoc deployment THIN [analytics/refinery@199401a6] [19:07:11] !log gmodena@deploy2002 Finished deploy [analytics/refinery@199401a] (thin): Ad-hoc deployment THIN [analytics/refinery@199401a6] (duration: 05m 37s) [19:08:57] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093990 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [19:19:20] (03PS3) 10Cathal Mooney: Potential script to assign fr-tech server IPs and switch ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) [19:21:24] (03CR) 10CI reject: [V:04-1] Potential script to assign fr-tech server IPs and switch ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) (owner: 10Cathal Mooney) [19:23:36] (03PS4) 10Cathal Mooney: Potential script to assign fr-tech server IPs and switch ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) [19:27:51] !log gmodena@deploy2002 Started deploy [analytics/refinery@199401a] (hadoop-test): Ad-hoc deployment TEST [analytics/refinery@199401a6] [19:28:24] (03CR) 10DDesouza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093987 (owner: 10DDesouza) [19:28:31] (03CR) 10Cathal Mooney: "Thanks for the review! Took most of the things on-board. 
I did hope I could use more of the common.py stuff originally, and tried a litt" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) (owner: 10Cathal Mooney) [19:29:08] 06SRE, 10SRE-Access-Requests: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10346117 (10sbassett) [19:29:30] 06SRE, 10SRE-Access-Requests: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10346119 (10sbassett) [19:29:35] (03PS2) 10DDesouza: Deploy Reader Survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093987 (https://phabricator.wikimedia.org/T378660) [19:30:07] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for JLy-WMF - https://phabricator.wikimedia.org/T380523#10346123 (10sbassett) [19:30:18] (03CR) 10CI reject: [V:04-1] Deploy Reader Survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093987 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [19:30:57] 06SRE, 10SRE-Access-Requests: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10346127 (10sbassett) [19:31:31] (03PS5) 10Cathal Mooney: Potential script to assign fr-tech server IPs and switch ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) [19:31:36] !log gmodena@deploy2002 Finished deploy [analytics/refinery@199401a] (hadoop-test): Ad-hoc deployment TEST [analytics/refinery@199401a6] (duration: 03m 45s) [19:32:42] (03PS3) 10DDesouza: Deploy Reader Survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093987 (https://phabricator.wikimedia.org/T378660) [19:34:14] jouncebot: nowandnext [19:34:14] For the next 1 hour(s) and 25 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T1900) [19:34:14] In 1 hour(s) and 25 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T2100) [19:39:19] (03PS1) 10Bvibber: Add tracking categories for {{#chart:}} usage [extensions/Chart] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1094000 (https://phabricator.wikimedia.org/T369684) [19:40:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Chart] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1094000 (https://phabricator.wikimedia.org/T369684) (owner: 10Bvibber) [19:45:44] (03PS4) 10DDesouza: Reader Survey: Deploy on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093987 (https://phabricator.wikimedia.org/T378660) [19:46:33] (03PS1) 10DDesouza: Reader Survey: Increase coverage on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094001 (https://phabricator.wikimedia.org/T378660) [19:55:33] (03CR) 10Reedy: "Just to note, this isn't production enwiki (if that is what was intended)..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093987 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [20:08:02] 10ops-eqiad, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10346245 (10VRiley-WMF) This drive has been replaced. Please let us know if anything else is needed. 
[20:14:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093497 (owner: 10Bartosz Dziewoński) [20:16:53] RECOVERY - Host cp2038 is UP: PING OK - Packet loss = 0%, RTA = 30.22 ms [20:19:21] PROBLEM - Webrequests Varnishkafka log producer on cp2038 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [20:20:45] (03PS1) 10CDobbins: P:hardware::noop: test custom report processor [puppet] - 10https://gerrit.wikimedia.org/r/1094004 [20:20:57] !log force agent on cp2038 [20:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:21] RECOVERY - Webrequests Varnishkafka log producer on cp2038 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [20:21:30] (03PS1) 10Bvibber: Add statsv to charts impressions [extensions/Chart] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1094005 (https://phabricator.wikimedia.org/T379833) [20:21:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Chart] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1094005 (https://phabricator.wikimedia.org/T379833) (owner: 10Bvibber) [20:21:52] (03CR) 10CI reject: [V:04-1] P:hardware::noop: test custom report processor [puppet] - 10https://gerrit.wikimedia.org/r/1094004 (owner: 10CDobbins) [20:24:08] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp2038.codfw.wmnet [reason: DIMM replaced, T308459] [20:24:14] T308459: codfw: cp2038 Correctable memory error on DIMM A3 - 
https://phabricator.wikimedia.org/T308459 [20:25:18] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459#10346277 (10ssingh) 05Open→03Resolved [20:32:05] FIRING: [2x] ProbeDown: Service restbase2021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:39:55] (03PS1) 10Andrew Bogott: puppet-enc: use project names rather than IDs in instance-puppet git [puppet] - 10https://gerrit.wikimedia.org/r/1094015 (https://phabricator.wikimedia.org/T379128) [20:40:38] (03CR) 10CI reject: [V:04-1] puppet-enc: use project names rather than IDs in instance-puppet git [puppet] - 10https://gerrit.wikimedia.org/r/1094015 (https://phabricator.wikimedia.org/T379128) (owner: 10Andrew Bogott) [20:41:44] (03PS2) 10Andrew Bogott: puppet-enc: use project names rather than IDs in instance-puppet git [puppet] - 10https://gerrit.wikimedia.org/r/1094015 (https://phabricator.wikimedia.org/T379128) [20:44:20] (03CR) 10Andrew Bogott: [C:03+2] puppet-enc: use project names rather than IDs in instance-puppet git [puppet] - 10https://gerrit.wikimedia.org/r/1094015 (https://phabricator.wikimedia.org/T379128) (owner: 10Andrew Bogott) [20:46:38] !log T378289 running mwscript-k8s -f --comment="T378289" -- extensions/CentralAuth/maintenance/attachAccount.php --wiki=testwiki --wiki-user-list T378289-accounts-to-attach.tsv [20:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:42] T378289: SUL accounts with unattached Wikitech accounts auto-creating unattached accounts on other wikis - https://phabricator.wikimedia.org/T378289 [20:48:12] (03CR) 10Scott French: "Thanks, Hugh! Made a first pass - let me know if anything's unclear." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [20:59:38] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T380182#10346385 (10phaultfinder) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241121T2100). [21:00:05] katherine_g, Jdlrobson, tgr, and bvibber: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:14] here [21:00:14] o/ [21:00:24] o/ [21:01:44] o/ [21:02:29] o/ i can deploy. [21:02:39] * cjming bows to brennen [21:02:40] tx [21:03:14] starting at the top with katherine_g's config change [21:03:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090968 (https://phabricator.wikimedia.org/T376597) (owner: 10Kgraessle) [21:05:19] (03Merged) 10jenkins-bot: Enable AutoModerator on afwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090968 (https://phabricator.wikimedia.org/T376597) (owner: 10Kgraessle) [21:05:35] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1090968|Enable AutoModerator on afwiki (T376597)]] [21:05:39] T376597: Enable AutoModerator on afwiki - https://phabricator.wikimedia.org/T376597 [21:07:50] ok i'm good to sync [21:07:54] goin' [21:08:46] well, will hit test servers shortly - we'll get a ping [21:09:48] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10346406 (10Ladsgroup) My plan is to start a 16 parallel cleaners for commons thumbnails, the first one doing the clean up on containers ending with 0 
(00, 10, 20, ..., f0), the secon... [21:10:57] !log brennen@deploy2002 kgraessle, brennen: Backport for [[gerrit:1090968|Enable AutoModerator on afwiki (T376597)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:11:01] T376597: Enable AutoModerator on afwiki - https://phabricator.wikimedia.org/T376597 [21:11:26] tgr|away: are these CentralAuth changes safe to bundle? i should probably get those going... [21:11:33] brennen: yes [21:12:02] (03CR) 10Brennen Bearnes: [C:03+2] Use cookie to access central session when local session expired [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093962 (owner: 10Gergő Tisza) [21:12:17] katherine_g: any testing to do here? [21:12:35] nope [21:12:50] !log brennen@deploy2002 kgraessle, brennen: Continuing with sync [21:13:11] ty [21:13:33] Jdlrobson: you'll be up here shortly, getting the first one going [21:14:05] (03CR) 10Brennen Bearnes: [C:03+2] Enable Skin-Codex logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093960 (https://phabricator.wikimedia.org/T375287) (owner: 10Jdlrobson) [21:14:51] (03Merged) 10jenkins-bot: Enable Skin-Codex logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093960 (https://phabricator.wikimedia.org/T375287) (owner: 10Jdlrobson) [21:15:15] sounds good! 
[21:16:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:19:26] !log brennen@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090968|Enable AutoModerator on afwiki (T376597)]] (duration: 13m 50s) [21:19:35] (03CR) 10Brennen Bearnes: [C:03+2] Set 'remember' central session object field when recreating [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093961 (https://phabricator.wikimedia.org/T379254) (owner: 10Gergő Tisza) [21:19:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10346431 (10phaultfinder) [21:19:47] T376597: Enable AutoModerator on afwiki - https://phabricator.wikimedia.org/T376597 [21:20:44] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1093960|Enable Skin-Codex logging (T375287)]] [21:21:09] T375287: Identify pages that rely on Codex message box but do not explicitly add Codex styles to the page - https://phabricator.wikimedia.org/T375287 [21:21:14] (03PS1) 10Andrew Bogott: puppet-enc: wider use of project names instead of project IDs [puppet] - 10https://gerrit.wikimedia.org/r/1094033 [21:21:55] (03CR) 10CI reject: [V:04-1] puppet-enc: wider use of project names instead of project IDs [puppet] - 10https://gerrit.wikimedia.org/r/1094033 (owner: 10Andrew Bogott) [21:24:50] (03PS2) 10Andrew Bogott: puppet-enc: wider use of project names instead of project IDs [puppet] - 10https://gerrit.wikimedia.org/r/1094033 [21:25:32] (03CR) 10CI reject: [V:04-1] puppet-enc: wider use of project names instead of project IDs [puppet] - 10https://gerrit.wikimedia.org/r/1094033 (owner: 10Andrew Bogott) [21:26:26] !log brennen@deploy2002 brennen, jdlrobson: 
Backport for [[gerrit:1093960|Enable Skin-Codex logging (T375287)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:26:30] T375287: Identify pages that rely on Codex message box but do not explicitly add Codex styles to the page - https://phabricator.wikimedia.org/T375287 [21:26:35] Jdlrobson: anything to test? [21:27:53] brennen: I dont think so [21:28:03] not sure how logstash works, but I assume I would verify it's working post merge [21:29:10] You could test it with X-Wikimedia-Debug I suppose. Not really worth it though, it's a trivial change. [21:29:20] 👍 [21:29:40] goin' [21:29:42] !log brennen@deploy2002 brennen, jdlrobson: Continuing with sync [21:29:57] yeh it works [21:30:05] i just verified it on X-Wikimedia-Debug [21:30:11] (03CR) 10Brennen Bearnes: [C:03+2] Reduce number of bucketsizes for MediaViewer (group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079640 (https://phabricator.wikimedia.org/T372165) (owner: 10Simon04) [21:30:15] sweet [21:30:17] (03Merged) 10jenkins-bot: Set 'remember' central session object field when recreating [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093961 (https://phabricator.wikimedia.org/T379254) (owner: 10Gergő Tisza) [21:30:22] (03Merged) 10jenkins-bot: Use cookie to access central session when local session expired [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093962 (owner: 10Gergő Tisza) [21:30:23] bucketsizes next, once this finishes. [21:31:08] (03Merged) 10jenkins-bot: Reduce number of bucketsizes for MediaViewer (group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079640 (https://phabricator.wikimedia.org/T372165) (owner: 10Simon04) [21:31:16] ...aaaand might have tgr|away check centralauth stuff same time, if possible. [21:31:36] brennen: im all ready to QA Simon04's patch. 
Should be a quick one :) [21:31:45] (03PS3) 10Andrew Bogott: puppet-enc: wider use of project names instead of project IDs [puppet] - 10https://gerrit.wikimedia.org/r/1094033 [21:32:26] (03CR) 10CI reject: [V:04-1] puppet-enc: wider use of project names instead of project IDs [puppet] - 10https://gerrit.wikimedia.org/r/1094033 (owner: 10Andrew Bogott) [21:33:58] (03PS4) 10Andrew Bogott: puppet-enc: wider use of project names instead of project IDs [puppet] - 10https://gerrit.wikimedia.org/r/1094033 (https://phabricator.wikimedia.org/T379128) [21:35:01] (03CR) 10Andrew Bogott: [C:03+2] puppet-enc: wider use of project names instead of project IDs [puppet] - 10https://gerrit.wikimedia.org/r/1094033 (https://phabricator.wikimedia.org/T379128) (owner: 10Andrew Bogott) [21:36:38] !log brennen@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093960|Enable Skin-Codex logging (T375287)]] (duration: 15m 53s) [21:36:43] T375287: Identify pages that rely on Codex message box but do not explicitly add Codex styles to the page - https://phabricator.wikimedia.org/T375287 [21:38:59] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1079640|Reduce number of bucketsizes for MediaViewer (group0) (T372165)]], [[gerrit:1093961|Set 'remember' central session object field when recreating (T379254 T372702)]], [[gerrit:1093962|Use cookie to access central session when local session expired]] [21:39:06] T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165 [21:39:06] T379254: centralauth_Token cookie not set on top-level autologin - https://phabricator.wikimedia.org/T379254 [21:39:06] T372702: editors are repeatedly getting logged out (August 2024) - https://phabricator.wikimedia.org/T372702 [21:42:48] !log brennen@deploy2002 brennen, tgr, simon04: Backport for [[gerrit:1079640|Reduce number of bucketsizes for MediaViewer (group0) (T372165)]], [[gerrit:1093961|Set 'remember' central session object field when 
recreating (T379254 T372702)]], [[gerrit:1093962|Use cookie to access central session when local session expired]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:44:16] Jdlrobson, tgr|away: i await your approval [21:46:53] brennen: on it [21:47:07] (03CR) 10Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [21:49:28] brennen: oh no the merge strategy issue strikes again [21:50:01] ? [21:50:04] On group0 for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1079640/2/wmf-config/InitialiseSettings.php#5823 I'm seeing `[320, 800, 1024, 1280, 1280, 1920, 2560, 2560, 2880]` rather than `[ 1280, 2560 ]` [21:50:24] e.g. instead of replacing the array it's merging it with the default value [21:50:31] so please undo that one. [21:50:57] k. discontinuing sync, will prep a revert for that one. [21:51:01] !log brennen@deploy2002 Sync cancelled. [21:51:12] thanks ! 
I'll leave a comment [21:51:33] brennen: mine looks good [21:52:01] tgr|away: ack, will sync those shortly [21:52:24] (03CR) 10Jdlrobson: [C:03+1] Reduce number of bucketsizes for MediaViewer (group0) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079640 (https://phabricator.wikimedia.org/T372165) (owner: 10Simon04) [21:52:36] (03PS1) 10Brennen Bearnes: Revert "Reduce number of bucketsizes for MediaViewer (group0)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094047 (https://phabricator.wikimedia.org/T372165) [21:53:32] Jdlrobson: I think the patch itself is fine [21:53:40] the bug is in the extension [21:53:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094047 (https://phabricator.wikimedia.org/T372165) (owner: 10Brennen Bearnes) [21:54:29] (03Merged) 10jenkins-bot: Revert "Reduce number of bucketsizes for MediaViewer (group0)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094047 (https://phabricator.wikimedia.org/T372165) (owner: 10Brennen Bearnes) [21:54:47] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1094047|Revert "Reduce number of bucketsizes for MediaViewer (group0)" (T372165)]] [21:54:51] T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165 [21:58:17] (03PS15) 10Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) [21:58:17] (03PS6) 10Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - 10https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330) [21:58:17] (03PS5) 10Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333) [21:58:35] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [21:58:38] !log brennen@deploy2002 brennen: Backport for [[gerrit:1094047|Revert "Reduce number of bucketsizes for MediaViewer (group0)" (T372165)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:58:41] !log brennen@deploy2002 brennen: Continuing with sync [21:59:41] i'm ok going over the window here for these last 3. [21:59:45] (03PS16) 10Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) [21:59:45] (03PS7) 10Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - 10https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330) [21:59:45] (03PS6) 10Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333) [21:59:47] ok [22:00:00] i got noplace else to be on my end :D [22:00:50] thx [22:03:44] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10346623 (10Papaul) 05Open→03Resolved This is fix [22:05:21] !log brennen@deploy2002 Finished scap sync-world: Backport for [[gerrit:1094047|Revert "Reduce number of bucketsizes for MediaViewer (group0)" (T372165)]] (duration: 10m 34s) [22:05:25] T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165 [22:05:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092334 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [22:06:27] (03Merged) 10jenkins-bot: Disable various extensions when using the shared login domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092334 
(https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [22:06:47] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1092334|Disable various extensions when using the shared login domain (T373737)]] [22:06:50] (03CR) 10Brennen Bearnes: [C:03+2] Add tracking categories for {{#chart:}} usage [extensions/Chart] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1094000 (https://phabricator.wikimedia.org/T369684) (owner: 10Bvibber) [22:06:51] T373737: Disable irrelevant extensions on SUL3 login domain - https://phabricator.wikimedia.org/T373737 [22:06:57] \o/ [22:07:09] bvibber: getting this one started, will sync it after tgr|away's config patch [22:07:15] coolio [22:08:08] (03PS8) 10Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - 10https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330) [22:08:08] (03PS7) 10Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333) [22:08:53] (03PS8) 10Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333) [22:08:54] (03PS9) 10Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - 10https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330) [22:08:56] (03CR) 10CI reject: [V:04-1] wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333) (owner: 10Ryan Kemper) [22:09:45] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [22:10:23] !log brennen@deploy2002 tgr, brennen: Backport for [[gerrit:1092334|Disable various extensions when using the shared login domain (T373737)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [22:10:57] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10346672 (10Papaul) @Jhancock.wm it looks like again the server did send the puppet cert request to the wrong puppet server. it supposed to send the req... [22:11:28] tgr|away: good on this one? [22:17:53] (03Merged) 10jenkins-bot: Add tracking categories for {{#chart:}} usage [extensions/Chart] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1094000 (https://phabricator.wikimedia.org/T369684) (owner: 10Bvibber) [22:18:10] it's mostly a beta change. Either not working or it was already broken before on beta, which is always an option (didn't think to check). In any cause, doesn't seem to have caused any issues in production so I think it's good to go [22:18:18] cool cool [22:18:20] !log brennen@deploy2002 tgr, brennen: Continuing with sync [22:20:20] (03CR) 10Ryan Kemper: "sorry this comment is probably too longwinded to understand :P" [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [22:22:20] (03PS5) 10DDesouza: Reader Survey: Deploy on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093987 (https://phabricator.wikimedia.org/T378660) [22:22:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2041.codfw.wmnet with OS bookworm [22:23:12] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10346724 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2041.codfw.wmnet with OS bookworm [22:24:12] (03CR) 10DDesouza: "@reedy@wikimedia.org Thanks! It escaped my attention. 
😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093987 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [22:25:03] !log brennen@deploy2002 Finished scap sync-world: Backport for [[gerrit:1092334|Disable various extensions when using the shared login domain (T373737)]] (duration: 18m 16s) [22:25:08] T373737: Disable irrelevant extensions on SUL3 login domain - https://phabricator.wikimedia.org/T373737 [22:25:31] (03CR) 10Brennen Bearnes: [C:03+2] Add statsv to charts impressions [extensions/Chart] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1094005 (https://phabricator.wikimedia.org/T379833) (owner: 10Bvibber) [22:25:40] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1094000|Add tracking categories for {{#chart:}} usage (T369684)]] [22:25:44] T369684: Add tracking category for pages with Charts - https://phabricator.wikimedia.org/T369684 [22:25:57] \o/ [22:31:05] (03PS1) 10DDesouza: Reader Survey: Increase coverage on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094054 (https://phabricator.wikimedia.org/T378660) [22:31:19] 06SRE, 10vm-requests, 07Kubernetes: codfw: (3x) aux-k8s-etcd nodes - https://phabricator.wikimedia.org/T378988#10346774 (10herron) `cumin1002:~$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 1 --disk 50 --os bookworm --cluster codfw --group A -t T378988 aux-k8s-etcd2003 cookbook [GLOBAL_ARGS] sre.ganet... 
[22:31:25] 06SRE, 10vm-requests, 07Kubernetes: codfw: (3x) aux-k8s-etcd nodes - https://phabricator.wikimedia.org/T378988#10346775 (10herron) [22:32:06] (03PS2) 10DDesouza: Reader Survey: Increase coverage on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094054 (https://phabricator.wikimedia.org/T378660) [22:32:07] !log herron@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-etcd2003.codfw.wmnet [22:32:08] !log herron@cumin1002 START - Cookbook sre.dns.netbox [22:32:27] (03CR) 10Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [22:32:30] (03Abandoned) 10DDesouza: Reader Survey: Increase coverage on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094001 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [22:34:06] bvibber: should get a ping here when that first one is ready for testing. it's taking longer than usual because of language rebuilds, i believe. 
[22:34:18] cool [22:34:39] patience is a valuable commodity with deploys, there's always something taking longer than ya think ;) [22:35:31] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-etcd2003.codfw.wmnet - herron@cumin1002" [22:35:36] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-etcd2003.codfw.wmnet - herron@cumin1002" [22:35:36] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:35:36] !log herron@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-etcd2003.codfw.wmnet on all recursors [22:35:39] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-etcd2003.codfw.wmnet on all recursors [22:36:06] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-etcd2003.codfw.wmnet - herron@cumin1002" [22:36:11] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-etcd2003.codfw.wmnet - herron@cumin1002" [22:38:06] (03Merged) 10jenkins-bot: Add statsv to charts impressions [extensions/Chart] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1094005 (https://phabricator.wikimedia.org/T379833) (owner: 10Bvibber) [22:38:06] !log herron@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-etcd2003.codfw.wmnet with OS bookworm [22:38:16] 06SRE, 10vm-requests, 07Kubernetes: codfw: (3x) aux-k8s-etcd nodes - https://phabricator.wikimedia.org/T378988#10346792 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-etcd2003.codfw.wmnet with OS bookworm [22:40:52] !log 
brennen@deploy2002 bvibber, brennen: Backport for [[gerrit:1094000|Add tracking categories for {{#chart:}} usage (T369684)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:40:52] !log brennen@deploy2002 Sync cancelled. [22:40:56] T369684: Add tracking category for pages with Charts - https://phabricator.wikimedia.org/T369684 [22:41:13] dammit. [22:41:31] brennen: looks functional [22:41:33] uh-oh [22:41:39] :D [22:42:31] (03PS17) 10Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) [22:42:31] (03PS9) 10Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333) [22:42:32] (03PS10) 10Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - 10https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330) [22:42:41] !log brennen@deploy2002 Started scap sync-world: resuming sync for [[gerrit:1094000|Add tracking categories for {{#chart:}} usage (T369684)]] after messing up a keypress [22:42:46] i... inopportunely pressed enter on a default "N". resuming a sync-world. [22:42:47] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [22:43:00] this should be quicker. 
[22:43:21] whee [22:43:44] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [22:43:50] (03CR) 10Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [22:44:18] (03CR) 10Cathal Mooney: WIP: example config for Nokia SR-Linux (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1084107 (https://phabricator.wikimedia.org/T371088) (owner: 10Ayounsi) [22:46:03] (03CR) 10Cathal Mooney: WIP: example config for Nokia SR-Linux (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1084107 (https://phabricator.wikimedia.org/T371088) (owner: 10Ayounsi) [22:47:37] brennen: the CentralAuth patches are live now, right? [22:48:33] tgr|away: yep [22:49:02] cool, thx for the deploys! [22:49:58] sure thing [22:51:50] (03PS18) 10Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) [22:51:50] (03PS10) 10Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333) [22:51:50] (03PS11) 10Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - 10https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330) [22:52:19] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [22:52:37] !log herron@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-etcd2003.codfw.wmnet with reason: host reimage [22:54:45] !log brennen@deploy2002 Finished scap sync-world: resuming sync for [[gerrit:1094000|Add tracking categories for {{#chart:}} usage (T369684)]] after messing up a keypress 
(duration: 12m 35s) [22:54:49] T369684: Add tracking category for pages with Charts - https://phabricator.wikimedia.org/T369684 [22:55:02] bvibber: ok, last one [22:55:18] Woohoo [22:55:38] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-etcd2003.codfw.wmnet with reason: host reimage [22:55:45] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1094005|Add statsv to charts impressions (T379833)]] [22:55:49] T379833: Add basic instrumentation for when a chart is viewed - https://phabricator.wikimedia.org/T379833 [22:56:28] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333) (owner: 10Ryan Kemper) [22:58:06] (03CR) 10Bking: [C:03+1] "We've reduced the scope of this patch so we're only creating the Puppet roles. LVS-related config in subsequent patches." [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [23:00:02] !log brennen@deploy2002 bvibber, brennen: Backport for [[gerrit:1094005|Add statsv to charts impressions (T379833)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:00:31] brennen: looks good [23:01:04] !log brennen@deploy2002 bvibber, brennen: Continuing with sync [23:01:07] cool, syncing. [23:01:19] whee [23:06:42] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2041.codfw.wmnet with OS bookworm [23:06:57] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10346871 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2041.codfw.wmnet with OS bookworm executed... 
[23:07:53] !log brennen@deploy2002 Finished scap sync-world: Backport for [[gerrit:1094005|Add statsv to charts impressions (T379833)]] (duration: 12m 08s) [23:07:58] T379833: Add basic instrumentation for when a chart is viewed - https://phabricator.wikimedia.org/T379833 [23:08:45] \o/ [23:08:46] !log end of utc late backport & config window [23:08:47] thanks brennen ! [23:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:54] sure thing [23:09:40] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-etcd2003.codfw.wmnet with OS bookworm [23:09:40] !log herron@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-etcd2003.codfw.wmnet [23:09:48] 06SRE, 10vm-requests, 07Kubernetes: codfw: (3x) aux-k8s-etcd nodes - https://phabricator.wikimedia.org/T378988#10346885 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-etcd2003.codfw.wmnet with OS bookworm completed: - aux-k8s-etcd2003 (**PASS**) - R... 
[23:11:19] !log herron@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-etcd2004.codfw.wmnet
[23:11:20] !log herron@cumin1002 START - Cookbook sre.dns.netbox
[23:11:55] (PS19) Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329)
[23:11:55] (PS11) Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333)
[23:11:55] (PS12) Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330)
[23:11:56] (PS1) Ryan Kemper: wdqs-internal: Add graph split svcs to catalog [puppet] - https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555)
[23:24:51] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-etcd2004.codfw.wmnet - herron@cumin1002"
[23:28:53] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-etcd2004.codfw.wmnet - herron@cumin1002"
[23:28:53] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:28:53] !log herron@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-etcd2004.codfw.wmnet on all recursors
[23:28:57] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-etcd2004.codfw.wmnet on all recursors
[23:29:23] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-etcd2004.codfw.wmnet - herron@cumin1002"
[23:29:27] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-etcd2004.codfw.wmnet - herron@cumin1002"
[23:32:25] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333) (owner: Ryan Kemper)
[23:34:36] (PS2) Ryan Kemper: wdqs-internal: Add graph split svcs to catalog [puppet] - https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555)
[23:36:36] !log herron@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-etcd2004.codfw.wmnet with OS bookworm
[23:36:42] SRE, vm-requests, Kubernetes: codfw: (3x) aux-k8s-etcd nodes - https://phabricator.wikimedia.org/T378988#10346944 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-etcd2004.codfw.wmnet with OS bookworm
[23:41:56] (PS13) Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330)
[23:41:56] (PS3) Ryan Kemper: wdqs-internal: Add graph split svcs to catalog [puppet] - https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555)
[23:52:54] !log herron@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-etcd2004.codfw.wmnet with reason: host reimage
[23:56:24] !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-etcd2004.codfw.wmnet with reason: host reimage
[23:58:55] (PS14) Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330)
[23:58:55] (PS4) Ryan Kemper: wdqs-internal: Add graph split svcs to catalog [puppet] - https://gerrit.wikimedia.org/r/1094061 (https://phabricator.wikimedia.org/T380555)
[23:58:55] (PS1) Ryan Kemper: wdqs-internal: configure lvs IPs for backends [puppet] - https://gerrit.wikimedia.org/r/1094069 (https://phabricator.wikimedia.org/T380555)
[23:58:56] (PS1) Ryan Kemper: wdqs-internal: configure graphsplit load balancers [puppet] - https://gerrit.wikimedia.org/r/1094070 (https://phabricator.wikimedia.org/T380555)