[00:00:02] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1150157 (owner: 10TrainBranchBot) [00:03:34] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:08] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1150803 [00:08:08] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1150803 (owner: 10TrainBranchBot) [00:18:34] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:20:33] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [00:20:35] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [00:20:36] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [00:20:38] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [00:20:39] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [00:20:42] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [00:28:23] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1150803 (owner: 10TrainBranchBot) [00:44:14] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/66ba988a16aab42f0caf5f9631dd66353d911a3adaffb67de810b2291cb2bc3a/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [00:46:16] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [01:04:14] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:08:11] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.3 [core] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1150828 (https://phabricator.wikimedia.org/T392173) [01:08:12] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.3 [core] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1150828 (https://phabricator.wikimedia.org/T392173) (owner: 10TrainBranchBot) [01:20:38] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.3 [core] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1150828 (https://phabricator.wikimedia.org/T392173) (owner: 10TrainBranchBot) [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T0200) [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T0300) [03:01:41] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150855 (https://phabricator.wikimedia.org/T392173) [03:01:42] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.45.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150855 
(https://phabricator.wikimedia.org/T392173) (owner: 10TrainBranchBot) [03:02:28] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150855 (https://phabricator.wikimedia.org/T392173) (owner: 10TrainBranchBot) [03:02:47] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.3 refs T392173 [03:02:51] T392173: 1.45.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T392173 [03:15:03] (03CR) 10Bunnypranav: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150778 (https://phabricator.wikimedia.org/T395285) (owner: 10Ahonc) [03:17:42] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:49:36] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.3 refs T392173 (duration: 46m 49s) [03:49:40] T392173: 1.45.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T392173 [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T0400) [04:01:59] !log mwpresync@deploy1003 Pruned MediaWiki: 1.44.0-wmf.28 (duration: 01m 50s) [04:03:34] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:18:34] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:46:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:07:42] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:16:17] FIRING: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:40:28] (03CR) 10Andriy.v: [C:03+1] Add user group extendedmover to ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150778 (https://phabricator.wikimedia.org/T395285) (owner: 10Ahonc) [06:00:07] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T0600) [06:00:08] marostegui, Amir1, and federico3: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T0600). 
[06:02:42] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:06:24] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:23:02] (03CR) 10Bunnypranav: [C:03+1] Add user group extendedmover to ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150778 (https://phabricator.wikimedia.org/T395285) (owner: 10Ahonc) [06:23:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149805 (https://phabricator.wikimedia.org/T394603) (owner: 10Bunnypranav) [06:28:34] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:31:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [06:33:34] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:36:05] (03PS1) 10Fabfur: haproxy: truncate isp name to 64 bytes [puppet] - 10https://gerrit.wikimedia.org/r/1151011 (https://phabricator.wikimedia.org/T392219) [06:38:00] !log failover Ganeti master in magru/B4 to ganeti7004 T394263 [06:38:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:05] T394263: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263 [06:38:34] (03CR) 10Ayounsi: [C:03+2] Interfaces: also alert on frack routers and switches [alerts] - 10https://gerrit.wikimedia.org/r/1150692 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [06:39:54] PROBLEM - ganeti-wconfd running on ganeti7002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [06:40:15] (03Merged) 10jenkins-bot: Interfaces: also alert on frack routers and switches [alerts] - 10https://gerrit.wikimedia.org/r/1150692 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [06:44:58] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151011 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [06:51:38] PROBLEM - Host ganeti-test2002 is DOWN: PING CRITICAL - Packet loss = 100% [06:55:30] RECOVERY - Host ganeti-test2002 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [06:58:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [07:00:05] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T0700). [07:00:05] samwilson, MatmaRex, Ahonc, and bunnypranav: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:30] hi [07:00:53] here [07:01:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [07:03:46] hullo [07:05:16] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 15133 [07:05:17] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15133 [07:07:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [07:09:00] o/ [07:09:04] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 143, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:10:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [07:12:24] I love the fun quirky messages of jouncebot. Lifts up the spirits of everyone around. :) [07:14:32] sorry, my IRC disconnected [07:14:46] is anyone deploying? [07:15:21] No one yet [07:19:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [07:21:27] Amir1, Urbanecm, awight: are any of you around to deploy? [07:23:07] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. Alternatively we could also update the imposm-initial-script script so that set_permissions() is run twice, once at the curren" [puppet] - 10https://gerrit.wikimedia.org/r/1150718 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [07:23:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [07:25:56] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for Zuul upgrade project - https://phabricator.wikimedia.org/T393873#10858158 (10MoritzMuehlenhoff) 05Resolved→03Open zuul2001 is running Bullseye, that seems like a mistake? The other... [07:29:00] tgr or hashar are you available by any chance? 
[07:29:14] good morning [07:29:27] sorry I was already processing my reviews/emails queues :b [07:29:31] what is happening? [07:29:59] there's no one to deploy for the current backport window [07:30:19] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 17806 [07:30:25] I'll do them [07:30:40] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 17806 [07:30:45] thank you! :) [07:31:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum7002.magru.wmnet to plain [07:31:15] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10858197 (10Stevemunene) [07:32:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7002.magru.wmnet to plain [07:32:07] so bunnypranav's change is straightforward, I guess I'll have to run the namespaceDupes script [07:32:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [07:32:22] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 10089 [07:32:58] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 10089 [07:33:27] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 15932 [07:33:28] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15932 [07:33:33] hashar: I believe so, yes. Don't exactly know what the script solves here. 
[07:33:33] (03PS1) 10Fabfur: hiera: enable maxmind ISP lookup in esams [puppet] - 10https://gerrit.wikimedia.org/r/1151109 (https://phabricator.wikimedia.org/T395295) [07:33:41] probably nothing :) [07:33:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh7002.wikimedia.org to plain [07:33:45] for the extendedmovers group for ukwiki [07:33:54] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1150778 [07:33:58] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:34:14] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10858200 (10ops-monitoring-bot) VM doh7002.wikimedia.org switching disk type to plain [07:34:14] I don't know what the permissions oathauth-enable or tboverride are :/ [07:34:16] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:34:21] Also, thanks for deploying! 
[07:34:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7002.wikimedia.org to plain [07:34:46] hashar: made similar to enwiki [07:34:54] PROBLEM - Bird Internet Routing Daemon on durum7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:35:30] ah indeed :) [07:35:32] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 52468 [07:35:33] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.network.peering (exit_code=97) with action 'email' for AS: 52468 [07:35:55] RECOVERY - Bird Internet Routing Daemon on durum7002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:35:56] oathauth-enable allows enabling of 2FA. tboverride allows overriding the title blacklist. [07:36:16] bunnypranav: cool thanks! :) [07:36:42] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151109 (https://phabricator.wikimedia.org/T395295) (owner: 10Fabfur) [07:36:46] then samwilson patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1150629 which looks all well constrained [07:37:06] (03PS1) 10Stevemunene: hdfs: Exclude group 5 rack F1 hosts from analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1151112 (https://phabricator.wikimedia.org/T390172) [07:37:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10858203 (10Stevemunene) a:03Stevemunene [07:37:31] is someone working on durum7002? 
[07:37:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149805 (https://phabricator.wikimedia.org/T394603) (owner: 10Bunnypranav) [07:37:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150778 (https://phabricator.wikimedia.org/T395285) (owner: 10Ahonc) [07:37:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150629 (https://phabricator.wikimedia.org/T377975) (owner: 10Samwilson) [07:37:40] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 269180 [07:37:58] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:38:00] hashar: yes it should be reasonably low-risk I think [07:38:11] fabfur: given the spam here, you might want to ask -sre :) [07:38:11] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 269180 [07:38:14] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:38:59] (03Merged) 10jenkins-bot: core-Namespaces: Update Malay wiki (mswiki) namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149805 (https://phabricator.wikimedia.org/T394603) (owner: 10Bunnypranav) [07:39:02] (03Merged) 10jenkins-bot: Add user group extendedmover to ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150778 (https://phabricator.wikimedia.org/T395285) (owner: 10Ahonc) [07:39:04] (03Merged) 10jenkins-bot: InitialiseSettings: wgTemplateDataEnableDiscovery on plwiki and arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150629 
(https://phabricator.wikimedia.org/T377975) (owner: 10Samwilson) [07:39:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [07:39:50] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1149805|core-Namespaces: Update Malay wiki (mswiki) namespace aliases (T394603)]], [[gerrit:1150778|Add user group extendedmover to ukwiki (T395285)]], [[gerrit:1150629|InitialiseSettings: wgTemplateDataEnableDiscovery on plwiki and arwiki (T377975)]] [07:39:56] T394603: Configure the namespaces on Malay Wikipedia - https://phabricator.wikimedia.org/T394603 [07:39:56] T395285: Add user group extendedmover to ukwiki - https://phabricator.wikimedia.org/T395285 [07:39:57] T377975: Enable template favouriting on Beta, pilot wikis, and test - https://phabricator.wikimedia.org/T377975 [07:40:11] (03PS18) 10Arnaudb: gerrit: lock, preflight checks, hieradata lookups, verbosity [cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) [07:40:11] (03CR) 10Arnaudb: "Your uninformed estimation opened a path we did not explore previously! thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb) [07:42:04] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 134823 [07:42:05] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 134823 [07:42:41] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "set cloudcephmon1004 as active after disk replacement - taavi@cumin1002" [07:43:22] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 209453 [07:43:22] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 209453 [07:43:35] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "set cloudcephmon1004 as active after disk replacement - taavi@cumin1002" [07:44:01] !log hashar@deploy1003 hashar, bunnypranav, samwilson, ahonc: Backport for [[gerrit:1149805|core-Namespaces: Update Malay wiki (mswiki) namespace aliases (T394603)]], [[gerrit:1150778|Add user group extendedmover to ukwiki (T395285)]], [[gerrit:1150629|InitialiseSettings: wgTemplateDataEnableDiscovery on plwiki and arwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can [07:44:01] now be verified there. 
[07:44:12] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 45287 [07:44:12] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45287 [07:46:07] looks ok [07:46:19] (03PS1) 10Brouberol: airflow: enable the same envoy discovery listeners in all instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151114 (https://phabricator.wikimedia.org/T369845) [07:46:19] bunnypranav: samwilson: Ahonc: your patches are on the debug servers :) [07:46:23] Ahonc: great thank you [07:46:38] hashar: the wgTemplateDataEnableDiscovery change looks good in debug [07:46:55] All good, works as intended! [07:46:56] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:47:06] !log hashar@deploy1003 hashar, bunnypranav, samwilson, ahonc: Continuing with sync [07:47:10] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:49:20] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudservices2004-dev.codfw.wmnet [07:50:19] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 4800 [07:51:02] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4800 [07:52:42] RESOLVED: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:54:17] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 151575 [07:54:37] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 151575 [07:54:56] 
RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:55:10] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:55:18] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2004-dev.codfw.wmnet [07:56:22] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1149805|core-Namespaces: Update Malay wiki (mswiki) namespace aliases (T394603)]], [[gerrit:1150778|Add user group extendedmover to ukwiki (T395285)]], [[gerrit:1150629|InitialiseSettings: wgTemplateDataEnableDiscovery on plwiki and arwiki (T377975)]] (duration: 16m 31s) [07:56:30] T394603: Configure the namespaces on Malay Wikipedia - https://phabricator.wikimedia.org/T394603 [07:56:31] T395285: Add user group extendedmover to ukwiki - https://phabricator.wikimedia.org/T395285 [07:56:33] T377975: Enable template favouriting on Beta, pilot wikis, and test - https://phabricator.wikimedia.org/T377975 [07:56:48] so that got deployed [07:56:49] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 1273 [07:57:09] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 1273 [07:57:09] thank you! [07:57:46] Thanks a lot for saving the day! 
(I mean the backport windows) :D [07:58:23] thanks [07:58:33] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 13489 [07:58:33] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13489 [07:58:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of bast7001.wikimedia.org to plain [07:59:05] bunnypranav: so last time I did a namespace change on the Malay Wikipedia and namespacedupes returned nothing https://phabricator.wikimedia.org/T394603#10839297 [07:59:09] but this time I have a bunch of entries :) [07:59:17] 4528 links to fix, 4528 were resolvable, 0 were deleted. [07:59:19] (03CR) 10Aqu: [C:03+1] "Excellent!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151114 (https://phabricator.wikimedia.org/T369845) (owner: 10Brouberol) [07:59:23] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10858261 (10ops-monitoring-bot) VM bast7001.wikimedia.org switching disk type to plain [07:59:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of bast7001.wikimedia.org to plain [07:59:42] do you have a bit more time for the wmf.2 backports, or should i reschedule? [07:59:59] MatmaRex: can you do it while I debug the namespacedupe thing? [08:00:12] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 16509 [08:00:15] i don't have shell access [08:00:25] 😨 [08:01:16] hashar: Oh, I guess cause I removed an older alias in favour of a new one. [08:01:20] i can reschedule them for the afternoon, no problem. thanks for deploying the rest :) [08:01:23] Is it a problem? 
[08:01:57] MatmaRex: I have added both in spiderpig so they are processing now :) [08:02:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149822 (https://phabricator.wikimedia.org/T392251) (owner: 10Gergő Tisza) [08:02:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149823 (https://phabricator.wikimedia.org/T392251) (owner: 10Gergő Tisza) [08:02:11] hashar: oh, thanks! [08:02:15] bunnypranav: yeah I will run the script. There is one link broken and I will report on the task [08:02:18] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:02:23] id=1255216 ns=0 dbk=Portal:Masyarakat *** dest title exists and --add-prefix not specified [08:02:32] (03CR) 10Brouberol: [C:03+2] airflow: enable the same envoy discovery listeners in all instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151114 (https://phabricator.wikimedia.org/T369845) (owner: 10Brouberol) [08:02:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:02:43] Ok, sure [08:03:29] hashar: Please ping @Hakimi97 on phab, they can probably fix it. [08:03:34] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:05:13] I'll have to leave now. Thanks again for the deploy! 
:) [08:05:38] ayounsi@cumin1002 peering (PID 2855751) is awaiting input [08:06:33] (03PS1) 10Slyngshede: P:openldap::management remove ops-limited from validation [puppet] - 10https://gerrit.wikimedia.org/r/1151118 [08:07:08] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1011.eqiad.wmnet [08:10:45] bunnypranav: looks like the broken page is a redirect https://ms.wikipedia.org/w/index.php?title=Gerbang:T394603/Masyarakat&redirect=no I have reopened the task and I guess Hakimi97 will take care of it [08:10:46] T394603: Configure the namespaces on Malay Wikipedia - https://phabricator.wikimedia.org/T394603 [08:10:52] bunnypranav: have a good day [08:11:24] MatmaRex: I really thought you had deployment rights :b [08:12:13] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1011.eqiad.wmnet [08:12:44] i have enough hats, i managed to dodge this one [08:13:39] !log klausman@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1011.eqiad.wmnet with OS bookworm [08:13:48] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10858282 (10MatthewVernon) >>! In T378922#10848007, @Jelto wrote: >>>! In T378922#10847339, @MatthewVernon wrote: >> Ah, the bucket... 
[08:14:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7002.magru.wmnet [08:14:31] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10858289 (10ops-monitoring-bot) Draining ganeti7002.magru.wmnet of running VMs [08:14:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [08:15:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7002.magru.wmnet [08:15:55] (03Merged) 10jenkins-bot: Do not save on Session::renew() when there's nothing to renew [core] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149822 (https://phabricator.wikimedia.org/T392251) (owner: 10Gergő Tisza) [08:16:01] (03Merged) 10jenkins-bot: Don't save after Session::delaySave() when there's no delayed save [core] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149823 (https://phabricator.wikimedia.org/T392251) (owner: 10Gergő Tisza) [08:16:28] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1149822|Do not save on Session::renew() when there's nothing to renew (T392251)]], [[gerrit:1149823|Don't save after Session::delaySave() when there's no delayed save (T392251)]] [08:16:32] T392251: SessionBackend seems to store session changes too often - https://phabricator.wikimedia.org/T392251 [08:16:36] MatmaRex: your patches are in the pipes [08:16:53] yup [08:17:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7002.magru.wmnet to plain [08:17:01] MatmaRex: and if I remember properly we have extended/debug logs for sessions [08:17:26] PROBLEM - BGP status on lsw1-f5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:17:38] so maybe your patches 
will cut some of the logs there
[08:17:46] we have all kinds of debug logs for them… we've actually been working on removing some
[08:17:59] SRE, SRE-swift-storage, Ceph, collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10858295 (Jelto) Thanks a lot @MatthewVernon, I can confirm the buckets are gone. I'll re-create the buckets soon and apply the A...
[08:18:03] (but that's not in this backport)
[08:18:19] are you doing that with Gergo?
[08:18:21] i'm planning to test this by just logging in and out on a sock account
[08:18:30] !log hashar@deploy1003 hashar, tgr: Backport for [[gerrit:1149822|Do not save on Session::renew() when there's nothing to renew (T392251)]], [[gerrit:1149823|Don't save after Session::delaySave() when there's no delayed save (T392251)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:18:36] SRE, SRE-swift-storage, Ceph, collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10858296 (Jelto)
[08:18:37] and derick, yeah
[08:18:38] https://phabricator.wikimedia.org/T394402
[08:19:15] (PS1) Klausman: role::ml_k8s::worker: upgrade ml-serve1011 to containerd [puppet] - https://gerrit.wikimedia.org/r/1151120 (https://phabricator.wikimedia.org/T387854)
[08:19:27] awesome! :)
[08:19:40] MatmaRex: changes are live on the debug servers
[08:20:00] jmm@cumin2002 changedisk (PID 2969929) is awaiting input
[08:20:30] (PS2) Jgiannelos: pcs: Disable changeprop rule for summary [deployment-charts] - https://gerrit.wikimedia.org/r/1150731 (https://phabricator.wikimedia.org/T264670)
[08:20:49] hashar: yeah, i was just checking them out.
looks good
[08:20:55] !log hashar@deploy1003 hashar, tgr: Continuing with sync
[08:22:30] SRE, SRE-swift-storage, Ceph, collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10858303 (MatthewVernon) Yeah, the apus cluster isn't ideal for buckets with a very large number of objects in (if we wanted to s...
[08:22:50] SRE, Ganeti, Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10858304 (ops-monitoring-bot) VM ncredir7002.magru.wmnet switching disk type to plain
[08:23:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7002.magru.wmnet to plain
[08:26:02] hashar: Thanks for the comment. Hope you have a good day as well! :)
[08:26:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus7001.magru.wmnet to plain
[08:26:24] (Abandoned) Cathal Mooney: Add entry for cagefive2* hosts in site.pp [puppet] - https://gerrit.wikimedia.org/r/1150642 (https://phabricator.wikimedia.org/T394021) (owner: Cathal Mooney)
[08:26:34] SRE, Ganeti, Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10858307 (ops-monitoring-bot) VM prometheus7001.magru.wmnet switching disk type to plain
[08:26:44] (CR) Klausman: [V:+2 C:+2] role::ml_k8s::worker: upgrade ml-serve1011 to containerd [puppet] - https://gerrit.wikimedia.org/r/1151120 (https://phabricator.wikimedia.org/T387854) (owner: Klausman)
[08:26:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus7001.magru.wmnet to plain
[08:27:57] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1149822|Do not save on Session::renew() when there's nothing to renew (T392251)]], [[gerrit:1149823|Don't save after
Session::delaySave() when there's no delayed save (T392251)]] (duration: 11m 29s)
[08:28:01] T392251: SessionBackend seems to store session changes too often - https://phabricator.wikimedia.org/T392251
[08:28:12] thanks hashar
[08:28:35] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 16509
[08:28:39] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 209453
[08:28:39] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 209453
[08:29:57] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 32098
[08:30:23] !log UTC morning backport window has been completed.
[08:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:36] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 32098
[08:31:04] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 399728
[08:31:30] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 399728
[08:32:47] !log deploying debmonitor-client v0.4.1-1 fleet-wide
[08:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:33] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 40217
[08:33:44] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 40217
[08:34:16] SRE, SRE-swift-storage, Ceph, collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?)
- https://phabricator.wikimedia.org/T378922#10858327 (10Jelto) [08:34:59] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 40217 [08:35:10] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 40217 [08:36:09] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10858342 (10MoritzMuehlenhoff) 05Resolved→03Open mc-2001 isn't up and there's also no serial console [08:38:50] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 40217 [08:39:01] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 40217 [08:39:28] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 40217 [08:39:53] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 40217 [08:40:15] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 59605 [08:40:15] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 59605 [08:40:45] (03CR) 10Elukey: "We can probably do both for safety, what do you think?" [puppet] - 10https://gerrit.wikimedia.org/r/1150718 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:41:33] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 16347 [08:41:51] (03CR) 10Muehlenhoff: [C:03+1] "Agreed, the addition of the SQL grants should be fully idempotent and this seems more robust." 
[puppet] - 10https://gerrit.wikimedia.org/r/1150718 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:42:14] !log klausman@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1011.eqiad.wmnet with reason: host reimage [08:42:21] (03CR) 10Elukey: [C:03+2] profile::maps: add default privileges for kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/1150718 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:42:32] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16347 [08:43:50] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 20857 [08:43:50] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 20857 [08:44:18] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 28598 [08:44:44] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28598 [08:45:16] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1011.eqiad.wmnet with reason: host reimage [08:45:19] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 29208 [08:45:20] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 29208 [08:45:52] (03PS2) 10Brouberol: global_config: propagate kerberos admin and server hostnames to k8s config [puppet] - 10https://gerrit.wikimedia.org/r/1151121 (https://phabricator.wikimedia.org/T395297) [08:46:03] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 34141 [08:46:18] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 34141 [08:46:20] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:46:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:46:31] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 9002 [08:46:59] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9002 [08:49:34] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10858454 (10MoritzMuehlenhoff) [08:52:39] (03PS2) 10Fabfur: haproxy: truncate isp to 64 bytes, lowecase and change header name [puppet] - 10https://gerrit.wikimedia.org/r/1151011 (https://phabricator.wikimedia.org/T392219) [08:53:34] (03CR) 10Volans: "Left some very minor suggestions and a question." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1150729 (https://phabricator.wikimedia.org/T389086) (owner: 10JMeybohm) [08:54:30] RECOVERY - BGP status on lsw1-f5-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:55:02] RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 99, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:55:28] !log remove ganeti7002 from the magru02 cluster T394263 [08:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:32] T394263: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263 [08:57:44] (03PS1) 10Muehlenhoff: Remove Puppet references to ganeti7002 [puppet] - 10https://gerrit.wikimedia.org/r/1151133 (https://phabricator.wikimedia.org/T394263) [08:57:58] PROBLEM - ganeti-confd running on ganeti7002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [08:58:14] PROBLEM - ganeti-noded running on ganeti7002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [08:58:30] PROBLEM - BGP status on lsw1-f5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:58:43] FIRING: ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:59:30] RECOVERY - BGP status on lsw1-f5-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:00:26] !log klausman@cumin1003 END (PASS) 
- Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1011.eqiad.wmnet with OS bookworm [09:05:05] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudnet2007-dev.codfw.wmnet with OS bookworm [09:05:31] (03PS1) 10Klausman: role::ml_k8s::worker: upgrade ml-serve1010 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1151134 (https://phabricator.wikimedia.org/T387854) [09:05:41] (03PS3) 10Brouberol: airflow: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151132 (https://phabricator.wikimedia.org/T395297) [09:05:45] (03PS3) 10Brouberol: airflow: pull kerberos server values from global values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151131 (https://phabricator.wikimedia.org/T395297) [09:05:48] (03PS3) 10Brouberol: airflow: enable hadoop shell everywhere except ml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151130 (https://phabricator.wikimedia.org/T395297) [09:05:53] (03PS3) 10Brouberol: airflow: move WMF specific values to environment values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151129 (https://phabricator.wikimedia.org/T395297) [09:05:58] (03PS3) 10Brouberol: airflow: assume KubernetesExecutor by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151128 (https://phabricator.wikimedia.org/T395297) [09:06:04] (03PS3) 10Brouberol: airflow: stop repeating paths by defining YAML variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151127 (https://phabricator.wikimedia.org/T395297) [09:06:12] (03PS2) 10Brouberol: airflow: assume that cloudnative is always used [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151126 (https://phabricator.wikimedia.org/T395297) [09:06:17] (03PS2) 10Brouberol: airflow: delete un-necessary fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151125 (https://phabricator.wikimedia.org/T395297) [09:08:20] (03PS2) 10JMeybohm: sre.k8s.wipe-cluster: Verify that k8s service are up after puppet ran 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1150729 (https://phabricator.wikimedia.org/T389086) [09:08:30] (03CR) 10JMeybohm: sre.k8s.wipe-cluster: Verify that k8s service are up after puppet ran (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1150729 (https://phabricator.wikimedia.org/T389086) (owner: 10JMeybohm) [09:11:02] (03CR) 10Brouberol: [C:03+1] hdfs: Exclude group 5 rack F1 hosts from analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1151112 (https://phabricator.wikimedia.org/T390172) (owner: 10Stevemunene) [09:14:53] (03PS4) 10Brouberol: airflow: stop repeating paths by defining YAML variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151127 (https://phabricator.wikimedia.org/T395297) [09:14:53] (03PS4) 10Brouberol: airflow: assume KubernetesExecutor by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151128 (https://phabricator.wikimedia.org/T395297) [09:14:53] (03PS4) 10Brouberol: airflow: move WMF specific values to environment values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151129 (https://phabricator.wikimedia.org/T395297) [09:14:54] (03PS4) 10Brouberol: airflow: enable hadoop shell everywhere except ml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151130 (https://phabricator.wikimedia.org/T395297) [09:14:55] (03PS4) 10Brouberol: airflow: pull kerberos server values from global values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151131 (https://phabricator.wikimedia.org/T395297) [09:14:56] (03PS4) 10Brouberol: airflow: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151132 (https://phabricator.wikimedia.org/T395297) [09:16:17] FIRING: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:17:54] (03CR) 10Volans: [C:03+1] "LGTM cookbook wise, for the k8s specific logic I'll leave it to you." [cookbooks] - 10https://gerrit.wikimedia.org/r/1150729 (https://phabricator.wikimedia.org/T389086) (owner: 10JMeybohm) [09:18:32] (03CR) 10Klausman: [C:03+2] role::ml_k8s::worker: upgrade ml-serve1010 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1151134 (https://phabricator.wikimedia.org/T387854) (owner: 10Klausman) [09:18:34] RESOLVED: ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:19:33] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147105 (owner: 10PipelineBot) [09:22:06] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1010.eqiad.wmnet [09:24:21] (03CR) 10FNegri: wikireplicas scripts: setup pytest, add first test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [09:24:54] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#10858578 (10MatthewVernon) >>! In T394476#10856108, @akosiaris wrote: >> If you want to do some testing, I could set you up with a test account on apus. > > That w... 
[09:25:15] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2007-dev.codfw.wmnet with reason: host reimage [09:25:21] (03CR) 10Stevemunene: [C:03+2] hdfs: Exclude group 5 rack F1 hosts from analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1151112 (https://phabricator.wikimedia.org/T390172) (owner: 10Stevemunene) [09:27:05] (03CR) 10Jgiannelos: "Hm, this is removing the purge rule for /definition/ too so we need to add an event on PCS level first." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150731 (https://phabricator.wikimedia.org/T264670) (owner: 10Jgiannelos) [09:27:11] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1010.eqiad.wmnet [09:28:30] PROBLEM - Host ml-serve1010 is DOWN: PING CRITICAL - Packet loss = 100% [09:29:12] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2007-dev.codfw.wmnet with reason: host reimage [09:30:08] PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:32:05] (03CR) 10Muehlenhoff: [C:03+2] Remove Puppet references to ganeti7002 [puppet] - 10https://gerrit.wikimedia.org/r/1151133 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:32:38] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#10858614 (10elukey) Thanks a lot @MatthewVernon ! >>! In T394476#10858578, @MatthewVernon wrote: >>>! In T394476#10856108, @akosiaris wrote: >>> If you want to do... 
[09:32:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10858615 (10Stevemunene) [09:33:15] !log brouberol@cumin2002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1001.eqiad.wmnet [09:33:50] FIRING: KubernetesCalicoDown: ml-serve1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1010.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:34:18] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1011.eqiad.wmnet [09:34:19] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1011.eqiad.wmnet [09:35:46] RECOVERY - Host ml-serve1010 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [09:36:04] !log klausman@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1010.eqiad.wmnet with OS bookworm [09:36:08] RECOVERY - BGP status on lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:37:21] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10858620 (10ayounsi) [09:37:40] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1148891 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [09:39:34] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1001.eqiad.wmnet [09:39:46] (03PS1) 10Phuedx: EventStreamConfig: Remove xLab development streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151148 (https://phabricator.wikimedia.org/T393918) [09:40:08] PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:41:46] PROBLEM - Hadoop NodeManager on an-worker1148 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:44:33] (03PS1) 10Muehlenhoff: Reimage ganeti7002 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1151149 (https://phabricator.wikimedia.org/T394263) [09:44:46] RECOVERY - Hadoop NodeManager on an-worker1148 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:47:46] PROBLEM - Hadoop NodeManager on an-worker1148 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:47:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151149 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:47:56] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2007-dev.codfw.wmnet with OS bookworm 
[09:49:25] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudnet2008-dev.codfw.wmnet with OS bookworm [09:49:52] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1151121 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [09:51:38] (03CR) 10Brouberol: [V:03+1 C:03+2] global_config: propagate kerberos admin and server hostnames to k8s config [puppet] - 10https://gerrit.wikimedia.org/r/1151121 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [09:52:28] PROBLEM - Hadoop NodeManager on an-worker1155 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:53:43] (03CR) 10Hnowlan: [C:03+1] pcs: Disable changeprop rule for summary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150731 (https://phabricator.wikimedia.org/T264670) (owner: 10Jgiannelos) [09:55:18] (03CR) 10Stevemunene: [C:03+1] airflow: delete un-necessary fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151125 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [09:57:46] klausman@cumin1003 reimage (PID 3414017) is awaiting input [09:58:03] (03CR) 10Stevemunene: [C:03+1] airflow: assume that cloudnative is always used [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151126 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [09:59:00] (03PS3) 10Hnowlan: (api|rest)-gateway: log 5xx errors by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147741 (https://phabricator.wikimedia.org/T394584) [09:59:18] (03CR) 10Stevemunene: [C:03+1] airflow: stop repeating paths by defining YAML variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151127 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [09:59:27] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1151149 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:59:49] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1151118 (owner: 10Slyngshede) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T1000) [10:00:12] (03CR) 10Stevemunene: [C:03+1] airflow: assume KubernetesExecutor by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151128 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:01:33] (03CR) 10Stevemunene: [C:03+1] airflow: move WMF specific values to environment values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151129 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:02:45] (03CR) 10Stevemunene: [C:03+1] airflow: enable hadoop shell everywhere except ml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151130 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:02:45] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151152 [10:04:44] (03CR) 10Stevemunene: [C:03+1] airflow: pull kerberos server values from global values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151131 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:06:27] (03CR) 10Brouberol: [C:03+2] airflow: delete un-necessary fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151125 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:06:30] (03CR) 10Brouberol: [C:03+2] airflow: assume that cloudnative is always used [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151126 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:06:33] (03CR) 10Brouberol: [C:03+2] airflow: stop repeating paths by defining YAML variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151127 
(https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:06:37] (03CR) 10Brouberol: [C:03+2] airflow: assume KubernetesExecutor by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151128 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:06:42] (03CR) 10Brouberol: [C:03+2] airflow: move WMF specific values to environment values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151129 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:06:45] (03CR) 10Stevemunene: [C:03+1] airflow: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151132 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:06:49] (03CR) 10Brouberol: [C:03+2] airflow: enable hadoop shell everywhere except ml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151130 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:06:53] (03CR) 10Brouberol: [C:03+2] airflow: pull kerberos server values from global values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151131 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:06:57] (03CR) 10Brouberol: [C:03+2] airflow: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151132 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:08:11] !log klausman@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1010.eqiad.wmnet with reason: host reimage [10:08:28] RECOVERY - Hadoop NodeManager on an-worker1155 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:08:39] (03Merged) 10jenkins-bot: airflow: delete un-necessary fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151125 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:08:50] (03Merged) 10jenkins-bot: airflow: 
assume that cloudnative is always used [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151126 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:08:51] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2008-dev.codfw.wmnet with reason: host reimage [10:08:52] (03Merged) 10jenkins-bot: airflow: stop repeating paths by defining YAML variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151127 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:08:54] (03Merged) 10jenkins-bot: airflow: assume KubernetesExecutor by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151128 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:08:55] (03Merged) 10jenkins-bot: airflow: move WMF specific values to environment values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151129 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:09:06] (03Merged) 10jenkins-bot: airflow: enable hadoop shell everywhere except ml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151130 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:09:08] (03Merged) 10jenkins-bot: airflow: pull kerberos server values from global values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151131 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:09:09] (03Merged) 10jenkins-bot: airflow: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151132 (https://phabricator.wikimedia.org/T395297) (owner: 10Brouberol) [10:10:24] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151152 (owner: 10PipelineBot) [10:11:28] PROBLEM - Hadoop NodeManager on an-worker1155 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:11:35] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1010.eqiad.wmnet with reason: host reimage [10:11:57] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151152 (owner: 10PipelineBot) [10:13:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:13:58] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker1155.eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 [10:14:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10858801 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3c8643fe-b481-4239-80ee-e7e76ed5671f) set b... [10:14:20] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:14:41] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker1148.eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 [10:14:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10858802 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=af48c4bb-2962-485c-83a5-e464f4d86772) set b... 
[10:15:29] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151011 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [10:15:33] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:15:34] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2008-dev.codfw.wmnet with reason: host reimage [10:15:50] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:16:48] (03PS1) 10Elukey: prometheus: allow N/A in the GPU Power details of the ROCM exporter [puppet] - 10https://gerrit.wikimedia.org/r/1151153 [10:18:13] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [10:18:19] (03CR) 10Brouberol: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1151153 (owner: 10Elukey) [10:18:32] (03CR) 10Muehlenhoff: [C:03+2] Reimage ganeti7002 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1151149 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [10:18:46] (03CR) 10Elukey: [C:03+2] prometheus: allow N/A in the GPU Power details of the ROCM exporter [puppet] - 10https://gerrit.wikimedia.org/r/1151153 (owner: 10Elukey) [10:18:48] PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [10:18:51] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [10:18:58] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [10:19:32] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [10:20:46] (03CR) 10Jgiannelos: [C:03+2] pcs: Disable changeprop rule for summary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150731 (https://phabricator.wikimedia.org/T264670) (owner: 10Jgiannelos) [10:22:48] (03Merged) 10jenkins-bot: pcs: Disable changeprop rule for summary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150731 (https://phabricator.wikimedia.org/T264670) (owner: 10Jgiannelos) [10:24:48] (03CR) 10Effie Mouzeli: "sorted, and tests worked" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) (owner: 10Effie Mouzeli) [10:25:00] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [10:25:27] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:27:08] RECOVERY - BGP status on lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:27:33] 06SRE, 10SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10858817 (10MatthewVernon) [10:27:41] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1010.eqiad.wmnet with OS bookworm [10:28:38] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1010.eqiad.wmnet [10:28:39] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1010.eqiad.wmnet [10:29:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - 
https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [10:29:48] (03PS1) 10Elukey: role::ml_k8s::worker: upgrade ml-serve1005 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1151156 (https://phabricator.wikimedia.org/T387854) [10:30:53] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1005.eqiad.wmnet [10:33:00] (03CR) 10Slyngshede: [C:03+2] P:openldap::management remove ops-limited from validation [puppet] - 10https://gerrit.wikimedia.org/r/1151118 (owner: 10Slyngshede) [10:33:34] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:33:35] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti7002.magru.wmnet with OS bookworm [10:33:45] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10858826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host ganeti7002.magru.wmnet with OS bookworm [10:33:55] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2008-dev.codfw.wmnet with OS bookworm [10:34:50] (03CR) 10Elukey: [C:03+2] role::ml_k8s::worker: upgrade ml-serve1005 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1151156 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [10:35:58] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1005.eqiad.wmnet [10:36:50] (03PS1) 10MVernon: thanos: add new backends to hiera [puppet] - 10https://gerrit.wikimedia.org/r/1151159 (https://phabricator.wikimedia.org/T391352) [10:36:52] (03PS1) 10MVernon: 
thanos: add new backends, drain old ones [puppet] - 10https://gerrit.wikimedia.org/r/1151160 (https://phabricator.wikimedia.org/T391352) [10:37:20] (03PS3) 10Slyngshede: P:idp experimental webauthn [puppet] - 10https://gerrit.wikimedia.org/r/1091237 (https://phabricator.wikimedia.org/T311236) [10:37:27] (03PS1) 10Bunnypranav: core-Permissions:Create reviewer role on eswikivoyage, remove patroller and rollbacker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151161 (https://phabricator.wikimedia.org/T395293) [10:38:23] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5684/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091237 (https://phabricator.wikimedia.org/T311236) (owner: 10Slyngshede) [10:38:28] RECOVERY - Hadoop NodeManager on an-worker1155 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:40:09] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [10:40:51] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [10:40:58] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [10:41:20] (03PS6) 10FNegri: wikireplicas: split db config from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) [10:41:20] (03PS16) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T395266) [10:41:23] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [10:41:34] (03CR) 10JMeybohm: "I'm wondering: Will mw-experimental mount the GeoIP path as well? 
If so, you will unfortunately have to include those paths in this policy" [deployment-charts] - https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) (owner: Effie Mouzeli)
[10:41:48] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[10:42:57] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1005.eqiad.wmnet with OS bookworm
[10:43:20] (CR) FNegri: wikireplicas: split db config from maintain-views (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) (owner: FNegri)
[10:43:27] (PS5) Effie Mouzeli: admin_ng: add mw-experimental namespace with hostPath support [deployment-charts] - https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994)
[10:44:14] (PS6) Effie Mouzeli: admin_ng: add mw-experimental namespace with hostPath support [deployment-charts] - https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994)
[10:44:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[10:45:42] (CR) JMeybohm: [C:+1] validating-admission-policies: fix typo in Makefile [deployment-charts] - https://gerrit.wikimedia.org/r/1150749 (owner: Effie Mouzeli)
[10:45:47] (CR) Effie Mouzeli: "I was afraid you would say that, and according to the internet, you are most likely right. I will sort it, while crying."
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) (owner: 10Effie Mouzeli) [10:46:03] (03CR) 10FNegri: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5685/co" [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [10:46:05] (03CR) 10Vgutierrez: [C:03+1] haproxy: truncate isp to 64 bytes, lowecase and change header name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1151011 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [10:46:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:46:26] (03CR) 10JMeybohm: admin_ng: add ValidatingAdmissionPolicy to permit hostPath mounts for mediawiki (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) (owner: 10Effie Mouzeli) [10:46:38] PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:47:27] (03CR) 10JMeybohm: [C:04-1] admin_ng: add mw-experimental namespace with hostPath support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [10:55:59] (03PS1) 10MVernon: hiera: add apus-be2004 to codfw apus cluster [puppet] - 10https://gerrit.wikimedia.org/r/1151166 (https://phabricator.wikimedia.org/T391354) [10:56:50] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7002.magru.wmnet with reason: host reimage [10:56:56] 
(03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151166 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [10:58:15] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudservices2004-dev.codfw.wmnet [10:58:43] (03PS2) 10MVernon: hiera: add apus-be2004 to codfw apus cluster [puppet] - 10https://gerrit.wikimedia.org/r/1151166 (https://phabricator.wikimedia.org/T391354) [10:58:53] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151166 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [11:00:59] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:01:11] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:02:55] (03CR) 10Kamila Součková: [C:03+1] "\o/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147741 (https://phabricator.wikimedia.org/T394584) (owner: 10Hnowlan) [11:03:02] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7002.magru.wmnet with reason: host reimage [11:03:59] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:04:11] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:04:14] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2004-dev.codfw.wmnet [11:05:06] (03CR) 10Fabfur: haproxy: truncate isp to 64 bytes, lowecase and change header name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1151011 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [11:05:31] (03CR) 10ScheduleDeploymentBot: 
"Scheduled for deployment in the [Tuesday, May 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151161 (https://phabricator.wikimedia.org/T395293) (owner: 10Bunnypranav) [11:13:52] (03PS7) 10Effie Mouzeli: admin_ng: add mw-experimental namespace with hostPath support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) [11:14:07] (03CR) 10Effie Mouzeli: [C:03+2] validating-admission-policies: fix typo in Makefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150749 (owner: 10Effie Mouzeli) [11:15:20] (03Merged) 10jenkins-bot: validating-admission-policies: fix typo in Makefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150749 (owner: 10Effie Mouzeli) [11:17:11] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1005.eqiad.wmnet with reason: host reimage [11:18:09] (03PS8) 10Effie Mouzeli: admin_ng: add ValidatingAdmissionPolicy to permit hostPath mounts for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) [11:18:14] (03CR) 10CI reject: [V:04-1] admin_ng: add mw-experimental namespace with hostPath support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [11:19:11] (03PS8) 10Effie Mouzeli: admin_ng: add mw-experimental namespace with hostPath support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) [11:21:04] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1005.eqiad.wmnet with reason: host reimage [11:25:47] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7002.magru.wmnet with OS bookworm [11:25:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to 
routed Ganeti - https://phabricator.wikimedia.org/T394263#10858935 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host ganeti7002.magru.wmnet with OS bookworm completed: - ganeti7002 (**PASS**) - Dow... [11:28:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [11:28:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T395241)', diff saved to https://phabricator.wikimedia.org/P76467 and previous config saved to /var/cache/conftool/dbconfig/20250527-112848-fceratto.json [11:36:01] (03PS1) 10Muehlenhoff: routed_ganeti: Configure nftables on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1151176 [11:36:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T395241)', diff saved to https://phabricator.wikimedia.org/P76468 and previous config saved to /var/cache/conftool/dbconfig/20250527-113619-fceratto.json [11:37:48] (03PS1) 10Jgiannelos: pcs/RB sunset: Remove unnecessary definition rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151177 [11:39:15] (03PS2) 10Jgiannelos: pcs/RB sunset: Remove unnecessary definition rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151177 [11:39:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151176 (owner: 10Muehlenhoff) [11:39:43] (03PS1) 10Slyngshede: Bump CAS container to 7.2.2 [software/bitu] - 10https://gerrit.wikimedia.org/r/1151178 [11:44:59] (03CR) 10Jcrespo: [C:03+1] thanos: add new backends to hiera [puppet] - 10https://gerrit.wikimedia.org/r/1151159 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon) [11:46:19] (03CR) 10MVernon: [C:03+2] thanos: add new backends to hiera [puppet] - 10https://gerrit.wikimedia.org/r/1151159 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon) [11:46:25] (03CR) 10Federico Ceratto: [C:03+1] "I see 
thanos-be1006.eqiad.wmnet to thanos-be1009.eqiad.wmnet (4 hosts) being added to thanos backend." [puppet] - 10https://gerrit.wikimedia.org/r/1151159 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon) [11:48:38] (03PS21) 10Arnaudb: gerrit: lock, preflight checks, hieradata lookups, verbosity [cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) [11:49:30] (03CR) 10Federico Ceratto: [C:03+1] "I see 4 hosts thanos-be1001 to thanos-be1004 being drained, plus 4 hosts thanos-be1006 to thanos-be1009 being added to prod24_ng" [puppet] - 10https://gerrit.wikimedia.org/r/1151160 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon) [11:51:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P76469 and previous config saved to /var/cache/conftool/dbconfig/20250527-115127-fceratto.json [11:52:08] (03CR) 10Jcrespo: [C:03+1] hiera: add apus-be2004 to codfw apus cluster [puppet] - 10https://gerrit.wikimedia.org/r/1151166 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [11:52:45] !log reboot thanos-be100[6-9] before bringing into the rings T391352 [11:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:50] T391352: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352 [11:53:46] (03CR) 10MVernon: [C:03+2] hiera: add apus-be2004 to codfw apus cluster [puppet] - 10https://gerrit.wikimedia.org/r/1151166 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [11:54:24] PROBLEM - Host thanos-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [11:54:38] PROBLEM - Host thanos-be1009 is DOWN: PING CRITICAL - Packet loss = 100% [11:55:04] PROBLEM - Host thanos-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [11:55:06] PROBLEM - Host thanos-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [11:55:11] ? 
[11:55:16] RECOVERY - Host thanos-be1009 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[11:55:18] this doesn't look good?
[11:55:30] rebooted those 4 hosts before bringing them into production (see my !log above); sorry, should have also d/timed
[11:55:34] RECOVERY - Host thanos-be1006 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms
[11:55:34] RECOVERY - Host thanos-be1007 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms
[11:55:44] oh, sorry, missed the update
[11:55:54] RECOVERY - Host thanos-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[11:56:48] (CR) MVernon: [C:+2] thanos: add new backends, drain old ones [puppet] - https://gerrit.wikimedia.org/r/1151160 (https://phabricator.wikimedia.org/T391352) (owner: MVernon)
[11:59:23] !log installing nodejs security updates
[11:59:24] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1119.eqiad.wmnet
[11:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:01] SRE, SRE-swift-storage, Patch-For-Review: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10859020 (MatthewVernon)
[12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T1200)
[12:00:55] !log ceph orch apply to bring apus-be2004 into service T391354
[12:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:59] T391354: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354
[12:01:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[12:02:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[12:03:05] SRE, SRE-swift-storage, Ceph, Patch-For-Review: Q4 object storage hardware tasks -
https://phabricator.wikimedia.org/T391354#10859029 (10MatthewVernon) [12:03:34] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:24] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10859030 (10MatthewVernon) [12:06:34] (03CR) 10Cory Massaro: functions-orchestrator: add mcrouter module (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149633 (https://phabricator.wikimedia.org/T391986) (owner: 10Effie Mouzeli) [12:06:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P76470 and previous config saved to /var/cache/conftool/dbconfig/20250527-120635-fceratto.json [12:08:41] (03CR) 10Cory Massaro: [C:03+1] "Thank you! This looks good to me; leaving to @gchoi@wikimedia.org to +2." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1150666 (https://phabricator.wikimedia.org/T391986) (owner: 10Effie Mouzeli) [12:08:44] RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:08:45] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1005.eqiad.wmnet with OS bookworm [12:14:41] jouncebot: nowandnext [12:14:41] For the next 0 hour(s) and 45 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T1200) [12:14:41] In 0 hour(s) and 45 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T1300) [12:14:55] (03CR) 10Muehlenhoff: [C:03+2] routed_ganeti: Configure nftables on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1151176 (owner: 10Muehlenhoff) [12:18:26] (03PS2) 10Majavah: prometheus: Fix homepage redirect [puppet] - 10https://gerrit.wikimedia.org/r/1146973 [12:19:46] PROBLEM - SSH on an-worker1119 is CRITICAL: connect to address 10.64.5.8 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:19:52] (03CR) 10Majavah: "I missed that, thanks! Updated the patch. 
(Long-term I think there are some opportunities here to clean up the code, for example the rewri" [puppet] - 10https://gerrit.wikimedia.org/r/1146973 (owner: 10Majavah) [12:20:17] (03PS1) 10Marostegui: db2*: Remove sanitarium masters [puppet] - 10https://gerrit.wikimedia.org/r/1151185 (https://phabricator.wikimedia.org/T394884) [12:21:03] (03PS1) 10Muehlenhoff: routed_ganeti: Move more common settings to the common hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/1151186 (https://phabricator.wikimedia.org/T394263) [12:21:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T395241)', diff saved to https://phabricator.wikimedia.org/P76472 and previous config saved to /var/cache/conftool/dbconfig/20250527-122142-fceratto.json [12:22:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [12:22:05] (03CR) 10Marostegui: [C:03+2] db2*: Remove sanitarium masters [puppet] - 10https://gerrit.wikimedia.org/r/1151185 (https://phabricator.wikimedia.org/T394884) (owner: 10Marostegui) [12:22:19] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [12:22:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T395241)', diff saved to https://phabricator.wikimedia.org/P76473 and previous config saved to /var/cache/conftool/dbconfig/20250527-122226-fceratto.json [12:22:30] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 20115 [12:23:00] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 20115 [12:28:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T395241)', diff saved to https://phabricator.wikimedia.org/P76474 and previous config saved to /var/cache/conftool/dbconfig/20250527-122858-fceratto.json [12:30:51] 
(03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151186 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [12:35:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150665 (https://phabricator.wikimedia.org/T395193) (owner: 10Anzx) [12:37:00] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudlb2003-dev.codfw.wmnet with OS bookworm [12:39:23] (03PS2) 10Muehlenhoff: routed_ganeti: Move more common settings to the common hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/1151186 (https://phabricator.wikimedia.org/T394263) [12:39:50] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151196 [12:40:12] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:41:06] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:42:17] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1119.eqiad.wmnet [12:42:20] (03CR) 10Muehlenhoff: "One note on the GID allocation and a question: Will there be also any sudo rules assigned as a followup or is this solely for the group me" [puppet] - 10https://gerrit.wikimedia.org/r/1150654 (https://phabricator.wikimedia.org/T395125) (owner: 10Brouberol) [12:42:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151186 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [12:44:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P76475 and 
previous config saved to /var/cache/conftool/dbconfig/20250527-124406-fceratto.json [12:45:31] (03PS9) 10Effie Mouzeli: validating-admission-policies: add policy to permit hostPath mounts for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) [12:45:58] (03CR) 10Effie Mouzeli: validating-admission-policies: add policy to permit hostPath mounts for mediawiki (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) (owner: 10Effie Mouzeli) [12:46:18] (03PS10) 10Effie Mouzeli: validating-admission-policies: add policy to permit hostPath mounts for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) [12:46:21] (03PS1) 10Marostegui: mariadb: Move db2186 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/1151198 (https://phabricator.wikimedia.org/T394884) [12:46:41] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1005.eqiad.wmnet [12:46:42] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1005.eqiad.wmnet [12:46:46] (03CR) 10Fabfur: haproxy: truncate isp to 64 bytes, lowecase and change header name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1151011 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [12:46:47] (03CR) 10Fabfur: [C:03+2] haproxy: truncate isp to 64 bytes, lowecase and change header name [puppet] - 10https://gerrit.wikimedia.org/r/1151011 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [12:47:17] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1006.eqiad.wmnet [12:47:57] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2191.codfw.wmnet with reason: Maintenance [12:48:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [12:48:12] (03PS1) 10Elukey: role::ml_k8s::worker: upgrade ml-serve1006 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1151199 (https://phabricator.wikimedia.org/T387854) [12:48:32] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2191.codfw.wmnet [12:48:41] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for db2191.codfw.wmnet [12:49:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2191 T395241', diff saved to https://phabricator.wikimedia.org/P76476 and previous config saved to /var/cache/conftool/dbconfig/20250527-124929-marostegui.json [12:49:34] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2191.codfw.wmnet [12:49:42] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for db2191.codfw.wmnet [12:51:16] (03CR) 10Elukey: [C:03+2] role::ml_k8s::worker: upgrade ml-serve1006 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1151199 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [12:51:29] (03PS1) 10Ayounsi: Add alerting for network side routed Ganeti BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151200 (https://phabricator.wikimedia.org/T394263) [12:52:42] (03CR) 10CI reject: [V:04-1] Add alerting for network side routed Ganeti BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151200 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [12:52:50] (03PS1) 10Muehlenhoff: Apply ganeti_routed role to ganeti7002 [puppet] - 10https://gerrit.wikimedia.org/r/1151201 (https://phabricator.wikimedia.org/T394263) [12:53:42] (03PS1) 10Klausman: role::ml_k8s::worker: upgrade ml-serve1009 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1151202 (https://phabricator.wikimedia.org/T387854) [12:54:43] (03CR) 10Kamila Součková: [C:03+1] sre.k8s.wipe-cluster: Verify that k8s service are up after puppet ran [cookbooks] - 
10https://gerrit.wikimedia.org/r/1150729 (https://phabricator.wikimedia.org/T389086) (owner: 10JMeybohm) [12:56:03] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1009.eqiad.wmnet [12:56:22] (03PS11) 10Effie Mouzeli: validating-admission-policies: add policy to permit hostPath mounts for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) [12:56:36] (03CR) 10Klausman: [C:03+2] role::ml_k8s::worker: upgrade ml-serve1009 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1151202 (https://phabricator.wikimedia.org/T387854) (owner: 10Klausman) [12:56:42] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb2003-dev.codfw.wmnet with reason: host reimage [12:56:56] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2191.codfw.wmnet [12:57:06] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool db2191 - Upgrading db2191.codfw.wmnet [12:57:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2191 - Upgrading db2191.codfw.wmnet [12:57:22] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1006.eqiad.wmnet [12:58:08] !log taavi@cumin1002 START - Cookbook sre.dns.netbox [12:58:20] (03CR) 10JMeybohm: [C:03+2] sre.k8s.wipe-cluster: Verify that k8s service are up after puppet ran [cookbooks] - 10https://gerrit.wikimedia.org/r/1150729 (https://phabricator.wikimedia.org/T389086) (owner: 10JMeybohm) [12:59:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P76479 and previous config saved to /var/cache/conftool/dbconfig/20250527-125913-fceratto.json [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T1300) [13:00:04] bunnypranav, Tchanders, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] o/ [13:00:19] o/ [13:01:07] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1009.eqiad.wmnet [13:01:17] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1006.eqiad.wmnet with OS bookworm [13:01:18] !log installing intel-microcode security updates on Bullseye [13:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:56] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove cloudnet2007/8 cloud-private dns records for now - taavi@cumin1002" [13:02:00] !log klausman@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1009.eqiad.wmnet with OS bookworm [13:02:02] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb2003-dev.codfw.wmnet with reason: host reimage [13:02:13] !log taavi@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove cloudnet2007/8 cloud-private dns records for now - taavi@cumin1002" [13:02:13] !log taavi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:02:15] o/ [13:02:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2191.codfw.wmnet [13:03:03] I can deploy in a moment if nobody else is around to deploy [13:03:23] !log klausman@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1009.eqiad.wmnet with OS bookworm [13:03:58] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera 
generate netbox hiera data: "add virtual-magru networks - taavi@cumin1002" [13:04:16] PROBLEM - Host an-worker1119 is DOWN: PING CRITICAL - Packet loss = 100% [13:04:36] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add virtual-magru networks - taavi@cumin1002" [13:04:36] (03PS2) 10Ayounsi: Add alerting for network side routed Ganeti BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151200 (https://phabricator.wikimedia.org/T394263) [13:04:37] Tchanders: with SpiderPig, any device with a browser can be your deploy machine ;) [13:04:46] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:04:48] * Lucas_WMDE takes off the marketing hat [13:04:53] except you still need to SSH in for an OTP :/ [13:05:05] true, you need to have set that up [13:05:09] Lucas_WMDE: I thought that, but I'm being asked to ssh in to get a password (though I did log in a few days ago) [13:05:11] !log klausman@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1009.eqiad.wmnet with OS bookworm [13:05:24] Should I only need to get the password once? 
[13:05:45] at least until you log out of spiderpig, afaik [13:05:48] (03CR) 10CI reject: [V:04-1] Add alerting for network side routed Ganeti BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151200 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [13:06:06] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:06:12] RECOVERY - Host an-worker1119 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [13:06:18] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:06:20] (03Merged) 10jenkins-bot: sre.k8s.wipe-cluster: Verify that k8s service are up after puppet ran [cookbooks] - 10https://gerrit.wikimedia.org/r/1150729 (https://phabricator.wikimedia.org/T389086) (owner: 10JMeybohm) [13:06:21] I must've logged out :( [13:06:46] RECOVERY - SSH on an-worker1119 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:06:51] bunnypranav: are you there? [13:07:31] * Lucas_WMDE tries to understand the IPInfo config changes [13:09:26] (03PS1) 10Effie Mouzeli: admin_ng: add policy for /srv/mediawiki hostPath mounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151208 (https://phabricator.wikimedia.org/T395225) [13:09:31] (03CR) 10Gehel: [C:03+1] "Could you already create the patch to re-enable monitoring once we have completed the migration?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1148402 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:09:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1159.eqiad.wmnet with reason: Maintenance [13:09:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1159 (T395241)', diff saved to https://phabricator.wikimedia.org/P76480 and previous config saved to /var/cache/conftool/dbconfig/20250527-130947-fceratto.json [13:10:19] (03CR) 10Ayounsi: [C:03+1] routed_ganeti: Move more common settings to the common hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/1151186 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:10:48] Hey Lucas_WMDE, sorry for the delay. [13:10:48] (03CR) 10Muehlenhoff: [C:03+2] routed_ganeti: Move more common settings to the common hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/1151186 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:11:00] Are you willing to deploy now? [13:11:01] (03CR) 10Lucas Werkmeister (WMDE): Update IPInfo access levels (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [13:11:13] yeah, let me just take a look at the change [13:11:15] and the task, probably [13:11:34] Ok sure. [13:12:54] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10859334 (10Fabfur) Hi @Jhancock.wm, is there a rough timeline for this tasks so we can organize our work around it? 
[13:13:02] the old groups aren’t empty yet and I have no idea how MediaWiki handles removal of a group with members [13:13:53] Lucas_WMDE: https://www.mediawiki.org/wiki/Manual:EmptyUserGroup.php [13:14:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T395241)', diff saved to https://phabricator.wikimedia.org/P76481 and previous config saved to /var/cache/conftool/dbconfig/20250527-131420-fceratto.json [13:14:40] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [13:14:43] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1004768 is the previous user group removal I could find [13:14:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T395241)', diff saved to https://phabricator.wikimedia.org/P76482 and previous config saved to /var/cache/conftool/dbconfig/20250527-131447-fceratto.json [13:14:59] anzx: no SAL entries for that maintenance script AFAICT 🤔 [13:15:06] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:15:12] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:15:24] Lucas_WMDE: https://phabricator.wikimedia.org/T356012#9556109 [13:15:41] nevermind, https://sal.toolforge.org/log/mSvkwo0BhuQtenzvifPf [13:15:43] The same commit's task shows the emptyUserGroup.php usage. [13:16:09] https://sal.toolforge.org/production?p=0&q=emptyUserGroup.php&d= has results, https://sal.toolforge.org/production?p=0&q=emptyUserGroup&d= doesn’t [13:16:12] yeah. 
[13:16:17] FIRING: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:16:18] I wonder if we can tweak the elasticsearch config to fix that [13:16:19] but anyway [13:16:40] (03CR) 10Muehlenhoff: [C:03+2] Apply ganeti_routed role to ganeti7002 [puppet] - 10https://gerrit.wikimedia.org/r/1151201 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:17:06] looks like emptyUserGroup only supports one group at a time so I’ll have to run it twice [13:17:17] We relied on emptyUserGroup first for our configs too - ran it just before the window so as not to hold things up [13:17:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T395241)', diff saved to https://phabricator.wikimedia.org/P76483 and previous config saved to /var/cache/conftool/dbconfig/20250527-131723-fceratto.json [13:18:32] I’ll let the wayback machine snapshot ListUsers just in case [13:18:43] so they know who to give the new group to [13:19:06] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:19:12] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:19:15] The new group has more trustworthy perms, so that group can be scrapped. [13:19:52] Lucas_WMDE: Actually, the task creator said they'll remove all users from the two groups, but seems to have missed/forgotten some. 
[13:20:06] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "LGTM, should be okay to deploy once the old groups have been emptied via the `emptyUserGroup` maintenance script" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151161 (https://phabricator.wikimedia.org/T395293) (owner: 10Bunnypranav) [13:21:07] (03CR) 10Brouberol: "No sudo rule AFAIK, this is just about group membership and file permissions." [puppet] - 10https://gerrit.wikimedia.org/r/1150654 (https://phabricator.wikimedia.org/T395125) (owner: 10Brouberol) [13:21:07] (03PS2) 10Brouberol: admin/data: create an airflow-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1150654 (https://phabricator.wikimedia.org/T395125) [13:21:07] (03PS2) 10Brouberol: airflow-dev: make kubeconfig group-owned by the airflow-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1150655 (https://phabricator.wikimedia.org/T395125) [13:21:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T395241)', diff saved to https://phabricator.wikimedia.org/P76484 and previous config saved to /var/cache/conftool/dbconfig/20250527-132122-fceratto.json [13:21:53] !log taavi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudlb2003-dev.codfw.wmnet with OS bookworm [13:22:06] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:22:12] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:22:38] !log klausman@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1009.eqiad.wmnet with reason: host reimage [13:23:01] !log lucaswerkmeister-wmde@deploy1003 ~ $ mwscript-k8s --comment=T395293 --follow -- emptyUserGroup eswikivoyage rollbacker # removed 5 users in total [13:23:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:06] T395293: Create 'revisor' role on eswikivoyage - https://phabricator.wikimedia.org/T395293 [13:23:14] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1006.eqiad.wmnet with reason: host reimage [13:23:24] !log lucaswerkmeister-wmde@deploy1003 ~ $ mwscript-k8s --comment=T395293 --follow -- emptyUserGroup eswikivoyage patroller # removed 3 users in total [13:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:43] (03PS1) 10Muehlenhoff: Initial cluster config for ganeti03 [puppet] - 10https://gerrit.wikimedia.org/r/1151209 (https://phabricator.wikimedia.org/T394263) [13:23:54] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151209 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:24:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151161 (https://phabricator.wikimedia.org/T395293) (owner: 10Bunnypranav) [13:24:26] SpiderPig #101 \o/ [13:24:37] (hashar got #100 earlier this morning ^^) [13:24:47] (03CR) 10CI reject: [V:04-1] Initial cluster config for ganeti03 [puppet] - 10https://gerrit.wikimedia.org/r/1151209 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:25:18] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1009.eqiad.wmnet with reason: host reimage [13:25:27] (03PS1) 10Anzx: slwikibooks: update tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151210 (https://phabricator.wikimedia.org/T393551) [13:25:34] (03Merged) 10jenkins-bot: core-Permissions:Create reviewer role on eswikivoyage, remove patroller and rollbacker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151161 (https://phabricator.wikimedia.org/T395293) (owner: 10Bunnypranav) [13:25:59] !log 
lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1151161|core-Permissions:Create reviewer role on eswikivoyage, remove patroller and rollbacker (T395293)]] [13:26:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151210 (https://phabricator.wikimedia.org/T393551) (owner: 10Anzx) [13:26:50] (03PS2) 10Muehlenhoff: Initial cluster config for ganeti03 [puppet] - 10https://gerrit.wikimedia.org/r/1151209 (https://phabricator.wikimedia.org/T394263) [13:28:01] (03CR) 10CI reject: [V:04-1] Initial cluster config for ganeti03 [puppet] - 10https://gerrit.wikimedia.org/r/1151209 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:28:04] !log lucaswerkmeister-wmde@deploy1003 bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1151161|core-Permissions:Create reviewer role on eswikivoyage, remove patroller and rollbacker (T395293)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:28:09] T395293: Create 'revisor' role on eswikivoyage - https://phabricator.wikimedia.org/T395293 [13:28:11] bunnypranav: please test :) [13:28:59] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1006.eqiad.wmnet with reason: host reimage [13:29:11] On it! [13:29:38] (03PS3) 10Ayounsi: Add alerting for network side routed Ganeti BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151200 (https://phabricator.wikimedia.org/T394263) [13:29:40] (03PS3) 10Muehlenhoff: Initial cluster config for ganeti03 [puppet] - 10https://gerrit.wikimedia.org/r/1151209 (https://phabricator.wikimedia.org/T394263) [13:30:12] Lucas_WMDE: All good, please continue. 
[13:30:16] !log lucaswerkmeister-wmde@deploy1003 bunnypranav, lucaswerkmeister-wmde: Continuing with sync [13:30:18] great, thanks [13:30:25] Thanks for the deploy! [13:30:35] Tchanders: FYI, I left some comments on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1146969, in case you didn’t see them yet [13:30:52] Lucas_WMDE: i have added one more patch, both can be done together [13:31:47] ack [13:32:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P76485 and previous config saved to /var/cache/conftool/dbconfig/20250527-133231-fceratto.json [13:33:05] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudlb2002-dev.codfw.wmnet with OS bookworm [13:33:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151209 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:33:26] (03CR) 10Tchanders: Update IPInfo access levels (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [13:33:40] Lucas_WMDE: Thanks, I've replied. 
[13:33:42] (03PS5) 10Máté Szabó: Update IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) [13:33:51] My 2 patches can also go together btw [13:33:51] (03CR) 10Máté Szabó: Update IPInfo access levels (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [13:34:22] I’m trying to see what MediaWiki will do with $wgAddGroups['sysop'] including a group even on wikis where (IIUC) the group isn’t defined [13:34:25] so far my localhost isn’t behavior [13:34:27] *behaving [13:35:28] I also have a meeting in 10 minutes so I’m not sure I’ll be able to deploy anything else :/ [13:35:28] (03CR) 10Tchanders: [C:03+1] Update IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [13:36:12] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:36:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P76486 and previous config saved to /var/cache/conftool/dbconfig/20250527-133629-fceratto.json [13:36:54] jouncebot: nowandnext [13:36:54] For the next 0 hour(s) and 23 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T1300) [13:36:54] In 1 hour(s) and 23 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T1500) [13:37:04] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:37:12] temporary-account-viewer exists on all wikis [13:37:14] 
https://en.wikipedia.org/wiki/Special:ListGroupRights [13:37:25] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151161|core-Permissions:Create reviewer role on eswikivoyage, remove patroller and rollbacker (T395293)]] (duration: 11m 25s) [13:37:30] T395293: Create 'revisor' role on eswikivoyage - https://phabricator.wikimedia.org/T395293 [13:37:46] Tchanders: I thought it was provided by CheckUser? [13:37:47] (03CR) 10Marostegui: [C:03+2] mariadb: Move db2186 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/1151198 (https://phabricator.wikimedia.org/T394884) (owner: 10Marostegui) [13:37:48] Lucas_WMDE: Do you mean no more including mine, or no more after mine? (Thank you either way!) [13:37:51] or did I misunderstand the other change [13:37:59] yes, it's from CheckUser [13:37:59] Tchanders: possibly no more including yours :( [13:38:38] wmgUseCheckUser is on in all production [13:38:45] No worries - thanks for helping! [13:39:57] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10859399 (10Volans) Yes I think it makes sense to modify `BiosNvmeDriver` from `DellQualifiedDrives` to `AllDrives` when present at this point. Is it ok to leave it as all dr... 
[13:40:07] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "LGTM but I may not be able to deploy it today" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [13:40:38] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1150654 (https://phabricator.wikimedia.org/T395125) (owner: 10Brouberol) [13:41:23] (03CR) 10Lucas Werkmeister (WMDE): Temp accounts: Allow sysop to grant and revoke IP reveal (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [13:41:46] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 26, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:41:49] (03CR) 10Dreamy Jazz: [C:03+1] Update IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [13:41:54] Tchanders: are we ready to deploy, or is someone syncing changes now? [13:42:20] I believe we're good to go - we were about to deploy next [13:42:24] ok [13:42:31] SpiderPig should say? 
[13:42:32] (03CR) 10Dreamy Jazz: [C:03+1] Temp accounts: Allow sysop to grant and revoke IP reveal (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [13:42:32] I’m done deploying [13:42:35] (03CR) 10Dreamy Jazz: Temp accounts: Allow sysop to grant and revoke IP reveal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [13:42:37] sorry i couldn’t get more done [13:42:38] (03CR) 10Dreamy Jazz: [C:04-1] Temp accounts: Allow sysop to grant and revoke IP reveal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [13:43:06] kostajh: do you want to take over? [13:43:08] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1009.eqiad.wmnet with OS bookworm [13:43:12] Lucas_WMDE: yes [13:43:15] ok :) [13:43:18] Tchanders: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1149699 has a -1 [13:43:19] Lucas_WMDE: No worries, one of our team will do it! [13:43:29] Tchanders: I'll start with https://gerrit.wikimedia.org/r/c/1146969/, ok ? [13:43:53] Ok - I'll fix the other one... 
[13:43:59] heh, I just noticed those two changes have almost the same number, only one pair of adjacent digits transmuted ^^ [13:44:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [13:44:46] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:45:29] (03Merged) 10jenkins-bot: Update IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [13:45:42] FIRING: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:45:44] (03CR) 10Dreamy Jazz: [C:04-1] Temp accounts: Allow sysop to grant and revoke IP reveal (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [13:45:46] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 26, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:45:53] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1146969|Update IPInfo access levels (T375086)]] [13:45:57] T375086: Bring IP Info access permissions to parity with the IP Reveal feature - https://phabricator.wikimedia.org/T375086 [13:46:17] (03CR) 10Bking: [C:03+2] "Sure! We've already noted it in https://phabricator.wikimedia.org/T391350 as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/1148402 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:46:36] (03CR) 10Dreamy Jazz: [C:04-1] Temp accounts: Allow sysop to grant and revoke IP reveal (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [13:46:43] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db2191.codfw.wmnet onto db2186.codfw.wmnet [13:46:55] (03PS4) 10Ayounsi: Initial cluster config for ganeti03 [puppet] - 10https://gerrit.wikimedia.org/r/1151209 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:47:17] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1009.eqiad.wmnet [13:47:18] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1009.eqiad.wmnet [13:47:28] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151209 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:47:30] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1008.eqiad.wmnet [13:47:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P76487 and previous config saved to /var/cache/conftool/dbconfig/20250527-134738-fceratto.json [13:47:59] !log kharlan@deploy1003 mszabo, kharlan: Backport for [[gerrit:1146969|Update IPInfo access levels (T375086)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:48:00] (03PS1) 10Marostegui: instances.yaml: Add db2186 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1151217 (https://phabricator.wikimedia.org/T394884) [13:48:55] Tchanders mszabo please verify the change [13:49:05] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1006.eqiad.wmnet with OS bookworm [13:49:19] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db2186 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1151217 (https://phabricator.wikimedia.org/T394884) (owner: 10Marostegui) [13:50:30] (03PS1) 10Klausman: role::ml_k8s::worker: upgrade ml-serve1008 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1151216 (https://phabricator.wikimedia.org/T387854) [13:51:34] kostajh: Looks good [13:51:40] (03CR) 10Alexandros Kosiaris: [C:03+1] wikifunctions: enable mcrouter for orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150666 (https://phabricator.wikimedia.org/T391986) (owner: 10Effie Mouzeli) [13:51:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db2186 to dbctl depooled T394884', diff saved to https://phabricator.wikimedia.org/P76488 and previous config saved to /var/cache/conftool/dbconfig/20250527-135141-marostegui.json [13:51:46] T394884: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884 [13:51:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P76489 and previous config saved to /var/cache/conftool/dbconfig/20250527-135151-fceratto.json [13:52:13] Tchanders: thanks, will give mszabo a few more minutes [13:52:34] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1008.eqiad.wmnet [13:53:08] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb2002-dev.codfw.wmnet with reason: host reimage [13:53:10] kostajh: looks ok, the group rights are unchanged on enwiki and
changed as described on fawiktionary [13:53:52] !log kharlan@deploy1003 mszabo, kharlan: Continuing with sync [13:54:01] thanks, syncing [13:55:31] !log klausman@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1008.eqiad.wmnet with OS bookworm [13:56:19] (03PS17) 10Ssingh: conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [13:56:30] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb2002-dev.codfw.wmnet with reason: host reimage [13:57:59] (03CR) 10JMeybohm: "A comment and some whitespace nits here and there" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [13:59:47] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:00:14] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 4 CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [14:00:33] (03CR) 10JMeybohm: [C:03+1] wmnet: map os-reports to aux ingress [dns] - 10https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [14:00:51] (03CR) 10Alexandros Kosiaris: functions-orchestrator: add mcrouter module (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149633 (https://phabricator.wikimedia.org/T391986) (owner: 10Effie Mouzeli) [14:01:10] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146969|Update IPInfo access levels (T375086)]] (duration: 15m 16s) [14:01:14] T375086: Bring IP Info access permissions to parity with the IP Reveal feature - https://phabricator.wikimedia.org/T375086 [14:01:20] I'm done with deploys 
[14:01:23] and handing it over to mszabo [14:01:27] (03CR) 10JMeybohm: [C:03+1] "Whitespace nit, otherwise fine by me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151208 (https://phabricator.wikimedia.org/T395225) (owner: 10Effie Mouzeli) [14:01:48] let's go [14:01:51] (03CR) 10Ssingh: "Addressed some of my own comments here to get this moving. NOOP for non-single backend sites as expected with diffs for single backend sit" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [14:02:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T395241)', diff saved to https://phabricator.wikimedia.org/P76490 and previous config saved to /var/cache/conftool/dbconfig/20250527-140244-fceratto.json [14:03:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:03:10] (03CR) 10Máté Szabó: [C:03+1] Temp accounts: Allow sysop to grant and revoke IP reveal (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [14:03:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:03:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T395241)', diff saved to https://phabricator.wikimedia.org/P76491 and previous config saved to /var/cache/conftool/dbconfig/20250527-140328-fceratto.json [14:04:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [14:04:07] (03CR) 10JMeybohm: "Bunch of whitespace nits" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) 
(owner: Effie Mouzeli) [14:04:56] (Merged) jenkins-bot: Temp accounts: Allow sysop to grant and revoke IP reveal [mediawiki-config] - https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: Tchanders) [14:05:20] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1149699|Temp accounts: Allow sysop to grant and revoke IP reveal (T390942)]] [14:05:21] (CR) Ssingh: "Addressed in current CR." [puppet] - https://gerrit.wikimedia.org/r/1114074 (owner: BCornwall) [14:05:24] T390942: Allow IP viewer temporary account group to be manually granted on all projects - https://phabricator.wikimedia.org/T390942 [14:05:42] RESOLVED: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:41] (PS1) Tchanders: Temp accounts: Remove temporary-account-viewer from labs config [mediawiki-config] - https://gerrit.wikimedia.org/r/1151220 (https://phabricator.wikimedia.org/T390942) [14:06:44] ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10859545 (Jhancock.wm) @Fabfur I'm planning on racking them tomorrow. I have a few servers ahead of it in the imaging queue but I'm hoping to have everything wrapped u...
[14:06:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T395241)', diff saved to https://phabricator.wikimedia.org/P76492 and previous config saved to /var/cache/conftool/dbconfig/20250527-140658-fceratto.json [14:07:18] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [14:07:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T395241)', diff saved to https://phabricator.wikimedia.org/P76493 and previous config saved to /var/cache/conftool/dbconfig/20250527-140725-fceratto.json [14:07:29] (CR) ZhaoFJx: [C:+1] Allow itwiki bureaucrat to remove sysop permission [mediawiki-config] - https://gerrit.wikimedia.org/r/1148299 (https://phabricator.wikimedia.org/T394752) (owner: SimmeD) [14:07:32] !log mszabo@deploy1003 mszabo, tchanders: Backport for [[gerrit:1149699|Temp accounts: Allow sysop to grant and revoke IP reveal (T390942)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:07:38] (03CR) 10Tchanders: Temp accounts: Allow sysop to grant and revoke IP reveal (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149699 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [14:07:39] (03PS5) 10Ayounsi: Initial cluster config for ganeti03 [puppet] - 10https://gerrit.wikimedia.org/r/1151209 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [14:07:39] (03PS1) 10Ayounsi: profile::ganeti: add magru to v6_prefixes [puppet] - 10https://gerrit.wikimedia.org/r/1151221 (https://phabricator.wikimedia.org/T394263) [14:07:48] (03CR) 10Brouberol: [C:03+2] admin/data: create an airflow-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1150654 (https://phabricator.wikimedia.org/T395125) (owner: 10Brouberol) [14:08:05] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:08:11] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:08:26] (03CR) 10Tchanders: "Not entirely sure if this is needed, but putting it up in case it is." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151220 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [14:09:10] checking [14:09:24] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10859557 (10Jhancock.wm) a:05jijiki→03Jhancock.wm [14:10:10] !log mszabo@deploy1003 mszabo, tchanders: Continuing with sync [14:10:35] (03CR) 10Jelto: "A ready-only mode would be nice. Although the last commits are quite some time ago. I hope it's still supported." 
[cookbooks] - https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) (owner: Arnaudb) [14:11:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T395241)', diff saved to https://phabricator.wikimedia.org/P76494 and previous config saved to /var/cache/conftool/dbconfig/20250527-141104-fceratto.json [14:11:34] (CR) CI reject: [V:-1] Temp accounts: Remove temporary-account-viewer from labs config [mediawiki-config] - https://gerrit.wikimedia.org/r/1151220 (https://phabricator.wikimedia.org/T390942) (owner: Tchanders) [14:12:11] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:05] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:13:06] (PS2) Tchanders: Temp accounts: Remove temporary-account-viewer from labs config [mediawiki-config] - https://gerrit.wikimedia.org/r/1151220 (https://phabricator.wikimedia.org/T390942) [14:13:42] mszabo: I'm seeing beta wikis having the user right for the admin group: https://en.wikipedia.beta.wmflabs.org/wiki/Special:ListGroupRights [14:13:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T395241)', diff saved to https://phabricator.wikimedia.org/P76495 and previous config saved to /var/cache/conftool/dbconfig/20250527-141346-fceratto.json [14:13:58] For some reason the message is defined there even though CheckUser is not loaded(?)
[14:14:01] That seems like a bug [14:14:39] Hang on that's not related to this change [14:15:17] So was present before these changes [14:16:05] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:16:11] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:17:21] !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1149699|Temp accounts: Allow sysop to grant and revoke IP reveal (T390942)]] (duration: 12m 00s) [14:17:26] T390942: Allow IP viewer temporary account group to be manually granted on all projects - https://phabricator.wikimedia.org/T390942 [14:17:36] (03PS18) 10Ssingh: conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [14:18:50] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb2002-dev.codfw.wmnet with OS bookworm [14:19:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:20:25] (03PS1) 10Bking: Data-platform: Route cirrussearch/elastic alerts to DPE SRE [puppet] - 10https://gerrit.wikimedia.org/r/1151225 (https://phabricator.wikimedia.org/T395309) [14:21:35] (03PS1) 10Vgutierrez: hiera: Enable edge uniques in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1151226 (https://phabricator.wikimedia.org/T391411) [14:21:49] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 4 CORE_DIFF 5): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [14:22:40] (03CR) 10CI reject: [V:04-1] hiera: Enable edge uniques in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1151226 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:22:54] hmm, mw-web getting hot. rps up [14:22:57] (03PS2) 10Vgutierrez: hiera: Enable edge uniques in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1151226 (https://phabricator.wikimedia.org/T391411) [14:23:22] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151226 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:24:29] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 87613656 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:25:29] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7183664 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:26:10] (03PS1) 10Federico Ceratto: sre.mysql.upgrade Make --task-id optional [cookbooks] - 10https://gerrit.wikimedia.org/r/1151219 (https://phabricator.wikimedia.org/T395325) [14:26:10] (03CR) 10Federico Ceratto: [C:03+1] "A simple change as described" [cookbooks] - 10https://gerrit.wikimedia.org/r/1151219 (https://phabricator.wikimedia.org/T395325) (owner: 10Federico Ceratto) [14:26:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P76496 and previous config saved to /var/cache/conftool/dbconfig/20250527-142612-fceratto.json [14:26:20] (03CR) 10Ssingh: [C:03+1] hiera: Enable edge uniques in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1151226 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) 
[14:27:44] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [14:27:46] !log klausman@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1008.eqiad.wmnet with reason: host reimage [14:27:54] (CR) Gehel: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1151225 (https://phabricator.wikimedia.org/T395309) (owner: Bking) [14:28:04] (CR) Vgutierrez: [C:+2] hiera: Enable edge uniques in ulsfo [puppet] - https://gerrit.wikimedia.org/r/1151226 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez) [14:28:38] ok to merge Brouberol: admin/data: create an airflow-deployers group (a6d1cd9689) :? [14:28:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P76497 and previous config saved to /var/cache/conftool/dbconfig/20250527-142852-fceratto.json [14:30:48] SRE, SRE-Access-Requests, Infrastructure-Foundations, netbox: Selena can't see objects in Netbox despite having wmf group membership - https://phabricator.wikimedia.org/T395172#10859608 (SDeckelmann-WMF) I logged out of both and logged back in, and that fixed it! I did logout from idp.wikimedia...
[14:31:01] (03CR) 10Bking: [C:03+2] Data-platform: Route cirrussearch/elastic alerts to DPE SRE [puppet] - 10https://gerrit.wikimedia.org/r/1151225 (https://phabricator.wikimedia.org/T395309) (owner: 10Bking) [14:32:51] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1008.eqiad.wmnet with reason: host reimage [14:33:34] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:34:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.46% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:34:32] !log Deploy schema change on s7 eqiad dbmaint T395333 [14:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:37] T395333: Update gb_id to be unsigned in the globalblocks table on WMF production - https://phabricator.wikimedia.org/T395333 [14:35:01] (03PS1) 10Brouberol: Airflow: don't deploy the plain envoy service in a devenv [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151229 [14:36:15] (03CR) 10Cathal Mooney: [C:03+1] profile::ganeti: add magru to v6_prefixes [puppet] - 10https://gerrit.wikimedia.org/r/1151221 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [14:37:21] (03PS1) 10Lucas Werkmeister (WMDE): wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151230 (https://phabricator.wikimedia.org/T393089) [14:41:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P76498 and previous config saved to /var/cache/conftool/dbconfig/20250527-144119-fceratto.json [14:42:43] (03CR) 10Tiziano Fogli: [C:03+1] Add alerting for network side routed Ganeti BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151200 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [14:43:13] (03CR) 10Hasan Akgün (WMDE): [C:03+1] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151230 (https://phabricator.wikimedia.org/T393089) (owner: 10Lucas Werkmeister (WMDE)) [14:43:40] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151230 (https://phabricator.wikimedia.org/T393089) (owner: 10Lucas Werkmeister (WMDE)) [14:44:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P76499 and previous config saved to /var/cache/conftool/dbconfig/20250527-144359-fceratto.json [14:44:50] https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Deploying_with_helmfile question, if anyone’s listening: [14:45:04] if I don’t need to test between staging/eqiad/codfw clusters, can I just leave out the -e flag altogether to deploy to all of them at once? 
[14:45:14] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151230 (https://phabricator.wikimedia.org/T393089) (owner: 10Lucas Werkmeister (WMDE)) [14:45:24] (use case: wikidata-query-gui, where AFAIK the staging cluster isn’t reachable at all) [14:46:01] hm, “defaults to "default"” from the --help sounds like it might not do the right thing :S [14:46:55] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1110 to cirrussearch1110 [14:46:55] (03PS1) 10Santiago Faci: xLab: Reduce staging/production logging level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151232 (https://phabricator.wikimedia.org/T394425) [14:47:08] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:47:08] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [14:47:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 814.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:47:23] *whoa* that’s a big diff [14:47:26] * Lucas_WMDE looks through ops-l backlog [14:48:54] I’m guessing T391333 is what I’m seeing [14:48:55] T391333: Revisit default envoy histogram buckets - https://phabricator.wikimedia.org/T391333 [14:49:13] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [14:49:38] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [14:50:02] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [14:50:15] !log bking@cumin2002 START - 
Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1110 to cirrussearch1110 - bking@cumin2002" [14:50:27] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [14:50:34] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1110 to cirrussearch1110 - bking@cumin2002" [14:50:34] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:50:35] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1110 on all recursors [14:50:38] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1110 on all recursors [14:50:39] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1110 [14:51:06] Lucas_WMDE: can I just leave out the -e flag << No you can't [14:51:08] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1110 [14:51:13] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [14:51:14] claime: ok, thanks [14:51:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1110 to cirrussearch1110 [14:52:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 814.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:52:45] Lucas_WMDE: I’m guessing T391333 is what I’m seeing << very probably yes [14:52:48] !log bking@cumin2002 START - Cookbook 
sre.hosts.reimage for host cirrussearch1110.eqiad.wmnet with OS bullseye [14:52:53] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1110 [14:52:53] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1110 [14:54:26] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [14:55:01] (03CR) 10BCornwall: [C:03+1] templates: lower TTLs for dyna.wm.org and upload.wm.org to 180s [dns] - 10https://gerrit.wikimedia.org/r/1150701 (https://phabricator.wikimedia.org/T394312) (owner: 10Ssingh) [14:56:22] (03PS1) 10Ayounsi: Add alerting for idle peering BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151235 (https://phabricator.wikimedia.org/T388641) [14:56:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T395241)', diff saved to https://phabricator.wikimedia.org/P76500 and previous config saved to /var/cache/conftool/dbconfig/20250527-145626-fceratto.json [14:56:45] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [14:56:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T395241)', diff saved to https://phabricator.wikimedia.org/P76501 and previous config saved to /var/cache/conftool/dbconfig/20250527-145651-fceratto.json [14:57:44] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [14:58:35] !log elukey@cumin1002 START - Cookbook 
sre.k8s.pool-depool-node pool for host ml-serve1006.eqiad.wmnet [14:58:36] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1006.eqiad.wmnet [14:59:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T395241)', diff saved to https://phabricator.wikimedia.org/P76502 and previous config saved to /var/cache/conftool/dbconfig/20250527-145906-fceratto.json [14:59:26] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [14:59:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T395241)', diff saved to https://phabricator.wikimedia.org/P76503 and previous config saved to /var/cache/conftool/dbconfig/20250527-145933-fceratto.json [15:00:04] jelto, arnoldokoth, and mutante: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for SRE Collaboration Services office hours . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T1500). [15:02:51] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:03:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T395241)', diff saved to https://phabricator.wikimedia.org/P76504 and previous config saved to /var/cache/conftool/dbconfig/20250527-150325-fceratto.json [15:05:46] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1148490 (https://phabricator.wikimedia.org/T378479) (owner: 10RLazarus) [15:05:48] (03CR) 10Scott French: [C:03+2] deployment_server: Call into the mwscript helper from mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1148490 (https://phabricator.wikimedia.org/T378479) (owner: 10RLazarus) [15:06:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T395241)', diff saved to https://phabricator.wikimedia.org/P76505 and previous config saved to /var/cache/conftool/dbconfig/20250527-150602-fceratto.json [15:06:51] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:39] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10859791 (10BCornwall) Hi, @RobH! Would it help if I were to get in contact with Dell to troubleshoot? 
[15:08:51] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:09:18] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1110.eqiad.wmnet with reason: host reimage [15:09:52] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1008.eqiad.wmnet with OS bookworm [15:10:23] (03PS1) 10Krinkle: noc: Fix invalid `max-age: 300` syntax to `max-age=300` in fileserve.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151236 (https://phabricator.wikimedia.org/T341859) [15:12:01] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1110.eqiad.wmnet with reason: host reimage [15:12:16] (03CR) 10Hamish: [C:03+1] Allow itwiki bureaucrat to remove sysop permission [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148299 (https://phabricator.wikimedia.org/T394752) (owner: 10SimmeD) [15:12:34] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1008.eqiad.wmnet [15:12:35] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1008.eqiad.wmnet [15:13:42] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1007.eqiad.wmnet [15:13:51] (03PS2) 10Klausman: role::ml_k8s::worker: upgrade ml-serve1007 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1151237 (https://phabricator.wikimedia.org/T387854) [15:14:10] !log marostegui@cumin1002 START - Cookbook sre.mysql.pool db2191 gradually with 4 steps - Pool db2191.codfw.wmnet in after cloning [15:14:51] RECOVERY - Hadoop NodeManager on an-worker1148 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process 
[15:15:13] (03CR) 10Hnowlan: [C:03+1] pcs/RB sunset: Remove unnecessary definition rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151177 (owner: 10Jgiannelos) [15:17:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:18:04] (03CR) 10Dduvall: "@mmuhlenhoff@wikimedia.org thoughts on the above?" [puppet] - 10https://gerrit.wikimedia.org/r/1146091 (https://phabricator.wikimedia.org/T392526) (owner: 10Dduvall) [15:18:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P76507 and previous config saved to /var/cache/conftool/dbconfig/20250527-151832-fceratto.json [15:21:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P76508 and previous config saved to /var/cache/conftool/dbconfig/20250527-152110-fceratto.json [15:23:47] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1007.eqiad.wmnet [15:24:36] (03CR) 10Dzahn: "ACK, thanks all! I just consider this on hold for the actual decision on the ticket." 
[dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn) [15:28:06] (03CR) 10Klausman: [C:03+2] role::ml_k8s::worker: upgrade ml-serve1007 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1151237 (https://phabricator.wikimedia.org/T387854) (owner: 10Klausman) [15:28:52] (03CR) 10Jforrester: [C:03+1] noc: Fix invalid `max-age: 300` syntax to `max-age=300` in fileserve.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151236 (https://phabricator.wikimedia.org/T341859) (owner: 10Krinkle) [15:29:06] (03PS1) 10Cyndywikime: Config: Enable starter difficulty for newcomer tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151244 (https://phabricator.wikimedia.org/T393769) [15:30:06] (03PS2) 10Jforrester: Wikifunctions: Enable Wikifunction client mode on the first five Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148423 (https://phabricator.wikimedia.org/T390552) [15:30:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148951 (https://phabricator.wikimedia.org/T391913) (owner: 10Jforrester) [15:30:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148423 (https://phabricator.wikimedia.org/T390552) (owner: 10Jforrester) [15:31:17] (03Merged) 10jenkins-bot: [wikifunctions] Don't grant new generic-enum rights to Functioneers for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148951 (https://phabricator.wikimedia.org/T391913) (owner: 10Jforrester) [15:31:24] (03Merged) 10jenkins-bot: Wikifunctions: Enable Wikifunction client mode on the first five Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148423 (https://phabricator.wikimedia.org/T390552) (owner: 10Jforrester) [15:31:46] !log jforrester@deploy1003 Started scap sync-world: Backport for 
[[gerrit:1148951|[wikifunctions] Don't grant new generic-enum rights to Functioneers for now (T391913)]], [[gerrit:1148423|Wikifunctions: Enable Wikifunction client mode on the first five Wiktionaries (T390552)]] [15:31:52] T391913: [PHP]: Add rights for creation & editing of lightweight enum types - https://phabricator.wikimedia.org/T391913 [15:31:52] T390552: Make embedded Wikifunctions available in at least five more Wikimedia projects, to learn from other languages and communities - https://phabricator.wikimedia.org/T390552 [15:32:54] !log klausman@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1007.eqiad.wmnet with OS bookworm [15:33:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P76510 and previous config saved to /var/cache/conftool/dbconfig/20250527-153339-fceratto.json [15:33:55] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1148951|[wikifunctions] Don't grant new generic-enum rights to Functioneers for now (T391913)]], [[gerrit:1148423|Wikifunctions: Enable Wikifunction client mode on the first five Wiktionaries (T390552)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:34:16] (03PS11) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (https://phabricator.wikimedia.org/T341984) [15:35:25] !log jforrester@deploy1003 jforrester: Continuing with sync [15:36:06] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10859892 (10bking) I committed to providing a status update for this ticket in our DPE SRE standup today. So here goes! We need to know: - /... 
[15:36:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P76511 and previous config saved to /var/cache/conftool/dbconfig/20250527-153618-fceratto.json [15:36:33] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10859894 (10bking) [15:36:45] (03CR) 10Tchanders: "We're ready to do this: T386492#10859888" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142649 (https://phabricator.wikimedia.org/T386492) (owner: 10Tchanders) [15:37:13] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:38:33] (03CR) 10Tchanders: "We could schedule it for next week, to roughly align with I4e987677bf5b97c8af55aef24cffc0f17258eeb3 riding the train." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142649 (https://phabricator.wikimedia.org/T386492) (owner: 10Tchanders) [15:41:35] (03CR) 10Cathal Mooney: [C:03+1] Add alerting for idle peering BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151235 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [15:42:36] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148951|[wikifunctions] Don't grant new generic-enum rights to Functioneers for now (T391913)]], [[gerrit:1148423|Wikifunctions: Enable Wikifunction client mode on the first five Wiktionaries (T390552)]] (duration: 10m 50s) [15:42:44] T391913: [PHP]: Add rights for creation & editing of lightweight enum types - https://phabricator.wikimedia.org/T391913 [15:42:45] T390552: Make embedded Wikifunctions available in at least five more Wikimedia projects, to learn from other languages and communities - https://phabricator.wikimedia.org/T390552 [15:44:13] jforrester: Okay for me to deploy a scap update? [15:44:42] James_F: ^ [15:44:48] Yeah, I'm all done. [15:45:15] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1110.eqiad.wmnet with OS bullseye [15:45:30] thx [15:45:37] !log dancy@deploy1003 Installing scap version "4.170.0" for 2 host(s) [15:45:41] (03CR) 10Alexandros Kosiaris: [C:03+1] noc: Fix invalid `max-age: 300` syntax to `max-age=300` in fileserve.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151236 (https://phabricator.wikimedia.org/T341859) (owner: 10Krinkle) [15:45:48] (03CR) 10Marostegui: [C:03+1] "Thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1151219 (https://phabricator.wikimedia.org/T395325) (owner: 10Federico Ceratto) [15:47:28] !log dancy@deploy1003 Installation of scap version "4.170.0" completed for 2 hosts [15:48:25] jouncebot: nowandnext [15:48:25] For the next 0 hour(s) and 11 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T1500) [15:48:25] In 0 hour(s) and 11 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T1600) [15:48:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T395241)', diff saved to https://phabricator.wikimedia.org/P76513 and previous config saved to /var/cache/conftool/dbconfig/20250527-154846-fceratto.json [15:49:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [15:49:11] unless there are any objections, I'd like to run a "noop" scap deployment to validate things work as expected with the release of 4.170.0 [15:49:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T395241)', diff saved to https://phabricator.wikimedia.org/P76514 and previous config saved to /var/cache/conftool/dbconfig/20250527-154912-fceratto.json [15:49:55] 06SRE, 06Infrastructure-Foundations, 10netops: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021#10859945 (10Jhancock.wm) @cmooney got it set and confirmed it pings [15:50:40] 06SRE, 06Infrastructure-Foundations, 10netops: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021#10859956 (10cmooney) >>! In T394021#10859945, @Jhancock.wm wrote: > @cmooney got it set and confirmed it pings awesome, thank you! 
[15:51:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T395241)', diff saved to https://phabricator.wikimedia.org/P76515 and previous config saved to /var/cache/conftool/dbconfig/20250527-155125-fceratto.json [15:51:45] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance [15:51:58] !log jynus@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4:00:00 on backup[1001-1003].eqiad.wmnet with reason: Downtime hosts for reboot [15:52:05] alright, I'm going to proceed shortly [15:52:48] !log swfrench@deploy1003 Started scap sync-world: Noop deployment to test scap 4.170.0 - T388761 [15:52:53] T388761: scap needs to be k8s-cluster aware - https://phabricator.wikimedia.org/T388761 [15:54:55] !log klausman@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1007.eqiad.wmnet with reason: host reimage [15:55:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T395241)', diff saved to https://phabricator.wikimedia.org/P76516 and previous config saved to /var/cache/conftool/dbconfig/20250527-155546-fceratto.json [15:56:10] (03PS4) 10Ssingh: trafficserver: explicitly specify user/group for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1091330 [15:56:21] !log swfrench@deploy1003 Finished scap sync-world: Noop deployment to test scap 4.170.0 - T388761 (duration: 04m 03s) [15:56:33] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1103.eqiad.wmnet with OS bullseye [15:56:37] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1103 [15:56:37] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1103 [15:56:53] PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: /srv 282545 MB (3% inode=88%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [15:57:14] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2214.codfw.wmnet with reason: Maintenance [15:57:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T395241)', diff saved to https://phabricator.wikimedia.org/P76517 and previous config saved to /var/cache/conftool/dbconfig/20250527-155720-fceratto.json [15:57:24] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5689/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091330 (owner: 10Ssingh) [15:57:44] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1007.eqiad.wmnet with reason: host reimage [15:58:32] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1151221 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [15:58:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1054-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:58:52] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.upgrade Make --task-id optional [cookbooks] - 10https://gerrit.wikimedia.org/r/1151219 (https://phabricator.wikimedia.org/T395325) (owner: 10Federico Ceratto) [15:59:33] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for Zuul upgrade project - https://phabricator.wikimedia.org/T393873#10860046 (10Dzahn) uhm, yea, that is a mistake. 
not sure how that happened since I think i just went "cursor up" to edit... [15:59:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2191 gradually with 4 steps - Pool db2191.codfw.wmnet in after cloning [15:59:42] FIRING: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:00:05] jhathaway and moritzm: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T1600) [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:42] (03PS2) 10Cyndywikime: Growth-Beta: Enable starter difficulty for newcomer tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151244 (https://phabricator.wikimedia.org/T393769) [16:01:22] (03PS2) 10Scott French: mediawiki: Remove backwards compatibility path for running php directly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148491 (https://phabricator.wikimedia.org/T378479) (owner: 10RLazarus) [16:01:22] (03CR) 10Scott French: "I believe this should wrap up the remainder of the TODOs in [0]. Thanks in advance for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1148491 (https://phabricator.wikimedia.org/T378479) (owner: 10RLazarus) [16:01:39] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1146091 (https://phabricator.wikimedia.org/T392526) (owner: 10Dduvall) [16:01:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150665 (https://phabricator.wikimedia.org/T395193) (owner: 10Anzx) [16:02:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151210 (https://phabricator.wikimedia.org/T393551) (owner: 10Anzx) [16:03:34] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:04:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T395241)', diff saved to https://phabricator.wikimedia.org/P76519 and previous config saved to /var/cache/conftool/dbconfig/20250527-160401-fceratto.json [16:04:42] RESOLVED: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:04:51] moritzm: thanks for the review! 
[16:05:07] (03CR) 10Dzahn: [C:03+2] admin: replace SSH key for seddon [puppet] - 10https://gerrit.wikimedia.org/r/1149736 (https://phabricator.wikimedia.org/T393579) (owner: 10Dzahn) [16:05:52] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2186.codfw.wmnet [16:06:00] yw, sorry this got a little backlogged [16:06:46] oh np. i pivoted to other half-finished things in the meantime :) [16:08:56] (03CR) 10Ecarg: [C:03+1] "thank you" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150666 (https://phabricator.wikimedia.org/T391986) (owner: 10Effie Mouzeli) [16:09:11] marostegui@cumin1002 upgrade (PID 3364633) is awaiting input [16:09:46] (03CR) 10Ecarg: [C:03+2] wikifunctions: enable mcrouter for orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150666 (https://phabricator.wikimedia.org/T391986) (owner: 10Effie Mouzeli) [16:10:02] !log jynus@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4:00:00 on backup[2001-2003].codfw.wmnet,backup1013.eqiad.wmnet with reason: Downtime hosts for reboot [16:10:05] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10860149 (10Dzahn) Hey @Seddon your key has been updated. Within the next half hour it will have been replaced on all systems that had the old key. 
[16:10:28] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10860150 (10Dzahn) 05In progress→03Resolved a:03Dzahn [16:10:40] !log jynus@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4:00:00 on backup2013.codfw.wmnet with reason: Downtime hosts for reboot [16:10:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P76520 and previous config saved to /var/cache/conftool/dbconfig/20250527-161055-fceratto.json [16:11:19] (03Merged) 10jenkins-bot: wikifunctions: enable mcrouter for orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150666 (https://phabricator.wikimedia.org/T391986) (owner: 10Effie Mouzeli) [16:12:03] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1103.eqiad.wmnet with reason: host reimage [16:13:19] (03PS1) 10Vgutierrez: varnish: set SameSite=None for edge unique cookie in upload [puppet] - 10https://gerrit.wikimedia.org/r/1151255 (https://phabricator.wikimedia.org/T391411) [16:13:29] (03CR) 10Clément Goubert: [C:03+1] mediawiki: Remove backwards compatibility path for running php directly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148491 (https://phabricator.wikimedia.org/T378479) (owner: 10RLazarus) [16:13:42] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151255 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:14:13] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:14:33] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1007.eqiad.wmnet with OS bookworm [16:14:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for 
db2186.codfw.wmnet [16:16:09] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1103.eqiad.wmnet with reason: host reimage [16:16:31] (03PS1) 10Bking: elastic/cirrussearch: re-enable monitoring for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1151256 (https://phabricator.wikimedia.org/T388610) [16:16:53] (03PS1) 10Jcrespo: dbbackups: Increase retention of es backups to 18 days [puppet] - 10https://gerrit.wikimedia.org/r/1151257 [16:17:08] (03PS2) 10Vgutierrez: varnish: set SameSite=None for edge unique cookie in upload [puppet] - 10https://gerrit.wikimedia.org/r/1151255 (https://phabricator.wikimedia.org/T391411) [16:19:08] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151255 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:19:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P76521 and previous config saved to /var/cache/conftool/dbconfig/20250527-161909-fceratto.json [16:19:11] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [alerts] - 10https://gerrit.wikimedia.org/r/1151200 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [16:21:33] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1007.eqiad.wmnet [16:21:34] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1007.eqiad.wmnet [16:22:04] (03CR) 10Scott French: [C:03+2] "Resolving this, as it sounds like this should be fine. Thanks, Moritz!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1146091 (https://phabricator.wikimedia.org/T392526) (owner: 10Dduvall) [16:26:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P76522 and previous config saved to /var/cache/conftool/dbconfig/20250527-162602-fceratto.json [16:28:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1067-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:31:39] (03PS1) 10Santiago Faci: xLab: Deploying v0.6.1 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151263 (https://phabricator.wikimedia.org/T372952) [16:33:05] (03CR) 10Bking: [C:04-1] "Do not merge until we actually finish the migration." 
[puppet] - 10https://gerrit.wikimedia.org/r/1151256 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [16:34:00] (03PS2) 10Santiago Faci: xLab: Deploying v0.6.1 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151263 (https://phabricator.wikimedia.org/T372952) [16:34:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P76523 and previous config saved to /var/cache/conftool/dbconfig/20250527-163416-fceratto.json [16:34:38] (03CR) 10Jcrespo: [C:03+2] dbbackups: Increase retention of es backups to 18 days [puppet] - 10https://gerrit.wikimedia.org/r/1151257 (owner: 10Jcrespo) [16:35:29] (03CR) 10Ssingh: [C:03+1] varnish: set SameSite=None for edge unique cookie in upload [puppet] - 10https://gerrit.wikimedia.org/r/1151255 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:36:11] (03CR) 10Clare Ming: [C:03+2] xLab: Deploying v0.6.1 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151263 (https://phabricator.wikimedia.org/T372952) (owner: 10Santiago Faci) [16:36:21] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1103.eqiad.wmnet with OS bullseye [16:36:38] (03PS1) 10David Caro: tools: update prometheus k8s cert [puppet] - 10https://gerrit.wikimedia.org/r/1151264 [16:37:15] (03CR) 10BBlack: [C:03+1] varnish: set SameSite=None for edge unique cookie in upload [puppet] - 10https://gerrit.wikimedia.org/r/1151255 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:37:44] (03Merged) 10jenkins-bot: xLab: Deploying v0.6.1 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151263 (https://phabricator.wikimedia.org/T372952) (owner: 10Santiago Faci) [16:38:10] (03CR) 10Vgutierrez: [C:03+2] varnish: set SameSite=None for edge unique cookie in upload [puppet] - 10https://gerrit.wikimedia.org/r/1151255 (https://phabricator.wikimedia.org/T391411) (owner: 
10Vgutierrez) [16:38:58] (03PS2) 10David Caro: tools: update prometheus k8s cert [puppet] - 10https://gerrit.wikimedia.org/r/1151264 (https://phabricator.wikimedia.org/T395227) [16:40:20] (03CR) 10Majavah: [C:03+1] tools: update prometheus k8s cert [puppet] - 10https://gerrit.wikimedia.org/r/1151264 (https://phabricator.wikimedia.org/T395227) (owner: 10David Caro) [16:40:35] (03CR) 10David Caro: [C:03+2] tools: update prometheus k8s cert [puppet] - 10https://gerrit.wikimedia.org/r/1151264 (https://phabricator.wikimedia.org/T395227) (owner: 10David Caro) [16:41:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T395241)', diff saved to https://phabricator.wikimedia.org/P76524 and previous config saved to /var/cache/conftool/dbconfig/20250527-164110-fceratto.json [16:41:25] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10860346 (10spatton) Approved! 
Thanks for the reminder, @MoritzMuehlenhoff :) [16:41:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [16:41:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T395241)', diff saved to https://phabricator.wikimedia.org/P76525 and previous config saved to /var/cache/conftool/dbconfig/20250527-164136-fceratto.json [16:41:44] (03CR) 10BBlack: [C:03+1] templates: lower TTLs for dyna.wm.org and upload.wm.org to 180s [dns] - 10https://gerrit.wikimedia.org/r/1150701 (https://phabricator.wikimedia.org/T394312) (owner: 10Ssingh) [16:42:33] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [16:42:55] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [16:43:56] !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1022-1025].eqiad.wmnet [16:45:08] Hey folks. We got a request from an event organizer to restore data deleted by mistake from a MW database. This would be a simple UPDATE query that I posted in https://phabricator.wikimedia.org/T395350#10860354. It is a relatively urgent request. Could I go ahead and run the query? [16:47:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T395241)', diff saved to https://phabricator.wikimedia.org/P76526 and previous config saved to /var/cache/conftool/dbconfig/20250527-164757-fceratto.json [16:48:07] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10860400 (10RobH) >>! In T394432#10859892, @bking wrote: > I committed to providing a status update for this ticket in our DPE SRE standup to... 
[16:48:48] !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1022-1025].eqiad.wmnet [16:49:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T395241)', diff saved to https://phabricator.wikimedia.org/P76527 and previous config saved to /var/cache/conftool/dbconfig/20250527-164923-fceratto.json [16:49:43] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2217.codfw.wmnet with reason: Maintenance [16:49:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T395241)', diff saved to https://phabricator.wikimedia.org/P76528 and previous config saved to /var/cache/conftool/dbconfig/20250527-164950-fceratto.json [16:51:40] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host zuul2001.codfw.wmnet with OS bookworm [16:51:47] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for Zuul upgrade project - https://phabricator.wikimedia.org/T393873#10860424 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host zuul2001.co... 
[16:55:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:56:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T395241)', diff saved to https://phabricator.wikimedia.org/P76529 and previous config saved to /var/cache/conftool/dbconfig/20250527-165614-fceratto.json [16:58:02] (03PS1) 10Scott French: aptrepo: temporarily remove failing docker-(bookworm|trixie) updates [puppet] - 10https://gerrit.wikimedia.org/r/1151270 (https://phabricator.wikimedia.org/T392526) [16:58:21] (03CR) 10Jasmine: [C:03+2] wikikube: decommission wikikube-worker102[2-5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1114788 (https://phabricator.wikimedia.org/T383227) (owner: 10Jasmine) [16:58:43] (03CR) 10BPirkle: [C:03+1] trafficserver: restbaseless reading lists API for ~group1 [puppet] - 10https://gerrit.wikimedia.org/r/1149624 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan) [16:58:48] (03CR) 10BPirkle: [C:03+1] trafficserver: restbaseless reading lists API for all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1149625 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan) [16:59:18] (03CR) 10Dduvall: [C:03+1] aptrepo: temporarily remove failing docker-(bookworm|trixie) updates [puppet] - 10https://gerrit.wikimedia.org/r/1151270 (https://phabricator.wikimedia.org/T392526) (owner: 10Scott French) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T1700) [17:00:51] (03CR) 10Dzahn: [C:03+1] airflow-dev: make kubeconfig group-owned by the airflow-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1150655 (https://phabricator.wikimedia.org/T395125) (owner: 10Brouberol) [17:01:41] (03CR) 10Scott French: [C:03+2] 
aptrepo: temporarily remove failing docker-(bookworm|trixie) updates [puppet] - 10https://gerrit.wikimedia.org/r/1151270 (https://phabricator.wikimedia.org/T392526) (owner: 10Scott French) [17:01:45] !log restore row per request T395350 [17:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:54] T395350: Restore event registration accidentally deleted while moving pages - https://phabricator.wikimedia.org/T395350 [17:02:42] marostegui@cumin1002 clone (PID 3220522) is awaiting input [17:03:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P76530 and previous config saved to /var/cache/conftool/dbconfig/20250527-170304-fceratto.json [17:03:54] !log jasmine@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1022-1025].eqiad.wmnet [17:05:23] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:05:23] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:07:31] jasmine@cumin1002 decommission (PID 3422444) is awaiting input [17:08:07] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on zuul2001.codfw.wmnet with reason: host reimage [17:08:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140976 (owner: 10Jforrester) [17:08:58] (03PS3) 10Jforrester: [BETA CLUSTER] Close en_rtlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140976 [17:11:22] !log fceratto@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P76531 and previous config saved to /var/cache/conftool/dbconfig/20250527-171121-fceratto.json [17:11:50] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul2001.codfw.wmnet with reason: host reimage [17:13:13] (03PS1) 10AOkoth: doc: fix php version for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1151273 (https://phabricator.wikimedia.org/T392130) [17:15:52] (03PS1) 10Ebernhardson: cirrus: re-enable daily completion suggester builds in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1151274 [17:16:04] (03PS2) 10AOkoth: doc: fix php version for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1151273 (https://phabricator.wikimedia.org/T392130) [17:16:17] FIRING: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:17:49] (03PS2) 10Bking: cirrus: re-enable daily completion suggester builds in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1151274 (https://phabricator.wikimedia.org/T391350) (owner: 10Ebernhardson) [17:18:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P76532 and previous config saved to /var/cache/conftool/dbconfig/20250527-171812-fceratto.json [17:20:28] (03PS3) 10AOkoth: doc: fix php version for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1151273 (https://phabricator.wikimedia.org/T392130) [17:20:41] (03CR) 10CI reject: [V:04-1] cirrus: re-enable daily completion suggester builds in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1151274 (https://phabricator.wikimedia.org/T391350) (owner: 10Ebernhardson) [17:21:15] (03PS3) 10Bking: 
cirrus: re-enable daily completion suggester builds in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1151274 (https://phabricator.wikimedia.org/T391350) (owner: 10Ebernhardson) [17:22:24] (03CR) 10Scott French: "Thanks, all!" [puppet] - 10https://gerrit.wikimedia.org/r/1149505 (https://phabricator.wikimedia.org/T395052) (owner: 10Scott French) [17:22:29] (03CR) 10Scott French: [C:03+2] profile::prometheus::k8s: drop terminated pod targets [puppet] - 10https://gerrit.wikimedia.org/r/1149505 (https://phabricator.wikimedia.org/T395052) (owner: 10Scott French) [17:25:17] (03CR) 10Bking: [C:03+2] cirrus: re-enable daily completion suggester builds in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1151274 (https://phabricator.wikimedia.org/T391350) (owner: 10Ebernhardson) [17:25:25] !log jasmine@cumin1002 START - Cookbook sre.dns.netbox [17:26:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P76533 and previous config saved to /var/cache/conftool/dbconfig/20250527-172629-fceratto.json [17:27:01] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:28:08] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [17:28:28] (03PS6) 10Dzahn: scap: stop hardcoding scap user home to fix puppet breakage [puppet] - 10https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) [17:28:34] FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:28:51] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:28:54] 
!log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host zuul2001.codfw.wmnet with OS bookworm [17:28:57] RECOVERY - Squid on install1004 is OK: TCP OK - 0.004 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [17:29:03] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for Zuul upgrade project - https://phabricator.wikimedia.org/T393873#10860686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host zuul2001.codfw.... [17:29:52] (03PS1) 10Ilias Sarantopoulos: Revert^2 "ores-extension: enable ores extention UI in idwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151276 [17:30:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151276 (owner: 10Ilias Sarantopoulos) [17:30:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148950 (https://phabricator.wikimedia.org/T394054) (owner: 10Arlolra) [17:31:02] jasmine@cumin1002 decommission (PID 3422444) is awaiting input [17:31:17] (03PS1) 10Fabfur: external_cloud_vendors: add bingbot [puppet] - 10https://gerrit.wikimedia.org/r/1151277 (https://phabricator.wikimedia.org/T395358) [17:33:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T395241)', diff saved to https://phabricator.wikimedia.org/P76534 and previous config saved to /var/cache/conftool/dbconfig/20250527-173319-fceratto.json [17:33:34] RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:33:59] (03CR) 10Kgraessle: [C:03+1] Revert^2 "ores-extension: enable ores extention UI in idwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151276 (owner: 10Ilias Sarantopoulos) [17:34:37] (03CR) 10Vgutierrez: [C:03+1] external_cloud_vendors: add bingbot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1151277 (https://phabricator.wikimedia.org/T395358) (owner: 10Fabfur) [17:35:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [17:39:05] !log jasmine@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1022-1025].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" [17:41:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T395241)', diff saved to https://phabricator.wikimedia.org/P76535 and previous config saved to /var/cache/conftool/dbconfig/20250527-174137-fceratto.json [17:41:40] !log jasmine@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1022-1025].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" [17:41:41] !log jasmine@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:41:41] !log jasmine@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-worker[1022-1025].eqiad.wmnet [17:41:48] (03PS1) 10Dzahn: site/zuul: create skeleton role/profile for new zuul executors/runners [puppet] - 10https://gerrit.wikimedia.org/r/1151279 (https://phabricator.wikimedia.org/T394819) [17:41:57] !log fceratto@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2224.codfw.wmnet with reason: Maintenance [17:42:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T395241)', diff saved to https://phabricator.wikimedia.org/P76536 and previous config saved to /var/cache/conftool/dbconfig/20250527-174204-fceratto.json [17:47:16] (03PS1) 10Kimberly Sarabia: Deploy summaries pilot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151280 (https://phabricator.wikimedia.org/T393940) [17:48:04] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 3 others: eqiad/codfw: 6 VM request for Zuul upgrade project - https://phabricator.wikimedia.org/T393873#10860807 (10Dzahn) zuul2001 has been reimaged with bookworm now. [17:48:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T395241)', diff saved to https://phabricator.wikimedia.org/P76537 and previous config saved to /var/cache/conftool/dbconfig/20250527-174837-fceratto.json [17:50:05] (03CR) 10Dzahn: [C:04-1] "missing the "puppet7" hiera keys again :)" [puppet] - 10https://gerrit.wikimedia.org/r/1151279 (https://phabricator.wikimedia.org/T394819) (owner: 10Dzahn) [17:50:43] PROBLEM - Disk space on restbase1031 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 66940 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1031&var-datasource=eqiad+prometheus/ops [17:51:10] (03PS2) 10Dzahn: site/zuul: create skeleton role/profile for new zuul executors/runners [puppet] - 10https://gerrit.wikimedia.org/r/1151279 (https://phabricator.wikimedia.org/T394819) [17:51:23] (03PS2) 10Kimberly Sarabia: Deploy summaries pilot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151280 (https://phabricator.wikimedia.org/T393940) [17:55:49] (03CR) 10Ssingh: [C:03+2] templates: lower TTLs for dyna.wm.org and upload.wm.org to 180s [dns] - 
10https://gerrit.wikimedia.org/r/1150701 (https://phabricator.wikimedia.org/T394312) (owner: 10Ssingh) [17:55:55] !log sukhe@dns1004 START - running authdns-update [17:56:11] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:56:33] !log sukhe@dns1004 END - running authdns-update [17:56:35] !log finished running authdns-update for lowering dyna/upload TTL to 180s: T394312 [17:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:40] T394312: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312 [17:56:43] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:57:03] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:57:27] (03PS1) 10Dzahn: zuul::main: stop installing python docker-compose package [puppet] - 10https://gerrit.wikimedia.org/r/1151281 (https://phabricator.wikimedia.org/T393873) [17:57:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:58:07] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1151279/5691/zuul1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1151279 (https://phabricator.wikimedia.org/T394819) (owner: 10Dzahn) [17:58:22] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for 
Neslihan_Turan_WMDE - https://phabricator.wikimedia.org/T394395#10860887 (10Milimetric) approved from our side (data engineering as stewards of the data) [17:58:38] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics-privatedata-users for Neslihan_Turan_WMDE - https://phabricator.wikimedia.org/T394395#10860889 (10Milimetric) [17:59:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 195.200.68.151 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:59:53] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10860892 (10Milimetric) (I don't think we need to action this further, but I may be forgetting some steps, do ping us if so) [18:00:05] dancy and andre: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T1800). [18:00:33] 06SRE, 06Traffic, 13Patch-For-Review: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312#10860899 (10ssingh) [18:00:43] (03PS1) 10ZhaoFJx: cowikimedia: Enable Translate&Notifications Exten. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151284 (https://phabricator.wikimedia.org/T386776) [18:01:19] 06SRE, 06Traffic, 13Patch-For-Review: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312#10860906 (10ssingh) 05Open→03Resolved a:03ssingh [18:02:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:02:50] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 3 others: eqiad/codfw: 6 VM request for Zuul upgrade project - https://phabricator.wikimedia.org/T393873#10860911 (10Dzahn) new roles created and added to https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventi... [18:03:12] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 3 others: eqiad/codfw: 6 VM request for Zuul upgrade project - https://phabricator.wikimedia.org/T393873#10860913 (10Dzahn) 05Open→03Resolved [18:03:29] (03CR) 10Dzahn: [C:03+2] zuul::main: stop installing python docker-compose package [puppet] - 10https://gerrit.wikimedia.org/r/1151281 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [18:03:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P76538 and previous config saved to /var/cache/conftool/dbconfig/20250527-180344-fceratto.json [18:08:34] (03PS1) 10Dzahn: zuul: add contint-roots group to new zuul skeleton roles [puppet] - 10https://gerrit.wikimedia.org/r/1151285 (https://phabricator.wikimedia.org/T394819) [18:08:47] (03CR) 10CI reject: [V:04-1] zuul: add contint-roots group to new zuul skeleton roles [puppet] - 10https://gerrit.wikimedia.org/r/1151285 (https://phabricator.wikimedia.org/T394819) (owner: 10Dzahn) [18:09:20] 
(03PS2) 10Dzahn: zuul: add contint-roots group to new zuul skeleton roles [puppet] - 10https://gerrit.wikimedia.org/r/1151285 (https://phabricator.wikimedia.org/T394819) [18:10:53] o/ Train stuff [18:11:12] !log zuul1001/zuul2001: sudo apt-get remove --purge docker-compose; sudo apt auto-remove [18:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:42] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151286 (https://phabricator.wikimedia.org/T392173) [18:11:44] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151286 (https://phabricator.wikimedia.org/T392173) (owner: 10TrainBranchBot) [18:12:04] (03CR) 10Dzahn: [C:03+2] "< mutante> !log zuul1001/zuul2001: sudo apt-get remove --purge docker-compose; sudo apt auto-remove" [puppet] - 10https://gerrit.wikimedia.org/r/1151281 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [18:12:29] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151286 (https://phabricator.wikimedia.org/T392173) (owner: 10TrainBranchBot) [18:12:57] (03CR) 10Dzahn: [C:03+2] zuul: add contint-roots group to new zuul skeleton roles [puppet] - 10https://gerrit.wikimedia.org/r/1151285 (https://phabricator.wikimedia.org/T394819) (owner: 10Dzahn) [18:18:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P76539 and previous config saved to /var/cache/conftool/dbconfig/20250527-181852-fceratto.json [18:21:48] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.3 refs T392173 [18:21:53] T392173: 1.45.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T392173 [18:22:07] 06SRE, 10SRE-Access-Requests, 06collaboration-services, 10Continuous-Integration-Infrastructure (Zuul upgrade), 13Patch-For-Review: give 
contint-roots access to new zuul VMs (was: create new admin group for "zuul devs") - https://phabricator.wikimedia.org/T394819#10860993 (10Dzahn) 05In progress→0... [18:23:40] (03PS4) 10AOkoth: doc: fix php version for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1151273 (https://phabricator.wikimedia.org/T392130) [18:23:50] (03CR) 10Jdlrobson: [C:03+1] "LGTM! Thanks Kim!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151280 (https://phabricator.wikimedia.org/T393940) (owner: 10Kimberly Sarabia) [18:26:31] (03PS5) 10AOkoth: doc: add php8.1 support for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1151273 (https://phabricator.wikimedia.org/T392130) [18:27:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:27:51] (03CR) 10Cory Massaro: [C:03+1] "Leaving to Grace to +2" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149633 (https://phabricator.wikimedia.org/T391986) (owner: 10Effie Mouzeli) [18:32:10] RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [18:32:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:33:34] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:33:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T395241)', diff saved to 
https://phabricator.wikimedia.org/P76540 and previous config saved to /var/cache/conftool/dbconfig/20250527-183358-fceratto.json [18:34:02] hmm that's Telxius [18:34:14] topranks: ^ [18:34:21] sorry for the late [18:34:36] that's the magru transport [18:34:38] * topranks looking [18:34:41] <3 [18:36:16] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 84085MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [18:36:24] yeah it flipped to edgeuno ok [18:36:25] https://grafana.wikimedia.org/goto/rAEsHcfNg [18:36:28] RECOVERY - ElasticSearch unassigned shard check - 9443 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [18:37:00] re: disk space T394955 [18:37:00] T394955: when servers are about to run out of disk, monitoring should notify the owners - https://phabricator.wikimedia.org/T394955 [18:37:12] topranks: ok nice. anything to do from our end? sorry for the late ping I said but even for future. [18:37:41] nah no it's ok not that late [18:37:49] I don't see anything in the maintenance calendar [18:37:57] I'll leave it a while, if it doesn't recover raise a ticket with them [18:38:41] ok thank you! [18:39:36] link is "up" on the routers at both sides but yeah no frames are passing [18:39:49] the maint-announce inbox has unprocessed emails all the way back to March. so it might not be a surprise it's not on the calendar and there is an issue with clinic duties [18:40:01] so definitely a carrier issue let's see how it goes, no need to panic [18:40:12] topranks: <3 [18:40:24] oh wait, the docs suddenly link to the sre group .. 
wtf [18:40:25] mutante: thanks yeah I did a search for emails too, they did have a planned work last night / early hours of today but it doesn't line up with the timing of this [18:40:42] topranks: gotcha!, ok, good [18:40:46] RECOVERY - ElasticSearch unassigned shard check - 9643 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [18:41:28] I take that back. it's "only" back to May 9th [18:41:33] (noted for future as well, as long as I guess it flips over to the other one I will just file a task in the future. I wasn't aware of what you guys usually do. [18:45:19] (03PS1) 10Bernard Wang: Deploy Vector empty search recommendations to wikivoyage, hewiki and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151290 [18:46:18] A task and a ticket with the carrier is basically what to do. But like you say not critical if it fails over ok. [18:46:29] thanks. [18:46:41] But sometimes they come back after 30 minutes you find out after 40 minutes raising a ticket, so I normally wait a little while [18:46:54] ok :) [18:55:35] (03PS2) 10Bernard Wang: Deploy Vector empty search recommendations to wikivoyage, hewiki and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151290 (https://phabricator.wikimedia.org/T393943) [18:59:02] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:59:12] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:59:42] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:02:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:03:39] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for kgraessle - https://phabricator.wikimedia.org/T395370 (10thcipriani) 03NEW [19:04:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 195.200.68.151 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:06:05] sukhe: my laziness pays off :P [19:07:40] !log bking@cumin2002 conftool action : set/pooled=no; selector: name=cirrussearch1060.eqiad.wmnet|cirrussearch1061.eqiad.wmnet|cirrussearch1062.eqiad.wmnet|cirrussearch1063.eqiad.wmnet|cirrussearch1064.eqiad.wmnet|cirrussearch1065.eqiad.wmnet|cirrussearch1066.eqiad.wmnet [19:08:29] (03CR) 10Jdlrobson: Deploy Vector empty search recommendations to wikivoyage, hewiki and itwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151290 (https://phabricator.wikimedia.org/T393943) (owner: 10Bernard Wang) [19:09:03] (03CR) 10Bernard Wang: Deploy Vector empty search recommendations to wikivoyage, hewiki and itwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151290 (https://phabricator.wikimedia.org/T393943) (owner: 10Bernard Wang) [19:09:42] topranks: hahaha [19:09:53] that, or some magic fairy dust you have [19:10:42] PROBLEM - Disk space on restbase1031 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 67935 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1031&var-datasource=eqiad+prometheus/ops [19:10:51] (03PS3) 10Bernard Wang: Deploy Vector empty search recommendations to wikivoyage, hewiki and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151290 (https://phabricator.wikimedia.org/T393943) [19:11:18] (03PS1) 10Bking: 
cirrussearch: add row F, remove soon-to-be-decom hosts [puppet] - 10https://gerrit.wikimedia.org/r/1151294 (https://phabricator.wikimedia.org/T394350) [19:11:58] !log bking@cumin2002 depool cirrussearch106[0-6] T394350 [19:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:04] T394350: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350 [19:22:00] (03PS1) 10Gergő Tisza: Add scrambled: password class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151298 [19:24:12] (03CR) 10Jcrespo: [C:03+1] Add scrambled: password class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151298 (owner: 10Gergő Tisza) [19:27:42] (03PS1) 10GOlson: App Interaction:: Add Tabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151299 [19:28:36] (03PS1) 10Ebernhardson: search: Add dnsdisc entries for omega and psi clusters [puppet] - 10https://gerrit.wikimedia.org/r/1151300 (https://phabricator.wikimedia.org/T143553) [19:30:11] jouncebot: now [19:30:11] For the next 0 hour(s) and 29 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T1800) [19:30:25] is this train window in use? [19:30:49] I have a security related change, not an emergency but would use the window if nothing else is happening [19:31:04] ^ dancy andre [19:31:25] https://versions.toolforge.org/ [19:31:35] Train window is open.. 
go for it [19:32:04] (03PS2) 10Ebernhardson: search: Add dnsdisc entries for omega and psi clusters [puppet] - 10https://gerrit.wikimedia.org/r/1151300 (https://phabricator.wikimedia.org/T143553) [19:32:35] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151300 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:32:50] (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5696/co" [puppet] - 10https://gerrit.wikimedia.org/r/1151300 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:33:42] (03CR) 10Ryan Kemper: [C:03+1] search: Add dnsdisc entries for omega and psi clusters [puppet] - 10https://gerrit.wikimedia.org/r/1151300 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:34:19] thx [19:34:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151298 (owner: 10Gergő Tisza) [19:35:14] (03Merged) 10jenkins-bot: Add scrambled: password class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151298 (owner: 10Gergő Tisza) [19:35:37] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1151298|Add scrambled: password class]] [19:37:42] !log tgr@deploy1003 tgr: Backport for [[gerrit:1151298|Add scrambled: password class]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[19:39:32] !log tgr@deploy1003 tgr: Continuing with sync [19:41:11] (03PS1) 10Ebernhardson: Add search-{psi,omega}.svc.$dc.wmnet cnames [dns] - 10https://gerrit.wikimedia.org/r/1151303 (https://phabricator.wikimedia.org/T143553) [19:41:13] (03PS1) 10Ebernhardson: search: Add search-{psi,omega} geoip discovery entries [dns] - 10https://gerrit.wikimedia.org/r/1151304 [19:41:44] (03CR) 10CI reject: [V:04-1] Add search-{psi,omega}.svc.$dc.wmnet cnames [dns] - 10https://gerrit.wikimedia.org/r/1151303 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:41:48] (03CR) 10CI reject: [V:04-1] search: Add search-{psi,omega} geoip discovery entries [dns] - 10https://gerrit.wikimedia.org/r/1151304 (owner: 10Ebernhardson) [19:43:59] (03PS2) 10Bking: Add search-{psi,omega}.svc.$dc.wmnet cnames [dns] - 10https://gerrit.wikimedia.org/r/1151303 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:44:08] (03PS2) 10Bking: search: Add search-{psi,omega} geoip discovery entries [dns] - 10https://gerrit.wikimedia.org/r/1151304 (owner: 10Ebernhardson) [19:44:37] (03CR) 10CI reject: [V:04-1] Add search-{psi,omega}.svc.$dc.wmnet cnames [dns] - 10https://gerrit.wikimedia.org/r/1151303 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [19:44:52] (03CR) 10CI reject: [V:04-1] search: Add search-{psi,omega} geoip discovery entries [dns] - 10https://gerrit.wikimedia.org/r/1151304 (owner: 10Ebernhardson) [19:46:29] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151298|Add scrambled: password class]] (duration: 10m 52s) [19:47:14] (03PS3) 10Ebernhardson: search: Add search-{psi,omega} geoip discovery entries [dns] - 10https://gerrit.wikimedia.org/r/1151304 [19:47:49] (03CR) 10CI reject: [V:04-1] search: Add search-{psi,omega} geoip discovery entries [dns] - 10https://gerrit.wikimedia.org/r/1151304 (owner: 10Ebernhardson) [19:48:07] (03PS4) 10Bernard Wang: Deploy Vector empty search recommendations to wikivoyage and group 
1 wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151290 (https://phabricator.wikimedia.org/T393943) [19:50:00] (03PS3) 10Ebernhardson: Add search-{psi,omega}.svc.$dc.wmnet cnames [dns] - 10https://gerrit.wikimedia.org/r/1151303 (https://phabricator.wikimedia.org/T143553) [19:50:00] (03PS4) 10Ebernhardson: search: Add search-{psi,omega} geoip discovery entries [dns] - 10https://gerrit.wikimedia.org/r/1151304 [19:50:42] PROBLEM - Disk space on restbase1031 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 62157 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1031&var-datasource=eqiad+prometheus/ops [19:52:49] (03Abandoned) 10Ebernhardson: Add search-chi-https service [puppet] - 10https://gerrit.wikimedia.org/r/1144647 (owner: 10Ebernhardson) [19:53:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151284 (https://phabricator.wikimedia.org/T386776) (owner: 10ZhaoFJx) [19:55:48] (03PS1) 10Bking: etcd data for search-{psi,omega} dns discovery [puppet] - 10https://gerrit.wikimedia.org/r/1151308 (https://phabricator.wikimedia.org/T143553) [19:57:30] (03PS3) 10Bking: search: Add dnsdisc entries for omega and psi clusters [puppet] - 10https://gerrit.wikimedia.org/r/1151300 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T2000) [20:00:05] anzx and ZhaoFJx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:25] o/ [20:02:51] (03CR) 10Jdlrobson: [C:03+1] Deploy Vector empty search recommendations to wikivoyage and group 1 wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151290 (https://phabricator.wikimedia.org/T393943) (owner: 10Bernard Wang) [20:03:35] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:04:25] (03PS4) 10Bking: search: Add dnsdisc entries for omega and psi clusters [puppet] - 10https://gerrit.wikimedia.org/r/1151300 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [20:06:00] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal, AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:06:08] (03PS1) 10Jdlrobson: Enable ReadingList special page on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151313 [20:08:24] Any deployer around? [20:08:38] I can deploy [20:08:54] tgr Thank you [20:08:58] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:09:07] anzx: around? [20:14:13] ZhaoFJx: what's up with https://wikimedia.co/ ? [20:14:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151284 (https://phabricator.wikimedia.org/T386776) (owner: 10ZhaoFJx) [20:15:20] (03Merged) 10jenkins-bot: cowikimedia: Enable Translate&Notifications Exten. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151284 (https://phabricator.wikimedia.org/T386776) (owner: 10ZhaoFJx) [20:15:23] tgr its co.wikimedia.org [20:15:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:15:30] right [20:15:30] for Translation extension [20:15:37] but I figured it would be a redirect [20:15:40] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1151284|cowikimedia: Enable Translate&Notifications Exten. (T386776)]] [20:15:44] T386776: Extension:Translate and Extension:TranslationNotifications on co.wikimedia.org - https://phabricator.wikimedia.org/T386776 [20:15:55] that can't possibly be Wikimedia affiliated, right? [20:16:11] Yeah, its a co, not org [20:17:47] !log tgr@deploy1003 zhaofjx, tgr: Backport for [[gerrit:1151284|cowikimedia: Enable Translate&Notifications Exten. (T386776)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:18:20] checking [20:19:48] tgr verified [20:21:54] !log tgr@deploy1003 zhaofjx, tgr: Continuing with sync [20:22:03] (03PS1) 10Bking: search: Update dnsdisc envoy upstreams [puppet] - 10https://gerrit.wikimedia.org/r/1151316 (https://phabricator.wikimedia.org/T143553) [20:28:37] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:28:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1067-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [20:28:50] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151284|cowikimedia: Enable Translate&Notifications Exten. (T386776)]] (duration: 13m 10s) [20:28:55] T386776: Extension:Translate and Extension:TranslationNotifications on co.wikimedia.org - https://phabricator.wikimedia.org/T386776 [20:29:25] anzx: last call :) [20:29:33] tgr works fine, thanks for deploy! 
[20:33:34] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:33:59] (03CR) 10Ebernhardson: [C:03+1] search: Update dnsdisc envoy upstreams [puppet] - 10https://gerrit.wikimedia.org/r/1151316 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [20:34:00] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on elastic1067.eqiad.wmnet with reason: downtime until decom [20:39:20] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on elastic1054.eqiad.wmnet with reason: downtime until decom [20:39:50] tgr: here [20:43:34] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:45:07] (03PS3) 10Scott French: mediawiki/apache: redirect tj.*.org to tg.*.org for all projects [puppet] - 10https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) (owner: 10Dzahn) [20:45:59] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) (owner: 10Dzahn) [20:56:14] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151339 [20:58:16] (03CR) 10Scott French: [C:03+1] "Thanks, Daniel! This looks good. 
I'll work with Jasmine to get this merged and deployed during a MediaWiki infrastructure window some time" [puppet] - 10https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) (owner: 10Dzahn) [21:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250527T2100) [21:00:41] Hello we will be using the window today [21:00:59] thcipriani: excited to try the UI finally!!! [21:02:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151280 (https://phabricator.wikimedia.org/T393940) (owner: 10Kimberly Sarabia) [21:02:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151290 (https://phabricator.wikimedia.org/T393943) (owner: 10Bernard Wang) [21:02:32] (03CR) 10Mazevedo: [C:03+1] App Interaction:: Add Tabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151299 (owner: 10GOlson) [21:03:17] (03Merged) 10jenkins-bot: Deploy summaries pilot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151280 (https://phabricator.wikimedia.org/T393940) (owner: 10Kimberly Sarabia) [21:03:28] (03Merged) 10jenkins-bot: Deploy Vector empty search recommendations to wikivoyage and group 1 wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151290 (https://phabricator.wikimedia.org/T393943) (owner: 10Bernard Wang) [21:03:50] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1151280|Deploy summaries pilot (T393940)]], [[gerrit:1151290|Deploy Vector empty search recommendations to wikivoyage and group 1 wikipedias (T393943)]] [21:03:55] T393940: Deploy Summaries pilot - https://phabricator.wikimedia.org/T393940 [21:03:55] T393943: Deploy Vector empty search recommendations to pilot wikis - https://phabricator.wikimedia.org/T393943 [21:04:05] 🎉 [21:05:53] !log 
toyofuku@deploy1003 toyofuku, ksarabia, bwang: Backport for [[gerrit:1151280|Deploy summaries pilot (T393940)]], [[gerrit:1151290|Deploy Vector empty search recommendations to wikivoyage and group 1 wikipedias (T393943)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:07:06] (03CR) 10Bking: [C:03+2] deployment-prep: cleanup deployment-elastic values [puppet] - 10https://gerrit.wikimedia.org/r/1134151 (https://phabricator.wikimedia.org/T389971) (owner: 10DCausse) [21:08:35] anzx: sorry! can do it next time [21:08:46] Seems to be a legitimate failure - backing out bc I'm afraid of accidentally clicking proceed [21:08:54] Gonna still be using the window though, apologies [21:09:43] tgr: np, will schedule it for tomorrow [21:09:51] Not sure how else to note this, but nobody deploy for a second please!! I'm not seeing how either of these patches could be responsible, but getting a 5xx on the test servers after deploying to them [21:10:52] Could use some help from probably this channel while I look into it [21:13:02] JK wikipedia is down in prod [21:13:04] ?? [21:13:16] toyofuku: scap should stop the deploy automatically if the testservers don't work [21:13:26] It did and I aborted for good measure [21:13:33] This doesn't seem to be related to our deploy [21:13:39] But can you load english wikipedia? [21:13:43] yes [21:13:51] I'm getting a 503 [21:13:53] Glad it's just me [21:13:57] also there's lots of automated monitoring [21:14:03] a network issue maybe? [21:14:09] what URL exactly? [21:14:23] I mean I'm using IRC fine but I'll restart [21:14:24] https://en.wikipedia.org/ [21:14:31] and https://en.m.wikipedia.org/wiki/Main_Page [21:14:47] I mean, something between you and the Wikimedia servers [21:14:58] Okay no it is the test servers, apologies for the false alarm [21:15:03] My extension was stuck I guess [21:15:52] ok. ping us if we can help check anything. 
[21:15:59] Still seeing 503 on testservers (luckily not prod) [21:16:13] Is that happening for anyone else or am I experiencing Steph-specific problems fr? [21:16:27] toyofuku: are you sure jk.wikipedia is a valid language wiki? [21:16:36] scap doesn't automatically revert the testservers [21:17:01] and I don't think spiderpig supports reverts [21:17:28] I can revert, but I'm not sure how either of these patches would have brought down the test servers [21:17:29] so you need to ssh to the deploy host and do `scap backport --revert ` [21:17:33] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1151280 [21:17:36] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1151290 [21:18:17] i can repro the 503 on test servers, fwiw [21:18:22] p858snake|cloud: I was saying just kidding (english) wikipedia is down in prod (which turned out not to be true) [21:18:23] mwdebug is working for me [21:18:35] well enwiki isn't [21:18:52] so I guess it's the article summary? [21:18:54] 06SRE, 10DNS, 06serviceops, 06Traffic, 13Patch-For-Review: Create redirect from tj.*.org to tg.*.org - https://phabricator.wikimedia.org/T393803#10861433 (10Scott_French) Ah, I realized when cross-checking https://gerrit.wikimedia.org/r/1148981 against DNS that the `tj` records have not been added yet th... [21:18:55] on k8s-mwdebug [21:18:57] Good to know, so probably the article summary [21:19:12] (and mwdebug too, actually) [21:19:14] Any chance we have server logs from beta somewhere that I can grab before we turn it off? [21:19:27] https://logstash.wikimedia.org/app/dashboards#/view/mwdebug1002 [21:19:29] Trying to figure out why this brought down enwiki when it's live in beta [21:19:40] if you mean mwdebug, not beta [21:19:45] Thank you both/all! 
[21:19:50] Yes I meant mwdebug [21:20:00] doesn't have any errors though [21:20:12] I will have test servers back up and running in a few - gonna try to spam the error first to get a signal [21:20:18] I was using k8-mwdebug [21:20:50] we are supposed to have logs, but i also can't find much [21:20:57] it's not a mediawiki 503, i'm pretty sure [21:21:07] https://logstash.wikimedia.org/goto/94c8d7008dcdc78773f9b6cda70743a8 [21:21:24] these are logs from the 503s i am seeing. not sure what to make of them [21:21:35] or it breaks hard enough that it can't even log [21:21:59] 🙃 [21:22:04] it could be that something different magically broke when you tried deploying 🤷‍♂️ [21:22:25] Anything is possible when it comes to computers [21:22:37] Any recommendations for debugging things to try before I revert that patch? [21:22:54] Hey folks! Can summarize for me what problem is currently being discussed? [21:22:55] find someone who understands apache logs [21:23:07] dancy: Apologies, everything is alright [21:23:15] revert on the deploy host, ssh into a debug host, scap pull, see if that fixes it [21:23:23] dancy: if you enable mwdebug, you get 503s on en.wp, and we don't know why [21:23:25] test servers are returning 503 for english wikipedia [21:23:29] (production is fine) [21:23:34] because of a patch I shipped [21:23:38] canceled on test servers [21:23:40] prod is fine [21:23:45] but nothing in the logs...is bizarre [21:23:46] about to revert said patch [21:23:57] toyofuku: Thank you! [21:23:59] we were wondering if we could figure out what's causing the 503s before I turn them off [21:24:04] Thank _you_! [21:24:16] (03PS1) 10ZhaoFJx: Revert "cowikimedia: Enable Translate&Notifications Exten." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151364 [21:24:21] So many people here to help and all I had to do was falsely claim enwiki was down in prod [21:24:25] 😅 [21:24:27] `Error 1146: Table 'cowikimedia.revtag' doesn't exist` ? 
[21:24:52] (guessing, based on the title of the revert commit) [21:25:05] that means the translate extension db tables weren't created first [21:25:24] hmmmm [21:25:32] different wiki though [21:25:57] and ZhaoFJx tested that patch [21:26:15] I need an emergency deploy for https://gerrit.wikimedia.org/r/1151364 -- context is T386776, are SRE ok with a deployment? (cc: thcipriani bartosz arnoldokoth) I need someone to deploy. [21:26:16] T386776: Extension:Translate and Extension:TranslationNotifications on co.wikimedia.org - https://phabricator.wikimedia.org/T386776 [21:26:25] (03PS1) 10Stoyofuku-wmf: Revert "Deploy summaries pilot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151365 [21:26:42] Revert: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1151365 [21:26:46] i was about to be off and i don't think i can help much at this point, so i'll leave you to it. good luck figuring this out D: [21:26:57] tgr I did test, interestingly it worked at first, but after I left the train, it no longer works... [21:27:00] ZhaoFJx: I think I'm blocking deploys please hold on a sec [21:27:04] hm [21:27:05] (03CR) 10BCornwall: [C:03+1] Add search-{psi,omega}.svc.$dc.wmnet cnames [dns] - 10https://gerrit.wikimedia.org/r/1151303 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [21:27:10] (03CR) 10BCornwall: [C:03+1] search: Add search-{psi,omega} geoip discovery entries [dns] - 10https://gerrit.wikimedia.org/r/1151304 (owner: 10Ebernhardson) [21:27:10] toyofuku take your time [21:27:13] in any case can't see that affecting enwiki [21:27:16] thank you [21:27:31] toyofuku: do you want me to test the revert? 
[21:27:39] I mean I wanna clear the runway and still finish the deploy of the other patch assuming it's fine so let's do that first [21:27:44] Testing it would be great thank you [21:27:57] If someone wouldn't mind approving unless I'm allowed to self +1 that would also be great [21:28:05] reviewing/approving* [21:28:09] I'll revert locally first [21:28:12] Its a small wiki so not really urgent [21:28:21] as long as it can be fixed today, it would be fine [21:28:24] sorry for the mess [21:28:33] it was a mess before you got here <3 [21:28:50] !log tgr@deploy1003 Locking from deployment [MediaWiki]: debugging gerrit 1151280 [21:30:55] You can self +2 reverts, there's rarely an issue with that [21:31:05] Sounds good [21:31:17] (03CR) 10Stoyofuku-wmf: [C:03+1] "Looks good to me, the author" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151365 (owner: 10Stoyofuku-wmf) [21:31:44] Deploying out to test servers at least [21:31:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151365 (owner: 10Stoyofuku-wmf) [21:32:07] (03CR) 10BCornwall: [C:03+1] conftool: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [21:32:41] (03Merged) 10jenkins-bot: Revert "Deploy summaries pilot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151365 (owner: 10Stoyofuku-wmf) [21:34:02] okay so set $wmgUseArticleSummaries to false for enwiki, scap pulled to mwdebug1002, enwiki works again [21:34:08] that's pretty definitive [21:34:23] (03PS2) 10ZhaoFJx: Revert "cowikimedia: Enable Translate&Notifications Exten." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151364 (https://phabricator.wikimedia.org/T395382) [21:34:28] Yeah we can solve this while not blocking the deploy train [21:35:00] I'm gonna deploy the revert and the other patch assuming enwiki is working on testservers again [21:35:06] And that should take care of everything, right? [21:35:07] !log tgr@deploy1003 Unlocked for deployment [MediaWiki]: debugging gerrit 1151280 (duration: 06m 17s) [21:35:24] the other patch is already merged, right? [21:35:25] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1151365|Revert "Deploy summaries pilot"]] [21:35:30] correct [21:35:39] so you can deploy the revert, that will sync out the other change [21:35:46] three patches total merged: summaries, typeahead search, summaries revert [21:35:52] Okay perfect [21:36:03] although I guess no harm in adding it explicitly [21:36:11] Want to double triple check my understanding since this is the spiciest deploy I've done here so far [21:36:27] I already started the revert deploy without it, but assuming it'll go out since it's on the branch [21:36:28] 🌶️ [21:36:40] yeah its fine either way [21:37:01] UI had a note saying `WARNING: Nothing has been rolled back.` which I interpreted to mean "next deploy is taking these patches with it" [21:37:15] Okay cool, thanks everyone for playing [21:37:32] !log toyofuku@deploy1003 toyofuku: Backport for [[gerrit:1151365|Revert "Deploy summaries pilot"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:37:35] yeah [21:38:01] to a first approximation, scap / spiderpig will just deploy the current state of the git master [21:38:35] And enwiki is indeed working on mwdebug [21:38:39] Twas the summaries patch [21:38:51] if there is a difference between that and the git checkout on the deploy host that's not explained by the patch you are telling it to deploy, it will complain about it, but then deploy the git master anyway [21:39:16] (if you tell it to) [21:41:43] Proceeding with deploy [21:41:47] !log toyofuku@deploy1003 toyofuku: Continuing with sync [21:44:03] (03CR) 10Arlolra: "@jforrester@wikimedia.org Thanks for scheduling the backport. I've requested access to spiderpig because I don't have permission to use i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148950 (https://phabricator.wikimedia.org/T394054) (owner: 10Arlolra) [21:48:42] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151365|Revert "Deploy summaries pilot"]] (duration: 13m 16s) [21:49:03] ZhaoFJx: should be all yours now [21:49:41] toyofuku thanks [21:49:51] 🫡 [21:50:02] Thanks once again to all who helped me while I was floundering [21:50:18] Spider pig was great!! [21:50:42] PROBLEM - Disk space on restbase1031 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 64285 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1031&var-datasource=eqiad+prometheus/ops [21:51:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151364 (https://phabricator.wikimedia.org/T395382) (owner: 10ZhaoFJx) [21:51:53] toyofuku: well I'm glad you got to try it out! sorry you had a harrowing deploy <3 [21:51:55] (03Merged) 10jenkins-bot: Revert "cowikimedia: Enable Translate&Notifications Exten." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151364 (https://phabricator.wikimedia.org/T395382) (owner: 10ZhaoFJx) [21:52:15] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1151364|Revert "cowikimedia: Enable Translate&Notifications Exten." (T395382)]] [21:52:20] T395382: Internal error after deployment on cowikimedia - https://phabricator.wikimedia.org/T395382 [21:52:29] Harrowing Deploy is my middle name lol [21:54:22] !log tgr@deploy1003 zhaofjx, tgr: Backport for [[gerrit:1151364|Revert "cowikimedia: Enable Translate&Notifications Exten." (T395382)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:54:51] A pop-up said "⧼cm-mw-i18n-failed⧽" [21:55:33] Its gone now after force refresh [21:55:49] tgr the patch works [22:01:47] RESOLVED: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:02:41] !log tgr@deploy1003 zhaofjx, tgr: Continuing with sync [22:03:17] !log Cleaning up logs older than 70 days in centrallog2002 [22:03:19] that sounds like an untranslated i18n key [22:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:44] or maybe just some hiccup rebuilding the message cache [22:04:04] sure [22:04:54] (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: add row F, remove soon-to-be-decom hosts [puppet] - 10https://gerrit.wikimedia.org/r/1151294 (https://phabricator.wikimedia.org/T394350) (owner: 10Bking) [22:05:00] anzx: still around? 
[22:05:36] tgr: yes i am here [22:05:36] (03CR) 10Bking: [C:03+2] cirrussearch: add row F, remove soon-to-be-decom hosts [puppet] - 10https://gerrit.wikimedia.org/r/1151294 (https://phabricator.wikimedia.org/T394350) (owner: 10Bking) [22:05:48] we can deploy your changes after this [22:06:02] although it seems like a somewhat cursed day for deploys [22:07:04] Agreed [22:08:39] https://co.wikimedia.org/ is now back to normal [22:08:58] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch1096.eqiad.wmnet|cirrussearch1097.eqiad.wmnet|cirrussearch1098.eqiad.wmnet|cirrussearch1099.eqiad.wmnet|cirrussearch1100.eqiad.wmnet|cirrussearch1101.eqiad.wmnet|cirrussearch1102.eqiad.wmnet|cirrussearch1107.eqiad.wmnet|cirrussearch1110.eqiad.wmnet|cirrussearch1124.eqiad.wmnet|cirrussearch1125.eqiad.wmnet [22:09:42] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151364|Revert "cowikimedia: Enable Translate&Notifications Exten." (T395382)]] (duration: 17m 26s) [22:09:46] T395382: Internal error after deployment on cowikimedia - https://phabricator.wikimedia.org/T395382 [22:12:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151210 (https://phabricator.wikimedia.org/T393551) (owner: 10Anzx) [22:12:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150665 (https://phabricator.wikimedia.org/T395193) (owner: 10Anzx) [22:12:10] (03CR) 10Dzahn: mediawiki/apache: redirect tj.*.org to tg.*.org for all projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) (owner: 10Dzahn) [22:13:57] (03Merged) 10jenkins-bot: slwikibooks: update tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151210 (https://phabricator.wikimedia.org/T393551) (owner: 10Anzx) [22:13:59] (03Merged) 
10jenkins-bot: ruwikisource: add Автор (Author) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150665 (https://phabricator.wikimedia.org/T395193) (owner: 10Anzx) [22:14:23] (03CR) 10Amire80: mediawiki/apache: redirect tj.*.org to tg.*.org for all projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) (owner: 10Dzahn) [22:14:24] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1151210|slwikibooks: update tagline (T393551)]], [[gerrit:1150665|ruwikisource: add Автор (Author) namespace (T395193)]] [22:14:30] T393551: wikibooks-tagline-sl.svg is not displayed correctly in Firefox - https://phabricator.wikimedia.org/T393551 [22:14:30] T395193: Add Author namespace for Russian Wikisource - https://phabricator.wikimedia.org/T395193 [22:15:44] (03PS2) 10Dzahn: cache/text: remove commented reference to static-rt from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1137485 [22:15:58] (03CR) 10CI reject: [V:04-1] cache/text: remove commented reference to static-rt from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1137485 (owner: 10Dzahn) [22:16:16] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [22:16:30] !log tgr@deploy1003 anzx, tgr: Backport for [[gerrit:1151210|slwikibooks: update tagline (T393551)]], [[gerrit:1150665|ruwikisource: add Автор (Author) namespace (T395193)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:17:14] tgr: both looks good [22:17:25] !log tgr@deploy1003 anzx, tgr: Continuing with sync [22:17:54] (03PS1) 10Bking: relforge: disable monitoring notifications [puppet] - 10https://gerrit.wikimedia.org/r/1151381 (https://phabricator.wikimedia.org/T395309) [22:17:55] (03PS3) 10Dzahn: cache/text: remove commented reference to static-rt from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1137485 [22:18:07] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151381 (https://phabricator.wikimedia.org/T395309) (owner: 10Bking) [22:19:01] (03PS4) 10Dzahn: cache/text: remove commented reference to static-rt from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1137485 (https://phabricator.wikimedia.org/T385777) [22:20:14] (03PS1) 10Dzahn: profile: delete static_rt profile and erb template [puppet] - 10https://gerrit.wikimedia.org/r/1151382 (https://phabricator.wikimedia.org/T385777) [22:21:23] (03PS2) 10Bking: relforge: disable monitoring notifications [puppet] - 10https://gerrit.wikimedia.org/r/1151381 (https://phabricator.wikimedia.org/T395309) [22:22:17] (03CR) 10Dzahn: "somehow I feel like we will want to change settings for just one of them in the future" [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [22:24:18] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151210|slwikibooks: update tagline (T393551)]], [[gerrit:1150665|ruwikisource: add Автор (Author) namespace (T395193)]] (duration: 09m 53s) [22:24:20] tgr: please run for slwikibooks logo change https://www.irccloud.com/pastebin/78TYcPam/ [22:24:23] T393551: wikibooks-tagline-sl.svg is not displayed correctly in Firefox - https://phabricator.wikimedia.org/T393551 [22:24:23] T395193: Add Author namespace for Russian Wikisource - https://phabricator.wikimedia.org/T395193 [22:25:13] tgr: and also https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#namespaceDupes for 
ruwikisource namespace change [22:25:19] sure [22:27:33] done [22:27:39] tgr: thanks for deploying [22:27:53] !log UTC late deploys done [22:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:11] It was a long deploy [22:50:23] (03PS1) 10Kimberly Sarabia: Deploy summaries to text wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151385 (https://phabricator.wikimedia.org/T393940) [23:06:28] (03CR) 10Dzahn: "The reason I have 3 separate pending changes is because I wanted to avoid mixing a reorganisation of lookups with the simple "start replic" [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [23:16:08] (03PS1) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [23:16:08] (03CR) 10Andrea Denisse: "Hi team, I'm introducing this config file temporarily to gather data to debug the issue further as suggested by one of the rsyslog maintai" [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [23:38:29] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1151387 [23:38:29] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1151387 (owner: 10TrainBranchBot) [23:49:50] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1151387 (owner: 10TrainBranchBot)