[00:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166943 (owner: TrainBranchBot)
[00:07:50] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167313
[00:07:50] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167313 (owner: TrainBranchBot)
[00:11:26] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:11:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:12:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[00:15:47] (CR) Xcollazo: "PPC looks good."
[puppet] - https://gerrit.wikimedia.org/r/1167286 (https://phabricator.wikimedia.org/T399013) (owner: Xcollazo)
[00:27:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[00:30:22] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167313 (owner: TrainBranchBot)
[01:01:26] ops-eqiad, DC-Ops: Unresponsive management for thanos-be1006.mgmt:22 - https://phabricator.wikimedia.org/T399052 (phaultfinder) NEW
[01:13:33] FIRING: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:18:33] FIRING: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:19:36] RESOLVED: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:26:47] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10986891 (Andrew) >>! In T394333#10986464, @Jclark-ctr wrote: > @dcaro @Andrew @cmooney @ayounsi I need some assistance.
I need to open a block of 4x...
[03:18:22] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1152065 (https://phabricator.wikimedia.org/T348513) (owner: Abijeet Patro)
[04:11:41] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:12:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:13:53] (PS1) Clare Ming: xLab: Deploy v0.7.8 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1167322 (https://phabricator.wikimedia.org/T397363)
[04:14:37] (CR) Clare Ming: [C:+2] xLab: Deploy v0.7.8 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1167322 (https://phabricator.wikimedia.org/T397363) (owner: Clare Ming)
[04:16:06] (Merged) jenkins-bot: xLab: Deploy v0.7.8 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1167322 (https://phabricator.wikimedia.org/T397363) (owner: Clare Ming)
[04:16:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:17:10]
RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:21:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:23:25] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[04:23:57] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[04:40:00] SRE: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - https://phabricator.wikimedia.org/T398161#10987045 (Joe) Quite a bit of the rationalization will depend upon the results of another hypothesis, the one about trusted bots. What we can however build while that's still being designed. What should...
[05:54:02] Deploying MinT on staging (staging only change)
[05:54:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:54:34] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:55:04] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 2.049 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:55:24] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.146 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:58:21] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T0600)
[06:06:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:07:56] (PS1) KartikMistry: machinetranslation: Remove extra / from s3 URL [deployment-charts] - https://gerrit.wikimedia.org/r/1167328 (https://phabricator.wikimedia.org/T335491)
[06:12:44] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10987083 (Marostegui) Thank you I can reach them finely.
[06:13:23] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:14:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2234].codfw.wmnet,db[1213,1217,1250].eqiad.wmnet with reason: m3 master switchover T398818
[06:14:36] T398818: Switchover m3 (phabricator) master db1213 -> db1250 - https://phabricator.wikimedia.org/T398818
[06:14:52] (CR) Volans: "I think you could take advantage of existing implementations in wmflib and spicerack. I've suggested them inline." [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: Ryan Kemper)
[06:15:54] (PS1) Marostegui: m3 proxies: Add db1250 [puppet] - https://gerrit.wikimedia.org/r/1167329 (https://phabricator.wikimedia.org/T398818)
[06:16:24] (CR) Fabfur: [C:+2] varnish: replace X-Public-Cloud with new X-Provenance header check [puppet] - https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) (owner: Fabfur)
[06:17:04] (CR) Marostegui: [C:+2] m3 proxies: Add db1250 [puppet] - https://gerrit.wikimedia.org/r/1167329 (https://phabricator.wikimedia.org/T398818) (owner: Marostegui)
[06:18:23] (CR) KartikMistry: [C:+2] machinetranslation: Remove extra / from s3 URL [deployment-charts] - https://gerrit.wikimedia.org/r/1167328 (https://phabricator.wikimedia.org/T335491) (owner: KartikMistry)
[06:19:59] (Merged) jenkins-bot: machinetranslation: Remove extra / from s3 URL [deployment-charts] - https://gerrit.wikimedia.org/r/1167328 (https://phabricator.wikimedia.org/T335491) (owner: KartikMistry)
[06:20:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2234].codfw.wmnet,db[1213,1217,1250].eqiad.wmnet with reason: m3 master switchover T398818
[06:20:29] T398818: Switchover m3 (phabricator) master db1213 -> db1250 -
https://phabricator.wikimedia.org/T398818
[06:21:01] (PS1) Marostegui: Revert "m3 proxies: Add db1250" [puppet] - https://gerrit.wikimedia.org/r/1167330
[06:21:04] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[06:21:32] (CR) Marostegui: [C:+2] Revert "m3 proxies: Add db1250" [puppet] - https://gerrit.wikimedia.org/r/1167330 (owner: Marostegui)
[06:23:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:27:00] (PS1) Marostegui: mariadb: Promote db1250 to m3 master [puppet] - https://gerrit.wikimedia.org/r/1167375 (https://phabricator.wikimedia.org/T398818)
[06:27:54] (CR) Marostegui: [C:+2] mariadb: Promote db1250 to m3 master [puppet] - https://gerrit.wikimedia.org/r/1167375 (https://phabricator.wikimedia.org/T398818) (owner: Marostegui)
[06:28:51] I am going to switch phabricator master, expect around 1 minute of RO time https://phabricator.wikimedia.org/T398818
[06:29:19] !log Failover m3 from db1213 to db1250 - T398818
[06:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:22] T398818: Switchover m3 (phabricator) master db1213 -> db1250 - https://phabricator.wikimedia.org/T398818
[06:31:58] Done, RO was 30 seconds
[06:36:04] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:36:28] (PS1) Marostegui: db1213: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167435 (https://phabricator.wikimedia.org/T398805)
[06:37:07] (CR) Marostegui: [C:+2] db1213: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167435 (https://phabricator.wikimedia.org/T398805) (owner: Marostegui)
[06:38:34] (CR) Volans: "My 2.5
cents inline" [cookbooks] - https://gerrit.wikimedia.org/r/1164147 (owner: Ayounsi)
[06:43:21] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:44:34] SRE, LDAP-Access-Requests: Grant Access to for  - https://phabricator.wikimedia.org/T399020#10987164 (Aklapper) Open→Invalid Hi, per https://phabricator.wikimedia.org/tag/ldap-access-requests/ , `wmf` membership needs to be requested via IDM nowadays
[06:46:52] (PS1) Marostegui: db2232: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167437 (https://phabricator.wikimedia.org/T399060)
[06:47:18] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2232].codfw.wmnet,db[1207,1217].eqiad.wmnet with reason: migration to mariadb 10.11
[06:49:06] (CR) Marostegui: [C:+2] db2232: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167437 (https://phabricator.wikimedia.org/T399060) (owner: Marostegui)
[06:53:04] (PS1) Elukey: EventStreamConfig: add the maps.tiles_change_bookworm stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1167438 (https://phabricator.wikimedia.org/T381565)
[06:53:59] (CR) Elukey: services: configure tegola in codfw to use maps-test (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1165550 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[06:54:08] kart_, o/
[06:54:48] here
[07:00:04] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T0700). Please do the needful.
[07:00:05] abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process.
Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:02:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:03:07] (PS1) Volans: cookbook API: expand argument_task_required docs [software/spicerack] - https://gerrit.wikimedia.org/r/1167442
[07:03:12] kart_, shall we start?
[07:03:28] sure. Give me a minute.
[07:04:34] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1152065 (https://phabricator.wikimedia.org/T348513) (owner: Abijeet Patro)
[07:04:45] abijeet: You can watch deployment at, https://spiderpig.wikimedia.org/jobs/288
[07:05:27] (Merged) jenkins-bot: CX: Add virtual-cx-shared DatabaseVirtualDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/1152065 (https://phabricator.wikimedia.org/T348513) (owner: Abijeet Patro)
[07:05:28] kart_, cool
[07:05:31] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: sync
[07:05:58] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1152065|CX: Add virtual-cx-shared DatabaseVirtualDomains (T348513)]]
[07:06:01] T348513: Migrate ContentTranslation to use a virtual database domain - https://phabricator.wikimedia.org/T348513
[07:08:08] !log kartik@deploy1003 kartik, abi: Backport for [[gerrit:1152065|CX: Add virtual-cx-shared DatabaseVirtualDomains (T348513)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:10:12] abijeet: Testing please..
[07:10:19] kart_, I can test with the debug tool?
[07:10:25] Yes
[07:10:56] kart_, ok thanks. We just need to check that CX can still save drafts and publish, even if its to the user namespace.
On it
[07:11:04] (CR) Vgutierrez: [C:+2] hiera: Issue dedicated certs for probenet endpoints [puppet] - https://gerrit.wikimedia.org/r/1167143 (https://phabricator.wikimedia.org/T398596) (owner: Vgutierrez)
[07:17:15] abijeet: I can save and load article
[07:17:22] (CR) Jgiannelos: "For what it's worth PCS *can* render output for files:" [deployment-charts] - https://gerrit.wikimedia.org/r/1167249 (https://phabricator.wikimedia.org/T397750) (owner: Hnowlan)
[07:17:33] kart_, testing on the mobile editor once
[07:20:29] (having to wait for the 10min "Review translation" dialog)
[07:20:37] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: sync
[07:20:49] Publishing on username space worked.
[07:22:52] kart_, looks good. lets keep an eye on the logs as well
[07:23:10] !log upload python3-docker-report to bookworm-wikimedia
[07:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:19] !log upload python3-docker-report 0.0.16 to bookworm-wikimedia
[07:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:56] (PS1) Jgiannelos: pcs: Disable staging profiler, set log level to info [deployment-charts] - https://gerrit.wikimedia.org/r/1167453
[07:26:03] !log kartik@deploy1003 kartik, abi: Continuing with sync
[07:26:09] abijeet: sure
[07:26:39] (CR) Filippo Giunchedi: [C:+1] pcs: Disable staging profiler, set log level to info [deployment-charts] - https://gerrit.wikimedia.org/r/1167453 (owner: Jgiannelos)
[07:27:52] (PS2) Jgiannelos: pcs: Set staging log level to info [deployment-charts] - https://gerrit.wikimedia.org/r/1167453
[07:28:57] (PS1) Brouberol: airflow-ml: update the principal primary to analytics [deployment-charts] - https://gerrit.wikimedia.org/r/1167508 (https://phabricator.wikimedia.org/T398907)
[07:28:58] (PS1) Brouberol: airflow-ml: enable the hadoop shell [deployment-charts] -
https://gerrit.wikimedia.org/r/1167509 (https://phabricator.wikimedia.org/T398907)
[07:29:23] (CR) Jgiannelos: [C:+2] pcs: Set staging log level to info [deployment-charts] - https://gerrit.wikimedia.org/r/1167453 (owner: Jgiannelos)
[07:31:09] !log installing nginx security updates
[07:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:12] (Merged) jenkins-bot: pcs: Set staging log level to info [deployment-charts] - https://gerrit.wikimedia.org/r/1167453 (owner: Jgiannelos)
[07:31:19] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152065|CX: Add virtual-cx-shared DatabaseVirtualDomains (T348513)]] (duration: 25m 21s)
[07:31:23] T348513: Migrate ContentTranslation to use a virtual database domain - https://phabricator.wikimedia.org/T348513
[07:31:36] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[07:31:41] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[07:32:14] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[07:32:35] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[07:34:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1036', diff saved to https://phabricator.wikimedia.org/P78817 and previous config saved to /var/cache/conftool/dbconfig/20250709-073458-marostegui.json
[07:38:11] (PS1) Marostegui: mariadb: Productionize es1047 [puppet] - https://gerrit.wikimedia.org/r/1167520 (https://phabricator.wikimedia.org/T395771)
[07:39:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1036.eqiad.wmnet with reason: Maintenance
[07:42:10] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache
[07:42:10] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99)
[07:42:17] !log fceratto@cumin1002 START - Cookbook
sre.mysql.parsercache
[07:42:17] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99)
[07:42:20] (CR) Marostegui: [C:+2] mariadb: Productionize es1047 [puppet] - https://gerrit.wikimedia.org/r/1167520 (https://phabricator.wikimedia.org/T395771) (owner: Marostegui)
[07:46:29] (CR) Kevin Bazira: [C:+1] airflow-ml: update the principal primary to analytics [deployment-charts] - https://gerrit.wikimedia.org/r/1167508 (https://phabricator.wikimedia.org/T398907) (owner: Brouberol)
[07:50:38] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache
[07:50:39] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[07:50:59] (PS1) Jgiannelos: pcs: Use purge only requests for staging [deployment-charts] - https://gerrit.wikimedia.org/r/1167524
[07:52:28] (PS2) Jgiannelos: pcs: Use purge only requests for staging mobile-html transcludes [deployment-charts] - https://gerrit.wikimedia.org/r/1167524
[07:53:27] (CR) Kevin Bazira: [C:+1] airflow-ml: enable the hadoop shell [deployment-charts] - https://gerrit.wikimedia.org/r/1167509 (https://phabricator.wikimedia.org/T398907) (owner: Brouberol)
[07:53:53] (CR) Brouberol: [C:+2] airflow-ml: update the principal primary to analytics [deployment-charts] - https://gerrit.wikimedia.org/r/1167508 (https://phabricator.wikimedia.org/T398907) (owner: Brouberol)
[07:53:55] (CR) Brouberol: [C:+2] airflow-ml: enable the hadoop shell [deployment-charts] - https://gerrit.wikimedia.org/r/1167509 (https://phabricator.wikimedia.org/T398907) (owner: Brouberol)
[07:54:31] (CR) Hashar: "recheck after having deployed the CI config (If4e694a76891f65fa159b4e3c0aca26c996ffe6c and I426d3370f3d290f938bdefecc92b1e31e6300e3f)" [container/codesearch] - https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[07:54:37] (PS3) Jgiannelos: pcs: Use purge only requests for staging
mobile-html transcludes [deployment-charts] - https://gerrit.wikimedia.org/r/1167524
[07:54:48] (CR) CI reject: [V:-1] rename build pipelines for sourcebot [container/codesearch] - https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[07:55:27] (Merged) jenkins-bot: airflow-ml: update the principal primary to analytics [deployment-charts] - https://gerrit.wikimedia.org/r/1167508 (https://phabricator.wikimedia.org/T398907) (owner: Brouberol)
[07:55:35] (Merged) jenkins-bot: airflow-ml: enable the hadoop shell [deployment-charts] - https://gerrit.wikimedia.org/r/1167509 (https://phabricator.wikimedia.org/T398907) (owner: Brouberol)
[07:58:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[07:58:42] (PS3) Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389)
[07:58:51] (CR) Jgiannelos: "@hnowlan@wikimedia.org Happy to hear if you have a better to idea to only override the header using YAML anchors, but i didn't find a bett" [deployment-charts] - https://gerrit.wikimedia.org/r/1167524 (owner: Jgiannelos)
[07:58:54] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[07:58:58] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade Replica to GitLab 18.0
[08:00:04] andre and jnuche: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T0800).
[08:00:12] o/
[08:01:02] (PS4) Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389)
[08:01:31] !log aklapper@deploy1003 Started scap sync-world: Backport for [[gerrit:1167296|Remove stdClass type hint from ApiFeedContributions::feedItem() for now (T398925)]]
[08:01:36] T398925: TypeError: MediaWiki\Api\ApiFeedContributions::feedItem(): Argument #1 ($row) must be of type stdClass, Flow\Formatter\ContributionsRow given, called in /srv/mediawiki/php-1.45.0-wmf.9/includes/api/ApiFeedContributions.php on l - https://phabricator.wikimedia.org/T398925
[08:02:45] (CR) Volans: [C:+1] "I'm not too familiar with this puppettization but the change looks ok to me." [puppet] - https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) (owner: Elukey)
[08:03:57] !log aklapper@deploy1003 zabe, aklapper: Backport for [[gerrit:1167296|Remove stdClass type hint from ApiFeedContributions::feedItem() for now (T398925)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:04:46] !log aklapper@deploy1003 zabe, aklapper: Continuing with sync
[08:09:53] !log aklapper@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167296|Remove stdClass type hint from ApiFeedContributions::feedItem() for now (T398925)]] (duration: 08m 21s)
[08:09:58] T398925: TypeError: MediaWiki\Api\ApiFeedContributions::feedItem(): Argument #1 ($row) must be of type stdClass, Flow\Formatter\ContributionsRow given, called in /srv/mediawiki/php-1.45.0-wmf.9/includes/api/ApiFeedContributions.php on l - https://phabricator.wikimedia.org/T398925
[08:11:07] (PS1) TrainBranchBot: group1 to 1.45.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1167529 (https://phabricator.wikimedia.org/T392179)
[08:11:13] (CR) TrainBranchBot: [C:+2] group1 to 1.45.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1167529 (https://phabricator.wikimedia.org/T392179) (owner: TrainBranchBot)
[08:11:41] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:11:59] (Merged) jenkins-bot: group1 to 1.45.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1167529 (https://phabricator.wikimedia.org/T392179) (owner: TrainBranchBot)
[08:14:10] I will be rolling out a minor Netbox update in a few minutes.
See: https://phabricator.wikimedia.org/T397300
[08:15:27] (PS1) Fabfur: cache::haproxy: Use a separate site for port 80 [puppet] - https://gerrit.wikimedia.org/r/1167530
[08:15:39] (PS2) Fabfur: cache::haproxy: Use a separate site for port 80 [puppet] - https://gerrit.wikimedia.org/r/1167530
[08:16:49] (PS3) Fabfur: cache::haproxy: Use a separate site for port 80 [puppet] - https://gerrit.wikimedia.org/r/1167530
[08:17:07] !log slyngshede@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[08:17:35] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[08:18:26] !log Deploying Netbox v4.0.11 to production T397300
[08:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:29] T397300: Upgrade Netbox to version 4.0.11 - https://phabricator.wikimedia.org/T397300
[08:18:54] (PS5) Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389)
[08:19:14] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1167530 (owner: Fabfur)
[08:20:33] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.9 refs T392179
[08:20:37] T392179: 1.45.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T392179
[08:21:01] (PS6) Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389)
[08:21:06] (PS2) Hashar: rename build pipelines for sourcebot [container/codesearch] - https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[08:21:09] !log slyngshede@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.11 to production -
slyngshede@cumin1002
[08:21:53] (PS7) Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389)
[08:24:38] (CR) Hashar: "I have added:" [container/codesearch] - https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[08:24:45] (CR) Hashar: [C:+1] rename build pipelines for sourcebot [container/codesearch] - https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[08:26:48] (PS1) Brouberol: kafka-jumbo: enable ingress traffic from cumin masters [puppet] - https://gerrit.wikimedia.org/r/1167533 (https://phabricator.wikimedia.org/T399005)
[08:27:19] (PS2) Brouberol: kafka-jumbo: enable ingress traffic from cumin masters [puppet] - https://gerrit.wikimedia.org/r/1167533 (https://phabricator.wikimedia.org/T399005)
[08:27:31] slyngshede@cumin1002 python-code (PID 3995723) is awaiting input
[08:27:53] (CR) Marostegui: Add parsercache pooling/depooling cookbook (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: Federico Ceratto)
[08:28:19] (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6203/co" [puppet] - https://gerrit.wikimedia.org/r/1167533 (https://phabricator.wikimedia.org/T399005) (owner: Brouberol)
[08:28:48] !log slyngshede@cumin1002 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.11 to production - slyngshede@cumin1002
[08:29:12] !log slyngshede@cumin1003 START - Cookbook sre.deploy.python-code netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.11 to production - slyngshede@cumin1003
[08:33:26] slyngshede@cumin1003 python-code (PID
954055) is awaiting input
[08:34:30] !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cephosd2001.codfw.wmnet
[08:35:16] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cephosd2001.codfw.wmnet
[08:38:06] (CR) Btullis: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1167286 (https://phabricator.wikimedia.org/T399013) (owner: Xcollazo)
[08:40:27] (CR) Marostegui: Add parsercache pooling/depooling cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: Federico Ceratto)
[08:40:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[08:42:34] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.11 to production - slyngshede@cumin1003
[08:46:14] SRE, Observability-Alerting: librenms page didn't auto-resolve in VO - https://phabricator.wikimedia.org/T263423#10987361 (fgiunchedi) Open→Invalid I don't think we've seen a recorrence of this problem, and we fixed the host-related recoveries in {T264016}
[08:46:52] SRE, Observability-Alerting: Two close pages for idle workers api + appserver didn't auto-resolve on recovery - https://phabricator.wikimedia.org/T266570#10987365 (fgiunchedi) Open→Invalid Related tasks have been resolved, resolving this one too
[08:47:39] SRE, Observability-Alerting: Better abstractions for puppet & icinga/nagios/shinken - https://phabricator.wikimedia.org/T85624#10987367 (fgiunchedi) Open→Declined I'm boldly declining this task as part of the icinga/am migration
[08:50:39] RESOLVED: [2x] TransitBGPDown:
Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:55:27] 06SRE, 06SRE-OnFire, 10Observability-Alerting, 07I18n: Internationalization (i18n) & localization (l10n) of www.wikimediastatus.net - https://phabricator.wikimedia.org/T305896#10987381 (10fgiunchedi) 05Open→03Declined It looks like this is technically possible, though we'll need a subscription to L... [08:56:38] 06SRE, 10Icinga, 10observability, 10Observability-Alerting: icinga login case mismatch - https://phabricator.wikimedia.org/T275920#10987388 (10fgiunchedi) 05Open→03Declined Given that Icinga is on its way out I'm boldly declining the task [08:57:36] 06SRE, 10Observability-Alerting: Icinga meta monitoring pages during icinga host reboots - https://phabricator.wikimedia.org/T274662#10987392 (10fgiunchedi) 05Open→03Declined We are reworking metamonitoring to use alertmanager/prometheus instead, and icinga is on its way out thus declining the task [08:58:06] (03PS2) 10Elukey: profile::docker::reporter: move to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) [08:58:48] 06SRE, 10Icinga, 10observability, 10Observability-Alerting: Icinga notifications didn't get applied after a puppet run - https://phabricator.wikimedia.org/T251407#10987401 (10fgiunchedi) 05Open→03Invalid Puppet now runs every 5m on the alert hosts, and AFAIK we haven't seen a reoccurence of this? R... 
[09:01:41] !log Upgrade completed Netbox v4.0.11 T397300 [09:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:44] T397300: Upgrade Netbox to version 4.0.11 - https://phabricator.wikimedia.org/T397300 [09:03:46] (03PS3) 10Elukey: profile::docker::reporter: move to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) [09:07:00] (03PS1) 10Btullis: Ceph: configure the ceph::osd::excluded_slots per cluster [puppet] - 10https://gerrit.wikimedia.org/r/1167542 (https://phabricator.wikimedia.org/T374923) [09:07:49] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10987432 (10Marostegui) 05Resolved→03Open There's something wrong with these hosts RAID's ` root@es1048:~# pvs PV VG Fmt Attr PSize PFree /dev/sda3 tank lvm2 a-... [09:08:17] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1167542 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [09:08:43] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10987435 (10Marostegui) Both hosts, es1047 and es1048 are showing the same issue. 
[09:10:45] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6205/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [09:11:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver1001.eqiad.wmnet [09:13:37] (03PS1) 10Brouberol: deployment_server: group chown all airflow kubeconfig files to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1167543 (https://phabricator.wikimedia.org/T399066) [09:14:02] (03PS2) 10Brouberol: deployment_server: group chown all airflow kubeconfig files to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1167543 (https://phabricator.wikimedia.org/T399066) [09:15:17] (03CR) 10Elukey: "@rcoccioli@wikimedia.org I added a couple of changes, I realized that the Exec command was wrong :( We need something like:" [puppet] - 10https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [09:15:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver1001.eqiad.wmnet [09:16:16] (03PS2) 10Btullis: Ceph: configure the ceph::osd::excluded_slots per cluster [puppet] - 10https://gerrit.wikimedia.org/r/1167542 (https://phabricator.wikimedia.org/T374923) [09:16:57] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6206/co" [puppet] - 10https://gerrit.wikimedia.org/r/1167543 (https://phabricator.wikimedia.org/T399066) (owner: 10Brouberol) [09:17:11] (03CR) 10Btullis: [C:03+1] deployment_server: group chown all airflow kubeconfig files to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1167543 (https://phabricator.wikimedia.org/T399066) (owner: 10Brouberol) [09:17:33] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 1 NOOP 
1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6207/" [puppet] - 10https://gerrit.wikimedia.org/r/1167542 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [09:17:41] (03CR) 10Btullis: [C:03+1] kafka-jumbo: enable ingress traffic from cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/1167533 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [09:17:44] (03PS1) 10Slyngshede: data.yaml: Offboarding chuckonwumelu [puppet] - 10https://gerrit.wikimedia.org/r/1167546 [09:18:19] (03CR) 10Brouberol: [V:03+1 C:03+2] kafka-jumbo: enable ingress traffic from cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/1167533 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [09:18:45] (03CR) 10Brouberol: [V:03+1 C:03+2] deployment_server: group chown all airflow kubeconfig files to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1167543 (https://phabricator.wikimedia.org/T399066) (owner: 10Brouberol) [09:19:36] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:19:40] (03CR) 10Brouberol: [C:03+1] Ceph: configure the ceph::osd::excluded_slots per cluster [puppet] - 10https://gerrit.wikimedia.org/r/1167542 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [09:19:54] (03PS3) 10Btullis: Ceph: configure the ceph::osd::excluded_slots per cluster [puppet] - 10https://gerrit.wikimedia.org/r/1167542 (https://phabricator.wikimedia.org/T374923) [09:20:30] (03PS1) 10Clément Goubert: PS.php: Disable secondary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167548 (https://phabricator.wikimedia.org/T395240) [09:20:57] (03CR) 10Btullis: [C:03+2] Ceph: configure the ceph::osd::excluded_slots per cluster [puppet] - 10https://gerrit.wikimedia.org/r/1167542 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [09:21:13] (03CR) 
10Btullis: [V:03+1 C:03+2] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6208/" [puppet] - 10https://gerrit.wikimedia.org/r/1167542 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [09:21:26] (03PS2) 10Clément Goubert: PS.php: Disable secondary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167548 (https://phabricator.wikimedia.org/T395240) [09:21:35] (03PS3) 10Clément Goubert: PS.php: Disable secondary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167548 (https://phabricator.wikimedia.org/T395240) [09:21:57] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:23:11] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host cephosd2001.codfw.wmnet [09:27:06] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#10987462 (10ABran-WMF) >>! In T372804#10985583, @Dzahn wrote: > "determine if this is resolved once it's a warm standby host or if we switch production to... 
[09:29:00] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10987465 (10MoritzMuehlenhoff) [09:29:16] (03CR) 10Volans: [C:03+1] "Ack" [puppet] - 10https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [09:29:39] (03PS1) 10Clément Goubert: PS.php: Disable primary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167549 (https://phabricator.wikimedia.org/T395240) [09:29:40] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10987466 (10MoritzMuehlenhoff) [09:29:40] (03PS1) 10Clément Goubert: PS.php: Restore poolcounter config post-reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167550 (https://phabricator.wikimedia.org/T395240) [09:29:56] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10987467 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done [09:30:34] (03CR) 10CI reject: [V:04-1] PS.php: Disable primary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167549 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [09:34:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10987482 (10elukey) I checked the BIOS configs via Redfish and they are different from what we expect, the cookbook fails since we expect `BootModeSelect` to be present... [09:35:16] (03CR) 10Arnaudb: [C:03+2] "thanks for the review!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167226 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [09:35:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10987484 (10klausman) >>! 
In T393948#10985332, @Jclark-ctr wrote: > @klausman Will this be legacy or uefi? it is reachable We don't have a particular preference for... [09:39:28] (03PS1) 10David Caro: aptly: add arm64 arch support [puppet] - 10https://gerrit.wikimedia.org/r/1167551 (https://phabricator.wikimedia.org/T398016) [09:41:29] (03Merged) 10jenkins-bot: gerrit: standardize expected rc on systemctl check [cookbooks] - 10https://gerrit.wikimedia.org/r/1167226 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [09:41:30] (03CR) 10Majavah: [C:03+1] aptly: add arm64 arch support [puppet] - 10https://gerrit.wikimedia.org/r/1167551 (https://phabricator.wikimedia.org/T398016) (owner: 10David Caro) [09:43:07] 10SRE-tools, 06Infrastructure-Foundations, 10SRE Observability (FY2025/2026-Q1): More frequent Puppet runs on the alert hosts? - https://phabricator.wikimedia.org/T398444#10987493 (10Volans) I wonder if the prometheus servers have a similar behavior of applying changes from puppet exported resources. FYI th... 
[09:44:25] (03PS1) 10Tiziano Fogli: pdb_resource_exporter: add unaudited tasks query [puppet] - 10https://gerrit.wikimedia.org/r/1167554 (https://phabricator.wikimedia.org/T395442) [09:45:34] (03PS1) 10Fabfur: varnish: remove X-Known-Client netmapper [puppet] - 10https://gerrit.wikimedia.org/r/1167555 (https://phabricator.wikimedia.org/T396621) [09:45:35] !log installing Zookeeper security updates on zk-flink [09:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:12] (03CR) 10David Caro: [C:03+2] aptly: add arm64 arch support [puppet] - 10https://gerrit.wikimedia.org/r/1167551 (https://phabricator.wikimedia.org/T398016) (owner: 10David Caro) [09:48:20] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp[4037,4045].ulsfo.wmnet} and A:cp - 2.8.15 upgrade (T398720) [09:48:23] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [09:50:13] (03CR) 10Filippo Giunchedi: [C:03+1] pdb_resource_exporter: add unaudited tasks query [puppet] - 10https://gerrit.wikimedia.org/r/1167554 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:56:40] !log elukey@cumin1003 START - Cookbook 
sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:57:10] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167557 [09:58:00] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167557 (owner: 10PipelineBot) [09:58:05] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp[4037,4045].ulsfo.wmnet} and A:cp - 2.8.15 upgrade (T398720) [09:58:08] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [09:59:11] (03CR) 10Tiziano Fogli: [C:03+2] pdb_resource_exporter: add unaudited tasks query [puppet] - 10https://gerrit.wikimedia.org/r/1167554 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [09:59:57] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167557 (owner: 10PipelineBot) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1000) [10:03:05] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:03:11] (03CR) 10Clément Goubert: [C:03+2] mwaint: Remove from scap [puppet] - 10https://gerrit.wikimedia.org/r/1167196 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [10:03:28] (03PS2) 10Clément Goubert: mwaint: Remove from scap [puppet] - 10https://gerrit.wikimedia.org/r/1167196 (https://phabricator.wikimedia.org/T397017) [10:04:06] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:04:26] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:05:15] (03CR) 10Clément Goubert: [C:03+2] mwaint: Remove from scap [puppet] - 
10https://gerrit.wikimedia.org/r/1167196 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [10:05:36] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10987531 (10MoritzMuehlenhoff) [10:05:58] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10987532 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done [10:06:27] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [10:06:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [10:06:47] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-ulsfo and not P{cp[4037,4045].ulsfo.wmnet} and A:cp - 2.8.15 upgrade (T398720) [10:06:50] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [10:11:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [10:13:25] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cephosd2001.codfw.wmnet [10:13:59] !log btullis@cumin1003 START - 
Cookbook sre.hosts.reboot-single for host cephosd2001.codfw.wmnet [10:14:25] !log Cutting off access to mwmaint servers - T397017 [10:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:28] T397017: Turn down mwmaint production servers - https://phabricator.wikimedia.org/T397017 [10:14:33] (03CR) 10Clément Goubert: [C:03+2] mwmaint: deprecate mwmaint servers [puppet] - 10https://gerrit.wikimedia.org/r/1167197 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [10:14:41] (03PS4) 10Clément Goubert: mwmaint: deprecate mwmaint servers [puppet] - 10https://gerrit.wikimedia.org/r/1167197 (https://phabricator.wikimedia.org/T397017) [10:15:57] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167546 (owner: 10Slyngshede) [10:15:58] (03CR) 10Clément Goubert: [C:03+2] mwmaint: deprecate mwmaint servers [puppet] - 10https://gerrit.wikimedia.org/r/1167197 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [10:18:59] (03PS1) 10FNegri: offboard-user: remove WMCS-related LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1167561 (https://phabricator.wikimedia.org/T398215) [10:19:27] (03CR) 10Ladsgroup: Catalog newsletter tables (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1167252 (https://phabricator.wikimedia.org/T398941) (owner: 10Pppery) [10:19:32] (03PS5) 10Pppery: Catalog newsletter tables [puppet] - 10https://gerrit.wikimedia.org/r/1167252 (https://phabricator.wikimedia.org/T398941) [10:19:33] (03CR) 10Ladsgroup: [C:03+2] Catalog newsletter tables [puppet] - 10https://gerrit.wikimedia.org/r/1167252 (https://phabricator.wikimedia.org/T398941) (owner: 10Pppery) [10:19:35] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Catalog newsletter tables [puppet] - 10https://gerrit.wikimedia.org/r/1167252 (https://phabricator.wikimedia.org/T398941) (owner: 10Pppery) [10:19:42] (03PS3) 10Fabfur: cache::haproxy: Use a separate site for port 80 [puppet] - 
10https://gerrit.wikimedia.org/r/1167530 [10:20:24] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk2001.codfw.wmnet [10:20:42] (03Abandoned) 10FNegri: offboard-user: remove WMCS-related LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1167561 (https://phabricator.wikimedia.org/T398215) (owner: 10FNegri) [10:20:51] (03PS1) 10Cathal Mooney: WMF Plugin: do not process disabled ports for block speed setting [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1167564 (https://phabricator.wikimedia.org/T394333) [10:24:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk2001.codfw.wmnet [10:25:32] (03PS1) 10AikoChou: ml-services: update edit-check image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167565 (https://phabricator.wikimedia.org/T397013) [10:29:03] (03CR) 10Clément Goubert: [C:03+2] Revert "mw-cron: Disable memory limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167177 (owner: 10Clément Goubert) [10:29:56] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10987593 (10Jhancock.wm) @elukey the only settings i can change on this boss card is to create and delete the virtual disk. I didn't see any other settings. 
[10:30:32] (03Merged) 10jenkins-bot: Revert "mw-cron: Disable memory limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167177 (owner: 10Clément Goubert) [10:30:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd2001.codfw.wmnet [10:33:17] FIRING: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:34:06] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10987612 (10Jhancock.wm) @Marostegui it does have a hardware raid. Feel free to change it and reimage it to your liking. [10:34:20] (03CR) 10Slyngshede: [C:03+2] data.yaml: Offboarding chuckonwumelu [puppet] - 10https://gerrit.wikimedia.org/r/1167546 (owner: 10Slyngshede) [10:36:20] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [10:37:04] !log Restoring memory limits on mw-cron - T395436 - T395465 [10:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:09] T395436: Limit CPU usage for mw-on-k8s cli deployments - https://phabricator.wikimedia.org/T395436 [10:37:09] T395465: Investigate EQIAD daily completion suggester rebuild failure - https://phabricator.wikimedia.org/T395465 [10:38:35] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [10:38:42] !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cephosd1001.eqiad.wmnet [10:39:14] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cephosd1001.eqiad.wmnet [10:39:30] (03CR) 10Vgutierrez: [C:04-1] cache::haproxy: Use a separate site for port 80 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167530 (owner: 10Fabfur) [10:42:07] PROBLEM - Host cephosd2001 is DOWN: PING CRITICAL - Packet loss = 100% [10:44:00] (03CR) 10Vgutierrez: pyrra: remove multi-dc for istio-based SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [10:44:16] (03CR) 10Hnowlan: pcs: Use purge only requests for staging mobile-html transcludes (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (owner: 10Jgiannelos) [10:45:28] (03PS1) 10Zabe: Fix categorylinks read new code for excluding categories [extensions/intersection] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167569 (https://phabricator.wikimedia.org/T398861) [10:45:39] (03PS1) 10Zabe: Fix categorylinks read new code for excluding categories [extensions/intersection] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167570 (https://phabricator.wikimedia.org/T398861) [10:45:48] (03CR) 10Vgutierrez: "even with this refactor we are looking at a class with 520 lines already, could it make sense to split this per service?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [10:47:13] (03PS4) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 [10:47:34] (03CR) 10Hnowlan: [C:03+1] PS.php: Disable secondary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167548 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [10:47:48] (03PS5) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 [10:48:14] (03CR) 10Hnowlan: [C:03+1] PS.php: Disable primary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167549 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [10:48:38] (03CR) 10Hnowlan: [C:03+1] PS.php: Restore poolcounter config post-reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167550 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [10:49:43] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:50:24] (03CR) 10Elukey: [C:03+2] admin_ng: update knative's queue proxy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165850 (owner: 10Elukey) [10:50:24] (03CR) 10Kevin Bazira: [C:03+1] ml-services: update edit-check image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167565 (https://phabricator.wikimedia.org/T397013) (owner: 10AikoChou) [10:51:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167548 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [10:51:20] (03PS1) 10Aqu: data-engineering: Refine switch over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) 
[10:51:37] !log elukey@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:51:38] (03CR) 10Fabfur: cache::haproxy: Use a separate site for port 80 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167530 (owner: 10Fabfur) [10:51:57] !log elukey@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:52:08] (03Merged) 10jenkins-bot: PS.php: Disable secondary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167548 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [10:52:31] !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1167548|PS.php: Disable secondary poolcounters for reboot (T395240)]] [10:52:53] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@3a0cdd4]: bump image suggestions to v1.8.0 [10:53:17] 10SRE-tools, 06Data-Platform-SRE, 10Spicerack: Proposal: adding a kafka admin client to spicerack - https://phabricator.wikimedia.org/T399069#10987669 (10brouberol) [10:53:30] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@3a0cdd4]: bump image suggestions to v1.8.0 (duration: 00m 48s) [10:53:37] (03PS4) 10Fabfur: cache::haproxy: Use a separate site for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1167530 [10:53:46] (03PS1) 10Zabe: ApiQueryCategoryMembers: Try stop forcing index in read new code [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167573 (https://phabricator.wikimedia.org/T399037) [10:53:51] RECOVERY - Host cephosd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms [10:53:56] (03PS1) 10Zabe: ApiQueryCategoryMembers: Try stop forcing index in read new code [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167574 (https://phabricator.wikimedia.org/T399037) [10:54:31] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167530 (owner: 10Fabfur) [10:54:48] !log cgoubert@deploy1003 cgoubert: Backport for [[gerrit:1167548|PS.php: Disable 
secondary poolcounters for reboot (T395240)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:54:56] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:55:35] PROBLEM - Bird Internet Routing Daemon on cephosd2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:55:36] !log cgoubert@deploy1003 cgoubert: Continuing with sync [10:56:35] RECOVERY - Bird Internet Routing Daemon on cephosd2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:57:28] (03PS1) 10Majavah: dynamicproxy: Allow normal users to delete deprecated proxies [puppet] - 10https://gerrit.wikimedia.org/r/1167575 [10:58:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10987673 (10elukey) Note for myself: ` BIOS - Found a NIC device: P1_AIOMAOC_AG_i2LAN1OPROM Set PXE to the NIC P1_AIOMAOC_AG_i2LAN1OPROM BIOS: P1_AIOMAOC... [10:59:01] (03CR) 10Fabfur: cache::haproxy: Use a separate site for port 80 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167530 (owner: 10Fabfur) [10:59:21] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm [11:00:05] mvolz: May I have your attention please! Services – Citoid / Zotero. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1100) [11:01:04] (03PS1) 10Ladsgroup: mariadb: Remove tables that are not cataloged from filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/1167576 (https://phabricator.wikimedia.org/T398946) [11:01:34] (03CR) 10LD: "@dreamyjazzwikipedia@gmail.com As noted in CorePermissions, using 'ukwiki' (without the +) might fully override the default configuration " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166499 (https://phabricator.wikimedia.org/T398738) (owner: 10Dreamrimmer) [11:02:02] !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167548|PS.php: Disable secondary poolcounters for reboot (T395240)]] (duration: 09m 30s) [11:02:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:03:50] (03PS6) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:04:11] (03CR) 10Vgutierrez: [C:03+1] cache::haproxy: Use a separate site for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1167530 (owner: 10Fabfur) [11:04:38] (03CR) 10Ladsgroup: mariadb: Remove tables that are not cataloged from filtered_tables.txt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167576 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup) [11:04:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10987687 (10elukey) I tried to reimage after two run of provision with uefi, and this is what I get: ` ┌────────────────────┤ [!!] Configure the network... 
[11:05:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987688 (10cmooney) >>! In T394333#10964951, @Andrew wrote: > That should be possible as long as I can get support with refactoring... [11:05:13] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti1053.eqiad.wmnet with OS bookworm [11:05:24] (03PS7) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:06:08] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host poolcounter2006.codfw.wmnet [11:07:04] 10SRE-tools, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10Spicerack: Proposal: adding a kafka admin client to spicerack - https://phabricator.wikimedia.org/T399069#10987690 (10brouberol) To illustrate the proposal, this is one of many things you can do with an admin client: `lang=python >>> from ka... [11:07:42] (03CR) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:08:32] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10987693 (10Marostegui) >>! In T393042#10987612, @Jhancock.wm wrote: > @Marostegui it does have a hardware raid. Feel free to change it and reimage it to your liking. Would you be... 
[11:08:35] (03PS8) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:09:37] !log disable puppet on A:cp to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167530 [11:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:54] (03PS9) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:09:55] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2006.codfw.wmnet [11:10:14] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host poolcounter1007.eqiad.wmnet [11:11:32] (03PS10) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:11:52] (03CR) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:12:07] (03PS5) 10Fabfur: cache::haproxy: Use a separate site for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1167530 (https://phabricator.wikimedia.org/T399071) [11:13:25] (03CR) 10LD: "As I can't edit the patch, I suggested this here: https://phabricator.wikimedia.org/F63617907" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166499 (https://phabricator.wikimedia.org/T398738) (owner: 10Dreamrimmer) [11:13:58] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter1007.eqiad.wmnet [11:14:49] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=1) rolling upgrade of HAProxy on A:cp-ulsfo and not P{cp[4037,4045].ulsfo.wmnet} 
and A:cp - 2.8.15 upgrade (T398720) [11:14:52] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [11:15:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167549 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [11:15:47] (03CR) 10Fabfur: [C:03+2] cache::haproxy: Use a separate site for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1167530 (https://phabricator.wikimedia.org/T399071) (owner: 10Fabfur) [11:17:14] (03CR) 10CI reject: [V:04-1] PS.php: Disable primary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167549 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [11:18:14] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10987728 (10Marostegui) @Jhancock.wm this host isn't accessible, so I cannot even do anything with it. Do you think, if I provide you with a hostname we can go ahead and "treat it li... [11:18:28] (03CR) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:18:46] (03PS2) 10Clément Goubert: PS.php: Disable primary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167549 (https://phabricator.wikimedia.org/T395240) [11:18:53] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10987729 (10Marostegui) >>! In T393042#10987728, @Marostegui wrote: > @Jhancock.wm this host isn't accessible, so I cannot even do anything with it. Do you think, if I provide you wi... 
[11:19:20] (03PS1) 10Fabfur: cache::haproxy: rename http frontend [puppet] - 10https://gerrit.wikimedia.org/r/1167578 (https://phabricator.wikimedia.org/T399071) [11:20:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167549 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [11:20:44] (03PS11) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:20:52] (03PS8) 10Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) [11:20:55] (03Merged) 10jenkins-bot: PS.php: Disable primary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167549 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [11:21:21] !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1167549|PS.php: Disable primary poolcounters for reboot (T395240)]] [11:22:23] (03PS12) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:23:28] (03CR) 10Marostegui: Add parsercache pooling/depooling cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto) [11:23:29] !log cgoubert@deploy1003 cgoubert: Backport for [[gerrit:1167549|PS.php: Disable primary poolcounters for reboot (T395240)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:23:44] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [11:24:16] !log cgoubert@deploy1003 cgoubert: Continuing with sync [11:26:32] (03PS13) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:28:08] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [11:28:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:29:40] !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167549|PS.php: Disable primary poolcounters for reboot (T395240)]] (duration: 08m 19s) [11:31:03] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp[4052].ulsfo.wmnet} and A:cp - 2.8.15 upgrade (T398720) [11:31:06] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [11:31:42] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host poolcounter2005.codfw.wmnet [11:32:07] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [11:32:08] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=1) rolling upgrade of HAProxy on P{cp[4052].ulsfo.wmnet} and A:cp - 2.8.15 upgrade (T398720) [11:33:12] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk2002.codfw.wmnet [11:33:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'depool pc2011', diff saved to https://phabricator.wikimedia.org/P78821 and previous config saved to /var/cache/conftool/dbconfig/20250709-113322-marostegui.json [11:33:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:33:41] (03PS1) 10Btullis: ceph::osds - Use sdparm instead of hdparm to disable the write cache [puppet] - 10https://gerrit.wikimedia.org/r/1167585 (https://phabricator.wikimedia.org/T374923) [11:34:16] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'pool pc2011', diff saved to https://phabricator.wikimedia.org/P78823 and previous config saved to /var/cache/conftool/dbconfig/20250709-113413-marostegui.json [11:34:41] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host poolcounter1006.eqiad.wmnet [11:34:48] (03PS1) 10Fabfur: cache::haproxy: rename backend httpreqrate too [puppet] - 10https://gerrit.wikimedia.org/r/1167586 (https://phabricator.wikimedia.org/T399071) [11:35:04] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6210/co" [puppet] - 10https://gerrit.wikimedia.org/r/1167585 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [11:35:38] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2005.codfw.wmnet [11:35:48] (03PS1) 10Michael Große: Growth: Enable limiting Add Link for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167587 (https://phabricator.wikimedia.org/T396382) [11:35:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167587 (https://phabricator.wikimedia.org/T396382) (owner: 10Michael Große) [11:37:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk2002.codfw.wmnet [11:37:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'depool pc1', diff saved to https://phabricator.wikimedia.org/P78824 and previous config saved to /var/cache/conftool/dbconfig/20250709-113717-marostegui.json [11:38:03] (03PS2) 10Fabfur: cache::haproxy: rename backend httpreqrate too [puppet] - 10https://gerrit.wikimedia.org/r/1167586 (https://phabricator.wikimedia.org/T399071) [11:38:32] !log marostegui@cumin1002 dbctl commit (dc=all): 
'pool pc1', diff saved to https://phabricator.wikimedia.org/P78826 and previous config saved to /var/cache/conftool/dbconfig/20250709-113831-marostegui.json [11:38:36] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter1006.eqiad.wmnet [11:39:04] (03PS3) 10Clément Goubert: PS.php: Restore poolcounter config post-reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167550 (https://phabricator.wikimedia.org/T395240) [11:39:36] (03PS9) 10Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) [11:40:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167550 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [11:40:43] (03CR) 10Aqu: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [11:40:51] (03Merged) 10jenkins-bot: PS.php: Restore poolcounter config post-reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167550 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [11:41:03] (03CR) 10Vgutierrez: [C:03+1] cache::haproxy: rename backend httpreqrate too [puppet] - 10https://gerrit.wikimedia.org/r/1167586 (https://phabricator.wikimedia.org/T399071) (owner: 10Fabfur) [11:41:17] !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1167550|PS.php: Restore poolcounter config post-reboot (T395240)]] [11:41:33] (03CR) 10Fabfur: [C:03+2] cache::haproxy: rename backend httpreqrate too [puppet] - 10https://gerrit.wikimedia.org/r/1167586 (https://phabricator.wikimedia.org/T399071) (owner: 10Fabfur) [11:41:41] !log cmooney@cumin2002 START - Cookbook 
sre.network.configure-switch-interfaces for host cloudvirt1073 [11:42:00] !log cmooney@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1073 [11:43:24] !log cgoubert@deploy1003 cgoubert: Backport for [[gerrit:1167550|PS.php: Restore poolcounter config post-reboot (T395240)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:43:26] (03CR) 10Hnowlan: [C:04-1] pcs: Use purge only requests for mobile-html transcludes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:43:50] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk2003.codfw.wmnet [11:44:21] !log cgoubert@deploy1003 cgoubert: Continuing with sync [11:45:14] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [11:47:06] (03PS14) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:47:20] (03CR) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:48:00] (03CR) 10Hnowlan: [C:03+1] pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:48:29] !log puppet enabled again on A:cp (T399071) [11:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:33] T399071: Split haproxy configuration in different files - https://phabricator.wikimedia.org/T399071 [11:48:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk2003.codfw.wmnet 
[11:49:01] (03CR) 10Brouberol: [C:03+1] ceph::osds - Use sdparm instead of hdparm to disable the write cache [puppet] - 10https://gerrit.wikimedia.org/r/1167585 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [11:49:12] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:49:41] (03PS2) 10Fabfur: cache::haproxy: rename http frontend and backend to pristine name [puppet] - 10https://gerrit.wikimedia.org/r/1167578 (https://phabricator.wikimedia.org/T399071) [11:49:56] !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167550|PS.php: Restore poolcounter config post-reboot (T395240)]] (duration: 08m 39s) [11:50:43] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:50:43] (03CR) 10Vgutierrez: [C:03+1] cache::haproxy: rename http frontend and backend to pristine name [puppet] - 10https://gerrit.wikimedia.org/r/1167578 (https://phabricator.wikimedia.org/T399071) (owner: 10Fabfur) [11:50:52] (03PS10) 10Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) [11:51:03] (03CR) 10Jgiannelos: [C:03+2] pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:51:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987796 (10cmooney) 1050 and 1051 are now connected and ports up too. ` cmooney@cloudsw1-f4-eqiad> show interfaces descriptions | ma... 
[11:51:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:51:57] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [11:52:44] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [11:52:51] (03Merged) 10jenkins-bot: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:52:57] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [11:53:39] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [11:54:02] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [11:54:50] (03PS1) 10David Caro: toolforge: install misctools as any other toolforge package [puppet] - 10https://gerrit.wikimedia.org/r/1167590 [11:55:06] (03CR) 10Btullis: [V:03+1 C:03+2] ceph::osds - Use sdparm instead of hdparm to disable the write cache [puppet] - 10https://gerrit.wikimedia.org/r/1167585 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [11:55:44] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [11:55:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:55:59] (03PS2) 10David Caro: toolforge: install misctools as any other toolforge package [puppet] - 10https://gerrit.wikimedia.org/r/1167590 [11:56:04] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [11:56:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:56:49] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [11:57:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:57:14] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox 
hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns cloudcephosd1048,49 - jclark@cumin1002" [11:57:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns cloudcephosd1048,49 - jclark@cumin1002" [11:57:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:57:52] jelto@cumin1003 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. [11:57:59] (03PS1) 10Mhorsey: Add new script to update old freetext country data new schema [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167592 (https://phabricator.wikimedia.org/T397270) [11:58:24] (03CR) 10Fabfur: [C:03+2] cache::haproxy: rename http frontend and backend to pristine name [puppet] - 10https://gerrit.wikimedia.org/r/1167578 (https://phabricator.wikimedia.org/T399071) (owner: 10Fabfur) [11:58:36] heads up, i am deploying some changes in changeprop [11:59:35] (03CR) 10Sergio Gimeno: [C:03+1] Growth: Enable limiting Add Link for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167587 (https://phabricator.wikimedia.org/T396382) (owner: 10Michael Große) [12:00:36] (03CR) 10David Caro: "Allowed puppet to continue running in tools (expected warning message I think):" [puppet] - 10https://gerrit.wikimedia.org/r/1167590 (owner: 10David Caro) [12:00:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167592 (https://phabricator.wikimedia.org/T397270) (owner: 10Mhorsey) [12:01:53] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [12:02:17] !log brouberol@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad 
[12:02:26] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [12:02:53] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [12:03:59] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [12:05:33] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:05:44] (03CR) 10Marostegui: Add parsercache pooling/depooling cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto) [12:06:35] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 116757 bytes in 1.455 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:07:02] (03CR) 10Majavah: [C:03+1] "I'm fine with removing the `ensure => latest` bit, but also wonder whether this definition should stay in `profile::toolforge::bastion` or" [puppet] - 10https://gerrit.wikimedia.org/r/1167590 (owner: 10David Caro) [12:07:24] !log brouberol@cumin1003 END (FAIL) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=1) rolling restart_daemons on A:kafka-jumbo-eqiad [12:07:25] !log brouberol@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad [12:08:12] !log installing nginx security updates [12:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:20] !log brouberol@cumin1003 END (ERROR) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=97) rolling restart_daemons on A:kafka-jumbo-eqiad [12:09:19] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade Replica to GitLab 18.0 [12:09:33] ^ these are logged by 
test-cookbook, which performs no action [12:11:41] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:11:41] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:11:45] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:12:25] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:14:20] !log installing openjdk-17 security updates [12:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:15:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:16:15] (03PS3) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) [12:16:55] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye [12:17:08] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye 
[12:17:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1048.eqiad.... [12:17:18] (03PS3) 10David Caro: toolforge: install misctools as any other toolforge package [puppet] - 10https://gerrit.wikimedia.org/r/1167590 [12:17:19] (03PS1) 10David Caro: toolforge: skip toolforge clis from unattended upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1167594 [12:17:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987866 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1049.eqiad.... [12:18:00] (03CR) 10CI reject: [V:04-1] toolforge: skip toolforge clis from unattended upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1167594 (owner: 10David Caro) [12:19:06] (03PS2) 10David Caro: toolforge: skip toolforge clis from unattended upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1167594 [12:21:18] (03CR) 10CI reject: [V:04-1] toolforge: skip toolforge clis from unattended upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1167594 (owner: 10David Caro) [12:22:15] (03CR) 10CI reject: [V:04-1] kafka.roll-restart-reboot-broker: perform action on controller last [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:25:52] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp[4052].ulsfo.wmnet} and A:cp - 2.8.15 upgrade (T398720) [12:25:55] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [12:27:22] (03CR) 10David Caro: "It's a package part of toolforge, that installs tools that you need 
for toolforge, I would even be tempted to rename the package `toolforg" [puppet] - 10https://gerrit.wikimedia.org/r/1167590 (owner: 10David Caro) [12:30:19] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp[4052].ulsfo.wmnet} and A:cp - 2.8.15 upgrade (T398720) [12:30:26] (03CR) 10Majavah: "Yeah, the difference being is that misctools is specific to operations that need to happen on the bastion (`take` for example is specifica" [puppet] - 10https://gerrit.wikimedia.org/r/1167590 (owner: 10David Caro) [12:30:45] (03CR) 10Majavah: [C:04-1] "As explained on -cloud-admin, I do not think this is a good idea." [puppet] - 10https://gerrit.wikimedia.org/r/1167594 (owner: 10David Caro) [12:33:56] (03PS4) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) [12:34:18] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage [12:34:28] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage [12:36:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:36:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:37:27] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp - 2.8.15 upgrade (T398720) [12:37:30] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [12:37:57] (03PS1) 10Hashar: Add readonly pugin [software/gerrit] 
(deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1167605 (https://phabricator.wikimedia.org/T387833) [12:38:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage [12:38:36] (03CR) 10Volans: kafka.roll-restart-reboot-broker: perform action on controller last (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:39:15] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp - 2.8.15 upgrade (T398720) [12:39:17] (03PS2) 10Aqu: data-engineering: Refine switch over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) [12:39:17] (03CR) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:39:23] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1050 [12:39:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1050 [12:39:33] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1051 [12:39:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1051 [12:40:00] (03CR) 10Hashar: [C:03+2] Add readonly pugin [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1167605 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [12:40:16] (03PS3) 10David Caro: toolforge: skip toolforge clis from unattended upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1167594 [12:40:40] (03Merged) 10jenkins-bot: Add readonly pugin [software/gerrit] 
(deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1167605 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [12:41:05] (03CR) 10Aqu: "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [12:41:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage [12:41:47] (03PS5) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) [12:41:55] (03CR) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:42:38] (03CR) 10Xcollazo: "Thanks Ben. Could you please +2 when you have a min?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1167286 (https://phabricator.wikimedia.org/T399013) (owner: 10Xcollazo) [12:44:01] (03CR) 10David Caro: [C:03+2] toolforge: install misctools as any other toolforge package [puppet] - 10https://gerrit.wikimedia.org/r/1167590 (owner: 10David Caro) [12:44:19] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1051 [12:44:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1051 [12:47:29] (03CR) 10Daimona Eaytoy: [C:03+1] Add new script to update old freetext country data new schema [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167592 (https://phabricator.wikimedia.org/T397270) (owner: 10Mhorsey) [12:48:20] (03CR) 10Aqu: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [12:48:33] !log hashar@deploy1003 Started deploy [gerrit/gerrit@9666238]: Add readonly plugin - T387833 [12:48:39] T387833: Gerrit failover process - https://phabricator.wikimedia.org/T387833 [12:48:44] !log hashar@deploy1003 Finished deploy [gerrit/gerrit@9666238]: Add readonly plugin - T387833 (duration: 00m 11s) [12:48:45] jclark@cumin1002 reimage (PID 149651) is awaiting input [12:49:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987962 (10Jclark-ctr) @elukey i am having issues with 2 servers both fail to reimage after switching to 25g dac . cloudcephosd... 
[12:49:14] (03CR) 10Jgiannelos: [C:03+1] "Needs rebase and bump in version but lets try this" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167249 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [12:50:35] !log hashar@deploy1003 Started deploy [gerrit/gerrit@9666238]: Add readonly plugin - T387833 [12:50:46] !log hashar@deploy1003 Finished deploy [gerrit/gerrit@9666238]: Add readonly plugin - T387833 (duration: 00m 10s) [12:53:40] (03CR) 10Btullis: [C:03+1] kafka.roll-restart-reboot-broker: perform action on controller last [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:54:11] !log installing jetty9 security updates [12:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:19] (03PS1) 10Brouberol: spicerack: add kafka-test-eqiad to spicerack/kafka/config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1167607 (https://phabricator.wikimedia.org/T399005) [12:54:40] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk1001.eqiad.wmnet [12:54:42] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:55:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:55:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye [12:55:18] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6214/co" [puppet] - 10https://gerrit.wikimedia.org/r/1167607 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:55:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team 
(Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987969 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmne... [12:55:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987971 (10Jclark-ctr) [12:56:17] (03CR) 10Volans: "LGTM, couple of nits and a question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:57:11] (03CR) 10Btullis: [C:03+1] spicerack: add kafka-test-eqiad to spicerack/kafka/config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1167607 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:57:15] (03CR) 10Federico Ceratto: "(replied few comments)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto) [12:57:22] (03PS1) 10KartikMistry: machinetranslation: staging: Update MinT to 2025-07-09-124154-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167608 (https://phabricator.wikimedia.org/T335491) [12:57:25] (03CR) 10Volans: [C:03+1] "This is great, thanks for adding it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1167607 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:57:59] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:58:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [12:58:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:58:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye [12:58:36] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1001.eqiad.wmnet [12:58:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987976 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1049.eqiad.wmne... [12:58:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987977 (10Jclark-ctr) [12:59:35] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye [12:59:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eqiad.... 
[13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1300). [13:00:05] MichaelG_WMF and houseofm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:15] * MichaelG_WMF is here [13:00:15] (03PS3) 10Aqu: data-engineering: Refine switch-over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) [13:00:21] o/ [13:00:58] (03CR) 10Marostegui: Add parsercache pooling/depooling cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto) [13:00:58] (03CR) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:01:19] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye [13:01:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987985 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eqiad.... 
[13:02:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:49] (03PS6) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) [13:02:51] (03CR) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:03:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:03:48] (03CR) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:04:52] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-07-02-122843 to 2025-07-08-183416 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167613 (https://phabricator.wikimedia.org/T397355) [13:04:56] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-07-02-123323 to 2025-07-09-124522 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167614 (https://phabricator.wikimedia.org/T397355) [13:04:59] (03CR) 10Muehlenhoff: [C:03+2] New structure for sshd_config starting with trixie [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:05:04] @MichaelG_WMF @HouseOfM I guess I can deploy [13:05:22] sergi0: thank you <3 [13:05:44] (03PS1) 10Vgutierrez: hiera: Deploy and enable measure cert on upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/1167616 
(https://phabricator.wikimedia.org/T394484) [13:05:54] (03CR) 10Brouberol: [V:03+1 C:03+2] spicerack: add kafka-test-eqiad to spicerack/kafka/config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1167607 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:06:08] ty @sergi0 [13:06:50] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167616 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [13:07:45] I think we can do both together [13:07:47] @HouseOfM you'll run the script after deployment? [13:08:29] Daimona will be running it, but not committing the results yet [13:08:39] ack [13:08:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy1003 using scap backport" [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167592 (https://phabricator.wikimedia.org/T397270) (owner: 10Mhorsey) [13:08:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167587 (https://phabricator.wikimedia.org/T396382) (owner: 10Michael Große) [13:09:36] !log brouberol@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [13:09:47] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for thanos-be1006.mgmt:22 - https://phabricator.wikimedia.org/T399052#10988001 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated cable and reset idrac [13:09:49] (03Merged) 10jenkins-bot: Growth: Enable limiting Add Link for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167587 (https://phabricator.wikimedia.org/T396382) (owner: 10Michael Große) [13:10:05] (03Merged) 10jenkins-bot: Add new script to update old freetext country data new schema [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167592 (https://phabricator.wikimedia.org/T397270) (owner: 10Mhorsey) [13:10:27] !log sgimeno@deploy1003 
Started scap sync-world: Backport for [[gerrit:1167592|Add new script to update old freetext country data new schema (T397270)]], [[gerrit:1167587|Growth: Enable limiting Add Link for dewiki (T396382)]] [13:10:32] T397270: Create a script to update the country schema to the new format - https://phabricator.wikimedia.org/T397270 [13:10:32] T396382: Deployment Plan: Allow limiting "Add a Link" to new editors - https://phabricator.wikimedia.org/T396382 [13:11:02] (03CR) 10Brouberol: "That seems to be working! I tested it on kafka-test-eqiad, which has the following brokers:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:11:47] !log brouberol@cumin1003 END (ERROR) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=97) rolling restart_daemons on A:kafka-test-eqiad [13:12:04] (03PS2) 10Hnowlan: changeprop: don't process File: pages for mobile html pages in PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167249 (https://phabricator.wikimedia.org/T397750) [13:12:31] !log sgimeno@deploy1003 mhorsey, sgimeno, migr: Backport for [[gerrit:1167592|Add new script to update old freetext country data new schema (T397270)]], [[gerrit:1167587|Growth: Enable limiting Add Link for dewiki (T396382)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:14:01] I can see it working with the debug extension [13:14:01] (03CR) 10Marostegui: [C:03+1] ":(" [puppet] - 10https://gerrit.wikimedia.org/r/1167576 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup) [13:14:29] (03PS2) 10Vgutierrez: hiera: Deploy and enable measure cert on upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/1167616 (https://phabricator.wikimedia.org/T394484) [13:14:43] @HouseOfM should I sync already? Or is Daimona giving a try now? 
[13:14:49] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167616 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [13:14:51] Please sync [13:14:55] alright [13:14:56] (03CR) 10Andrew Bogott: [C:03+1] dynamicproxy: Allow normal users to delete deprecated proxies [puppet] - 10https://gerrit.wikimedia.org/r/1167575 (owner: 10Majavah) [13:15:01] from my side, we would be good to move forward too [13:15:03] Yep you can go ahead, thank you! [13:15:12] !log sgimeno@deploy1003 mhorsey, sgimeno, migr: Continuing with sync [13:15:16] (03CR) 10Marostegui: "Question, for the filtered tables, we have nothing to do regarding sanitarium right, this is transparent to any of it, correct?" [puppet] - 10https://gerrit.wikimedia.org/r/1167576 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup) [13:16:33] (03PS1) 10Tiziano Fogli: pdb_resource_exporter: fix unaudited tasks query [puppet] - 10https://gerrit.wikimedia.org/r/1167619 (https://phabricator.wikimedia.org/T395442) [13:16:50] (03CR) 10Majavah: [C:03+2] dynamicproxy: Allow normal users to delete deprecated proxies [puppet] - 10https://gerrit.wikimedia.org/r/1167575 (owner: 10Majavah) [13:17:00] (03CR) 10Volans: "Nice!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:17:47] (03CR) 10Tiziano Fogli: [C:03+2] pdb_resource_exporter: fix unaudited tasks query [puppet] - 10https://gerrit.wikimedia.org/r/1167619 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [13:18:08] (03CR) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:18:13] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[13:18:45] (03CR) 10Volans: kafka.roll-restart-reboot-broker: perform action on controller last (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:19:45] (03CR) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:20:27] (03CR) 10Hnowlan: [C:03+2] changeprop: don't process File: pages for mobile html pages in PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167249 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [13:20:35] !log sgimeno@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167592|Add new script to update old freetext country data new schema (T397270)]], [[gerrit:1167587|Growth: Enable limiting Add Link for dewiki (T396382)]] (duration: 10m 07s) [13:20:40] T397270: Create a script to update the country schema to the new format - https://phabricator.wikimedia.org/T397270 [13:20:40] T396382: Deployment Plan: Allow limiting "Add a Link" to new editors - https://phabricator.wikimedia.org/T396382 [13:21:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1041', diff saved to https://phabricator.wikimedia.org/P78828 and previous config saved to /var/cache/conftool/dbconfig/20250709-132111-marostegui.json [13:21:23] changes are live! [13:21:51] (03PS7) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) [13:21:56] thanks! 
[13:21:57] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1041.eqiad.wmnet with reason: Maintenance [13:22:12] (03Merged) 10jenkins-bot: changeprop: don't process File: pages for mobile html pages in PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167249 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [13:22:32] thanks! [13:24:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.reboot-single for host es1041.eqiad.wmnet [13:25:19] (03PS9) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [13:25:33] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp - 2.8.15 upgrade (T398720) [13:25:37] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [13:26:35] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp - 2.8.15 upgrade (T398720) [13:27:06] (03CR) 10Ssingh: [C:03+1] "Verified per-site hieras." [puppet] - 10https://gerrit.wikimedia.org/r/1167616 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [13:28:11] Is that all for the backport window? If so, I'll run the script [13:29:04] (03CR) 10Volans: [C:03+1] "LGTM, thanks a lot!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:30:27] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk1002.eqiad.wmnet [13:31:06] (03CR) 10Ssingh: [C:03+2] team-traffic: add dnsbox alert for service status mismatch [alerts] - 10https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:31:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host es1041.eqiad.wmnet [13:31:37] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1041.eqiad.wmnet with reason: Maintenance [13:31:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.reboot-single for host es1041.eqiad.wmnet [13:32:09] (03CR) 10Brouberol: [C:03+2] kafka.roll-restart-reboot-broker: perform action on controller last [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:32:57] (03Merged) 10jenkins-bot: team-traffic: add dnsbox alert for service status mismatch [alerts] - 10https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:33:49] (03CR) 10Elukey: [C:03+1] cookbook API: expand argument_task_required docs [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167442 (owner: 10Volans) [13:34:06] (03CR) 10Volans: [C:03+2] cookbook API: expand argument_task_required docs [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167442 (owner: 10Volans) [13:34:22] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1002.eqiad.wmnet [13:34:37] jouncebot: nowandnext [13:34:38] For the next 0 hour(s) and 25 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1300) [13:34:38] In 0 hour(s) and 25 minute(s): Wikifunctions Services UTC Afternoon 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1400) [13:35:29] (03CR) 10Vgutierrez: [C:03+2] hiera: Deploy and enable measure cert on upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/1167616 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [13:35:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:36:00] (03PS1) 10DDesouza: Pre-deploy Readers Use Cases Survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167622 (https://phabricator.wikimedia.org/T398870) [13:36:30] !log mwscript-k8s --comment="T397270" -f --file /srv/mediawiki/php-1.45.0-wmf.9/extensions/CampaignEvents/maintenance/countryExceptionMappings.csv -- CampaignEvents:UpdateCountriesColumn --wiki metawiki --exceptions countryExceptionMappings.csv [13:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:33] T397270: Create a script to update the country schema to the new format - https://phabricator.wikimedia.org/T397270 [13:36:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1044.eqiad.wmnet with reason: Maintenance [13:36:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1044 for upgrade', diff saved to https://phabricator.wikimedia.org/P78829 and previous config saved to /var/cache/conftool/dbconfig/20250709-133639-marostegui.json [13:38:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.reboot-single for host es1044.eqiad.wmnet [13:38:47] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk1003.eqiad.wmnet [13:39:01] !log mwscript-k8s --comment="T397270" -f --file /srv/mediawiki/php-1.45.0-wmf.9/extensions/CampaignEvents/maintenance/countryExceptionMappings.csv -- CampaignEvents:UpdateCountriesColumn --wiki officewiki --exceptions countryExceptionMappings.csv [13:39:04] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:07] !log mwscript-k8s --comment="T397270" -f --file /srv/mediawiki/php-1.45.0-wmf.9/extensions/CampaignEvents/maintenance/countryExceptionMappings.csv -- CampaignEvents:UpdateCountriesColumn --wiki testwiki --exceptions countryExceptionMappings.csv [13:40:09] jouncebot: nowandnext [13:40:09] For the next 0 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1300) [13:40:09] In 0 hour(s) and 19 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1400) [13:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:17] (03CR) 10Zabe: [C:03+2] ApiQueryCategoryMembers: Try stop forcing index in read new code [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167574 (https://phabricator.wikimedia.org/T399037) (owner: 10Zabe) [13:40:19] (03CR) 10Zabe: [C:03+2] ApiQueryCategoryMembers: Try stop forcing index in read new code [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167573 (https://phabricator.wikimedia.org/T399037) (owner: 10Zabe) [13:40:20] (03CR) 10Zabe: [C:03+2] Fix categorylinks read new code for excluding categories [extensions/intersection] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167570 (https://phabricator.wikimedia.org/T398861) (owner: 10Zabe) [13:40:22] (03CR) 10Zabe: [C:03+2] Fix categorylinks read new code for excluding categories [extensions/intersection] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167569 (https://phabricator.wikimedia.org/T398861) (owner: 10Zabe) [13:40:37] 10SRE-tools, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10Spicerack: Proposal: adding a kafka admin client to spicerack - https://phabricator.wikimedia.org/T399069#10988166 (10Volans) An immediate workaround was implemented in 
https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1167593 that giv... [13:40:59] (03PS11) 10Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) [13:41:31] !log mwscript-k8s --comment="T397270" -f --file /srv/mediawiki/php-1.45.0-wmf.9/extensions/CampaignEvents/maintenance/countryExceptionMappings.csv -- CampaignEvents:UpdateCountriesColumn --wiki test2wiki --exceptions countryExceptionMappings.csv [13:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:34] T397270: Create a script to update the country schema to the new format - https://phabricator.wikimedia.org/T397270 [13:41:55] (03PS1) 10Hnowlan: Revert "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167624 [13:42:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167622 (https://phabricator.wikimedia.org/T398870) (owner: 10DDesouza) [13:42:08] I am done. 
[13:42:17] (03Merged) 10jenkins-bot: cookbook API: expand argument_task_required docs [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167442 (owner: 10Volans) [13:42:23] (03CR) 10Jgiannelos: [C:03+1] Revert "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167624 (owner: 10Hnowlan) [13:42:26] (03PS2) 10Hnowlan: Revert "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167624 [13:42:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167574 (https://phabricator.wikimedia.org/T399037) (owner: 10Zabe) [13:42:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167573 (https://phabricator.wikimedia.org/T399037) (owner: 10Zabe) [13:42:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/intersection] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167570 (https://phabricator.wikimedia.org/T398861) (owner: 10Zabe) [13:42:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/intersection] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167569 (https://phabricator.wikimedia.org/T398861) (owner: 10Zabe) [13:42:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1003.eqiad.wmnet [13:43:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host es1041.eqiad.wmnet [13:43:20] PROBLEM - Host cephosd2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:44:51] (03PS3) 10Hnowlan: Revert "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167624 [13:44:54] !log 
tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [13:44:56] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:44:56] PROBLEM - Host cephosd2002 is DOWN: PING CRITICAL - Packet loss = 100% [13:44:56] PROBLEM - Host cephosd2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:45:09] (03CR) 10Dreamy Jazz: [C:04-1] "As a compromise, could we consider grouping the wikis which don't use `+` somewhere closer to the top of the list? It then makes it cleare" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166499 (https://phabricator.wikimedia.org/T398738) (owner: 10Dreamrimmer) [13:45:19] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [13:45:22] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:45:53] !log Depooling chartmuseum in codfw [13:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:04] !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=helm-charts.*,name=codfw [13:46:28] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host chartmuseum2001.codfw.wmnet [13:46:34] !log deploy measure/measure-goog certs in the upload CDN cluster - T394484 [13:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:37] T394484: Consider using a dedicated TLS certificate for upload.w.o - https://phabricator.wikimedia.org/T394484 [13:46:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [13:47:57] looks like one pod misbehaving [13:47:58] RECOVERY - Host cephosd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms [13:48:13] looking [13:48:24] RECOVERY - Host 
cephosd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [13:48:24] RECOVERY - Host cephosd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms [13:48:30] yeah [13:48:33] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167555 (https://phabricator.wikimedia.org/T396621) (owner: 10Fabfur) [13:48:38] PROBLEM - Bird Internet Routing Daemon on cephosd2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:48:41] I am going to test a new DNS box alert by stopping the VIP advertisement but keeping the service pooled. no impact is expected. will keep an eye out. [13:48:53] 10SRE-tools, 06cloud-services-team, 06Infrastructure-Foundations: sre.hosts.decommission often leaves dangling things in netbox - https://phabricator.wikimedia.org/T398052#10988206 (10taavi) →14Duplicate dup:03T398412 [13:48:56] 06SRE, 06Infrastructure-Foundations, 10netbox, 10netops: Decom cookbook: delete virtual interfaces from device - https://phabricator.wikimedia.org/T398412#10988208 (10taavi) [13:49:04] hnowlan: delete it? 
[13:49:24] PROBLEM - Bird Internet Routing Daemon on cephosd2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:49:32] claime: just checking it out first [13:49:42] hnowlan: ack, all yours [13:49:55] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host es1044.eqiad.wmnet [13:50:17] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum2001.codfw.wmnet [13:50:26] !log cgoubert@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=helm-charts.*,name=codfw [13:50:47] !log Depooling chartmuseum in eqiad [13:50:47] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns7002.wikimedia.org,service=authdns-update [reason: testing alert] [13:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:55] !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=helm-charts.*,name=eqiad [13:51:08] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host chartmuseum1001.eqiad.wmnet [13:51:43] (03PS1) 10Marostegui: reboot_es.sh: Reboot standalone external store [software] - 10https://gerrit.wikimedia.org/r/1167626 [13:51:57] (03Merged) 10jenkins-bot: ApiQueryCategoryMembers: Try stop forcing index in read new code [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167574 (https://phabricator.wikimedia.org/T399037) (owner: 10Zabe) [13:52:03] (03Merged) 10jenkins-bot: ApiQueryCategoryMembers: Try stop forcing index in read new code [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167573 (https://phabricator.wikimedia.org/T399037) (owner: 10Zabe) [13:52:05] jclark@cumin1002 reimage (PID 174304) is awaiting input [13:52:05] (03Merged) 10jenkins-bot: Fix categorylinks read new code for excluding categories [extensions/intersection] (wmf/1.45.0-wmf.9) - 
https://gerrit.wikimedia.org/r/1167570 (https://phabricator.wikimedia.org/T398861) (owner: Zabe) [13:52:08] (Merged) jenkins-bot: Fix categorylinks read new code for excluding categories [extensions/intersection] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1167569 (https://phabricator.wikimedia.org/T398861) (owner: Zabe) [13:52:15] yeah, wedged for a while [13:52:19] deleted the pod [13:52:26] recurrence of https://phabricator.wikimedia.org/T374350 [13:52:27] ack [13:52:35] (CR) Marostegui: "Federico, FYI, you can use this to reboot the pending RO hosts in external store." [software] - https://gerrit.wikimedia.org/r/1167626 (owner: Marostegui) [13:52:37] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1167574|ApiQueryCategoryMembers: Try stop forcing index in read new code (T399037)]], [[gerrit:1167573|ApiQueryCategoryMembers: Try stop forcing index in read new code (T399037)]], [[gerrit:1167570|Fix categorylinks read new code for excluding categories (T398861 T398939)]], [[gerrit:1167569|Fix categorylinks read new code for excluding categories (T39886 [13:52:37] 1 T398939)]] [13:52:38] (CR) Marostegui: [C:+2] reboot_es.sh: Reboot standalone external store [software] - https://gerrit.wikimedia.org/r/1167626 (owner: Marostegui) [13:52:47] T399037: Expectation (readQueryTime <= 5) by MediaWiki\Api\ApiMain::setRequestExpectations not met (actual: {actualSeconds}) in trx #{trxId}:{query} - https://phabricator.wikimedia.org/T399037 [13:52:47] T398861: Expectation (readQueryTime <= 5) by MediaWiki\Api\ApiMain::setRequestExpectations not met (actual: {actualSeconds}) in trx #{trxId}:{query} - https://phabricator.wikimedia.org/T398861 [13:52:48] T398939: DynamicPageList with notcategory producing duplicates - https://phabricator.wikimedia.org/T398939 [13:52:48] T39886: action=mobileview & page=Main_Page & sections=references returns HTTP 500 error - https://phabricator.wikimedia.org/T39886 [13:52:58] (PS2) 
Muehlenhoff: Move docker-report from build2001 to build2002 [puppet] - https://gerrit.wikimedia.org/r/1164219 (https://phabricator.wikimedia.org/T379343) [13:53:10] (Merged) jenkins-bot: reboot_es.sh: Reboot standalone external store [software] - https://gerrit.wikimedia.org/r/1167626 (owner: Marostegui) [13:53:14] jclark@cumin1002 reimage (PID 172340) is awaiting input [13:53:30] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum1001.eqiad.wmnet [13:53:42] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm [13:53:43] !log cgoubert@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=helm-charts.*,name=eqiad [13:53:47] (CR) Hnowlan: [C:+2] Revert "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - https://gerrit.wikimedia.org/r/1167624 (owner: Hnowlan) [13:54:06] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1050.eqiad.wmnet with OS bullseye [13:54:13] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1051.eqiad.wmnet with OS bullseye [13:54:14] !log delete three wedged thumbor pods showing signs of T374350 [13:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:18] T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error - https://phabricator.wikimedia.org/T374350 [13:54:23] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10988272 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmne... 
[13:54:27] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10988273 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmne... [13:54:45] (PS4) Elukey: pyrra: remove multi-dc for istio-based SLOs [puppet] - https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) [13:54:47] !log zabe@deploy1003 zabe: Backport for [[gerrit:1167574|ApiQueryCategoryMembers: Try stop forcing index in read new code (T399037)]], [[gerrit:1167573|ApiQueryCategoryMembers: Try stop forcing index in read new code (T399037)]], [[gerrit:1167570|Fix categorylinks read new code for excluding categories (T398861 T398939)]], [[gerrit:1167569|Fix categorylinks read new code for excluding categories (T398861 T398939)]] synced [13:54:47] to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:54:51] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org,service=authdns-update [reason: testing alert] [13:55:04] !log sukhe@dns1004 START - running authdns-update [13:55:11] (CR) Elukey: [C:+2] Move docker-report from build2001 to build2002 [puppet] - https://gerrit.wikimedia.org/r/1164219 (https://phabricator.wikimedia.org/T379343) (owner: Muehlenhoff) [13:55:16] (CR) Elukey: [C:+2] profile::docker::reporter: move to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) (owner: Elukey) [13:55:39] (Merged) jenkins-bot: Revert "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - https://gerrit.wikimedia.org/r/1167624 (owner: Hnowlan) [13:55:48] !log sukhe@dns1004 END - running authdns-update [13:55:50] !log zabe@deploy1003 zabe: Continuing with sync [13:56:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye [13:56:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [13:56:50] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10988290 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eqiad.... [13:56:56] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply