[00:59:09] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [00:59:18] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99) [00:59:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [00:59:44] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99) [01:00:13] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [01:00:22] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99) [01:01:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [01:01:11] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99) [01:01:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [01:01:24] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99) [01:04:32] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [01:04:41] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99) [01:09:27] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [01:09:36] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99) [01:11:07] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [01:11:15] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99) [01:13:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [01:15:02] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0) [01:15:33] (03PS1) 10Andrew Bogott: ceph.py: fix detection of BOSS cards (os-hw-raid) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1190805 [01:51:12] (03CR) 10Andrew Bogott: [C:03+2] ceph.py: fix detection of BOSS cards (os-hw-raid) [cloud/wmcs-cookbooks] - 
10https://gerrit.wikimedia.org/r/1190805 (owner: 10Andrew Bogott) [01:54:51] (03Merged) 10jenkins-bot: ceph.py: fix detection of BOSS cards (os-hw-raid) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1190805 (owner: 10Andrew Bogott) [01:58:05] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [01:58:39] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0) [02:00:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node [03:30:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [03:43:48] FIRING: PuppetFailure: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:43:58] 06cloud-services-team: PuppetFailure Puppet has failed on cloudcumin1001:9100 - https://phabricator.wikimedia.org/T405434 (10phaultfinder) 03NEW [03:52:53] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-12 [03:54:16] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-12 [04:07:18] RESOLVED: PuppetFailure: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:50:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having 
NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [06:26:08] !log godog@r5 toolsbeta START - Cookbook wmcs.nfs.add_server [06:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [06:34:20] !log godog@r5 toolsbeta END (PASS) - Cookbook wmcs.nfs.add_server (exit_code=0) [06:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [06:37:29] (03open) 10filippo: toolsbeta: flip NFS to Trixie VM [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/81 (https://phabricator.wikimedia.org/T404584) [06:46:11] !log component-configs@tools-bastion tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17968643293 [06:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-review/SAL [07:12:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) [07:29:29] 06cloud-services-team, 10Toolforge, 10Tools, 13Patch-For-Review: Update toolserver.org redirects to use toolforge.org - https://phabricator.wikimedia.org/T271862#11209032 (10taavi) 05Open→03Resolved [07:40:59] 06cloud-services-team: PuppetFailure Puppet has failed on cloudcumin1001:9100 - https://phabricator.wikimedia.org/T405434#11209053 (10Volans) 05Open→03Resolved p:05Triage→03Medium a:03Volans Transient failure of git pull for the `cloud/wmcs-cookbooks` repository, self-resolved at the next puppet run. 
[07:46:07] (03merge) 10dcaro: quota: adapt the quota to the new default cpu [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/75 [07:48:29] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: maintain-kubeusers: bump to 0.0.182-20250924074622-dac7fb25 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/978 (https://phabricator.wikimedia.org/T404726) [07:52:47] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers [07:58:31] (03open) 10dcaro: default_quota: bump version [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/76 [07:58:46] (03approved) 10dcaro: default_quota: bump version [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/76 [07:59:52] (03merge) 10dcaro: default_quota: bump version [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/76 [08:00:46] (03merge) 10filippo: toolsbeta: flip NFS to Trixie VM [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/81 (https://phabricator.wikimedia.org/T404584) [08:02:18] (03update) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: maintain-kubeusers: bump to 0.0.183-20250924080007-93ad9a3f [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/978 (https://phabricator.wikimedia.org/T404726) [08:02:21] (03update) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: maintain-kubeusers: bump to 0.0.183-20250924080007-93ad9a3f [repos/cloud/toolforge/toolforge-deploy] - 
10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/978 (https://phabricator.wikimedia.org/T404726) [08:02:33] !log dcaro@cloudcumin1001 toolsbeta END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component maintain-kubeusers [08:02:35] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers [08:03:27] !log godog@r5 toolsbeta START - Cookbook wmcs.nfs.migrate_service [08:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [08:05:21] !log godog@r5 toolsbeta END (FAIL) - Cookbook wmcs.nfs.migrate_service (exit_code=99) [08:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [08:08:37] FIRING: [2x] ProbeDown: Service toolsbeta-test-k8s-haproxy-5:443 has failed probes (http_admin_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:09:20] !log dcaro@cloudcumin1001 toolsbeta END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component maintain-kubeusers [08:11:27] FIRING: ToolsbetaNFSDown: No toolsbeta nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsbetaNFSDown [08:13:15] !log volans@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1060.eqiad.wmnet}' [08:19:14] !log godog@r5 toolsbeta START - Cookbook wmcs.nfs.migrate_service [08:19:14] !log godog@r5 toolsbeta END (FAIL) - Cookbook wmcs.nfs.migrate_service (exit_code=99) [08:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [08:19:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [08:27:21] FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown [08:35:07] !log volans@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1060.eqiad.wmnet}' [08:35:49] !log volans@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1061.eqiad.wmnet}' [08:41:17] 06cloud-services-team, 10Data-Services, 06Community-Tech, 06Data-Engineering, 10Multiblocks: Unexpected error "Subquery returns more than 1 row" on wiki replicas - https://phabricator.wikimedia.org/T404473#11209201 (10BTullis) All done for `an-redacteddb1001`. Thanks, all. [08:41:30] !log filippo@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for toolsbeta-bastion-7 [08:41:58] 06cloud-services-team, 10Data-Services, 06Community-Tech, 06Data-Engineering, 10Multiblocks: Unexpected error "Subquery returns more than 1 row" on wiki replicas - https://phabricator.wikimedia.org/T404473#11209202 (10BTullis) a:05BTullis→03SD0001 [08:42:05] !log filippo@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for toolsbeta-bastion-7 [08:46:12] 06cloud-services-team, 10Toolforge (Toolforge iteration 24): [infra,haproxy,ingress] 2025-09-23 Ingress hitting the backend session limit and started replying with 5xxs - https://phabricator.wikimedia.org/T405280#11209230 (10dcaro) Last night we did pass the 1k sessions per backend, and the haproxies were... 
[08:49:57] RESOLVED: ToolsbetaNFSDown: No toolsbeta nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsbetaNFSDown [08:51:30] PROBLEM - Host cloudvirt1061 is DOWN: PING CRITICAL - Packet loss = 100% [08:51:58] this is me running the cookbook, shouldn't the host be downtimed? [08:53:06] !log volans@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1061.eqiad.wmnet}' [08:53:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1061 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [08:53:57] RECOVERY - Host cloudvirt1061 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [08:54:35] dcaro: do you know if this is expected? 
^^^ [08:54:41] the first host I rebooted didn't alert [08:58:49] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1061 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [09:03:32] !log filippo@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-14 [09:03:34] !log filippo@cloudcumin1001 toolsbeta END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for tools-k8s-worker-nfs-14 [09:04:07] !log filippo@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for toolsbeta-test-k8s-worker-nfs-9, toolsbeta-test-k8s-worker-nfs-7 [09:04:08] !log filippo@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for toolsbeta-test-k8s-worker-nfs-9, toolsbeta-test-k8s-worker-nfs-7 [09:04:21] !log filippo@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for toolsbeta-test-k8s-worker-nfs-9,toolsbeta-test-k8s-worker-nfs-7 [09:04:22] !log filippo@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for toolsbeta-test-k8s-worker-nfs-9,toolsbeta-test-k8s-worker-nfs-7 [09:04:48] !log filippo@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for toolsbeta-test-k8s-worker-nfs-9 [09:07:02] !log volans@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1062.eqiad.wmnet}' [09:08:37] 06cloud-services-team, 10Cloud-VPS, 10Ceph, 06DC-Ops, and 2 others: cloudcephosd1025 won't reimage - https://phabricator.wikimedia.org/T405258#11209262 (10Jclark-ctr) 05Open→03Resolved a:05dcaro→03Jclark-ctr Server completed Reimage by andrew [09:10:44] !log filippo@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for toolsbeta-test-k8s-worker-nfs-9 
[09:11:38] !log filippo@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for toolsbeta-test-k8s-worker-nfs-7 [09:17:27] !log filippo@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for toolsbeta-test-k8s-worker-nfs-7 [09:19:14] !log filippo@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for toolsbeta-test-k8s-worker-nfs-10, toolsbeta-test-k8s-worker-nfs-8, toolsbeta-test-k8s-worker-nfs-11 [09:26:26] PROBLEM - Host cloudvirt1062 is DOWN: PING CRITICAL - Packet loss = 100% [09:27:49] !log volans@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1062.eqiad.wmnet}' [09:27:56] RECOVERY - Host cloudvirt1062 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [09:28:51] RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown [09:29:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1062 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [09:31:12] !log filippo@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for toolsbeta-test-k8s-worker-nfs-10, toolsbeta-test-k8s-worker-nfs-8, toolsbeta-test-k8s-worker-nfs-11 [09:32:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-bastion-7 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [09:33:07] RESOLVED: [2x] ProbeDown: Service toolsbeta-test-k8s-haproxy-5:443 has failed probes (http_admin_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [09:34:49] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1062 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [09:43:58] FIRING: JobsEmailerNoEmails: No emails sent in the last hour - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsEmailerNoEmails - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsEmailerNoEmails [09:44:45] (03open) 10dcaro: volumes: delete unused volume [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/82 [09:46:24] 06cloud-services-team, 10Cloud-VPS, 06SRE-OnFire, 05Cloud-Services-Origin-Team, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681#11209323 (10fgiunchedi) I can confirm this is still the case, `Profil... 
[09:46:28] FIRING: WidespreadPuppetAgentFailure: Widespread puppet agent failures in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure [09:46:33] (03approved) 10filippo: volumes: delete unused volume [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/82 (owner: 10dcaro) [09:47:06] (03approved) 10fnegri: volumes: delete unused volume [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/82 (owner: 10dcaro) [09:49:51] (03merge) 10dcaro: volumes: delete unused volume [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/82 [09:52:28] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-bastion-7 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [09:57:58] FIRING: [2x] PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-bastion-7 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [10:01:45] 06cloud-services-team, 10Cloud-VPS, 06SRE-OnFire, 05Cloud-Services-Origin-Team, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681#11209379 (10taavi) My understanding is that the reason for this is th... 
[10:02:58] FIRING: [3x] PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-bastion-7 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [10:06:41] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers [10:07:58] FIRING: [4x] PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-bastion-7 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [10:09:02] (03update) 10dcaro: Expand `source` support in `ToolConfig` [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/132 (https://phabricator.wikimedia.org/T402764) (owner: 10damian) [10:12:58] FIRING: [4x] PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-mail-2 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [10:14:59] 06cloud-services-team, 10Toolforge: [envvars] ease revealing a secret - https://phabricator.wikimedia.org/T405024#11209402 (10dcaro) > Intuitively toolforge envvars show TOOL_DB_HOST would just give me the value as I already specified the key, so telling me the key I just tool the tool is a bit redundant... ho... 
[10:17:28] RESOLVED: JobsEmailerNoEmails: No emails sent in the last hour - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsEmailerNoEmails - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsEmailerNoEmails [10:17:58] RESOLVED: WidespreadPuppetAgentFailure: Widespread puppet agent failures in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure [10:17:58] FIRING: [5x] PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-mail-2 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [10:22:57] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component maintain-kubeusers [10:22:58] FIRING: [5x] PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-mail-2 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [10:26:13] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers [10:32:36] (03update) 10damian: Expand `source` support in `ToolConfig` [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/132 (https://phabricator.wikimedia.org/T402764) [10:32:58] RESOLVED: [5x] PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-mail-2 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [10:41:13] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-kubeusers [10:41:51] (03update) 10damian: Expand `source` support in `ToolConfig` [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/132 (https://phabricator.wikimedia.org/T402764) [10:46:31] (03update) 10damian: Expand `source` support 
in `ToolConfig` [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/132 (https://phabricator.wikimedia.org/T402764) [11:01:26] (03update) 10raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213 [11:17:55] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review: tf-infra-test misbehavior in codfw1dev - https://phabricator.wikimedia.org/T391718#11209553 (10taavi) 05Open→03Resolved Seems fixed? [11:21:08] 10cloud-services-team (FY2025/26-Q1), 10Toolforge (Toolforge iteration 24): [tools,nfs,infra] Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11209569 (10fgiunchedi) `toolsbeta` NFS server upgrade happened today, not without issue, below the notes I took as... [11:22:37] (03update) 10raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213 [11:27:35] 06cloud-services-team, 10Data-Services, 06Community-Tech, 06Data-Engineering, 10Multiblocks: Unexpected error "Subquery returns more than 1 row" on wiki replicas - https://phabricator.wikimedia.org/T404473#11209584 (10SD0001) >>! In T404473#11209201, @BTullis wrote: > All done for `an-redacteddb1001`. Th... [11:27:37] 06cloud-services-team, 10Horizon: Keystone auth endpoint should use a standard HTTPS port - https://phabricator.wikimedia.org/T377055#11209585 (10taavi) This was done for Horizon at some point, but the Keystone service catalog and various hard-coded references still refer to the :25000 endpoint. @andrew do yo... [11:28:31] 06cloud-services-team, 10Cloud-VPS: nova-api can get the listen queue of socket full - https://phabricator.wikimedia.org/T362956#11209589 (10taavi) 05Open→03Resolved This has not happened recently. 
[11:30:56] 06cloud-services-team: Replace 'download' cloud-vps project after we support per-tool object storage - https://phabricator.wikimedia.org/T367593#11209601 (10taavi) [11:31:06] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: [toolforge,storage] Provide per-tool access to cloud-vps object storage - https://phabricator.wikimedia.org/T358496#11209602 (10taavi) [11:31:10] 06cloud-services-team: Replace 'download' cloud-vps project after we support per-tool object storage - https://phabricator.wikimedia.org/T367593#11209603 (10taavi) 05Stalled→03Open [11:34:32] 06cloud-services-team, 10Cloud-VPS, 06SRE-OnFire, 05Cloud-Services-Origin-Team, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681#11209619 (10fgiunchedi) Ok if we have systemd-networkd everywhere the... [11:44:36] 10Cloud-VPS, 06tools-infrastructure-team: Allow customizing which mounts to enable per VM - https://phabricator.wikimedia.org/T405462 (10taavi) 03NEW p:05Triage→03Medium [11:53:30] (03update) 10raymond-ndibe: jobs-api: bump to 0.0.416-20250923131926-45b7b4f9 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/977 (https://phabricator.wikimedia.org/T404726) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [11:56:18] 10Toolforge (Toolforge iteration 24): [jobs-api] pod cpu request greater than limitrange in lima-kilo, broken - https://phabricator.wikimedia.org/T405463 (10Raymond_Ndibe) 03NEW [12:03:14] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers [12:06:59] !log component-configs@tools-bastion tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17976062312 [12:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-review/SAL [12:11:25] 
!log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component maintain-kubeusers [12:11:48] !log component-configs@tools-bastion tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17976182663 [12:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-review/SAL [12:14:02] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers [12:28:38] !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component maintain-kubeusers [12:29:20] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers [12:34:10] 10Cloud-VPS, 06tools-infrastructure-team, 13Patch-For-Review: Allow customizing which mounts to enable per VM - https://phabricator.wikimedia.org/T405462#11209800 (10taavi) 05Open→03Declined Filippo convinced me that this is probably not that useful. [12:34:25] (03open) 10damian: Change `show` command to output raw value [repos/cloud/toolforge/envvars-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/92 (https://phabricator.wikimedia.org/T405024) [12:35:32] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: [envvars] ease revealing a secret - https://phabricator.wikimedia.org/T405024#11209810 (10DamianZaremba) > I think it should be ok to only show the value with toolforge envvars show MYVAR https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/... 
[12:36:08] !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component maintain-kubeusers [12:37:45] (03open) 10dcaro: jobs: lower the default cpu limit to 1cpu [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/217 [12:40:17] (03approved) 10dcaro: maintain-kubeusers: bump to 0.0.183-20250924080007-93ad9a3f [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/978 (https://phabricator.wikimedia.org/T404726) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [12:40:22] (03merge) 10dcaro: maintain-kubeusers: bump to 0.0.183-20250924080007-93ad9a3f [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/978 (https://phabricator.wikimedia.org/T404726) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [12:40:32] (03update) 10dcaro: jobs-api: bump to 0.0.416-20250923131926-45b7b4f9 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/977 (https://phabricator.wikimedia.org/T404726) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [12:46:11] !log component-configs tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17977065589 (https://github.com/cluebotng/component-configs/commits/refs/heads/main) [12:46:13] wm-bot2: Unknown project "component-configs" [12:51:58] !log tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17977211158 (https://github.com/cluebotng/component-configs/commits/refs/heads/main) [12:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-review/SAL [12:54:23] (03update) 10damian: Expand `source` support in `ToolConfig` 
[repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/132 (https://phabricator.wikimedia.org/T402764) [12:54:35] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [13:04:25] (03update) 10raymond-ndibe: jobs-api: bump to 0.0.416-20250923131926-45b7b4f9 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/977 (https://phabricator.wikimedia.org/T404726) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [13:06:17] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api [13:06:53] 06cloud-services-team, 10PAWS, 06Commons, 10OpenRefine: New upstream release for Wikimedia Commons Extension for OpenRefine - https://phabricator.wikimedia.org/T403780#11209921 (10A_smart_kitten) (if this is UBN then presumably WMCS should be aware of this task) [13:07:07] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [13:08:27] (03update) 10dcaro: jobs-api: bump to 0.0.416-20250923131926-45b7b4f9 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/977 (https://phabricator.wikimedia.org/T404726) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [13:08:47] !log dcaro@cloudcumin1001 tools END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component jobs-cli [13:21:36] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [13:26:25] (03update) 10dcaro: Change `show` command to output raw value [repos/cloud/toolforge/envvars-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/92 (https://phabricator.wikimedia.org/T405024) (owner: 10damian) [13:30:28] 
06cloud-services-team, 10Toolforge, 13Patch-For-Review: [envvars] ease revealing a secret - https://phabricator.wikimedia.org/T405024#11209977 (10DamianZaremba) > (needs needs review label, needing someone with label access) thank you dcaro for doing the needful [13:33:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [13:34:18] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api [13:43:33] (03approved) 10dcaro: jobs-api: bump to 0.0.416-20250923131926-45b7b4f9 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/977 (https://phabricator.wikimedia.org/T404726) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [13:44:14] (03update) 10dcaro: jobs-api: bump to 0.0.416-20250923131926-45b7b4f9 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/977 (https://phabricator.wikimedia.org/T404726) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [13:44:21] (03merge) 10dcaro: jobs-api: bump to 0.0.416-20250923131926-45b7b4f9 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/977 (https://phabricator.wikimedia.org/T404726) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [13:47:38] (03approved) 10dcaro: d/changelog: bump to 16.1.21 [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/131 (https://phabricator.wikimedia.org/T404726) [13:47:43] (03merge) 10dcaro: d/changelog: bump to 16.1.21 
[repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/131 (https://phabricator.wikimedia.org/T404726) [13:47:47] FIRING: NodeDown: Node cloudcephmon1005 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephmon1005 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown [13:52:47] RESOLVED: NodeDown: Node cloudcephmon1005 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephmon1005 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown [13:52:59] 10Toolforge (Toolforge iteration 24), 13Patch-For-Review: [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11210085 (10dcaro) Default cpu requests reduced to 100m, and 1cpu limit, we are currently down to ~65% cpu request allocation (from >80%), should still... [13:54:49] 06cloud-services-team, 10Toolforge: Toolforge jobs for milhistbot not running - https://phabricator.wikimedia.org/T405310#11210095 (10dcaro) @Hawkeye7 The changes in the default resources were applied today, tonight you should see them back to the usual triggering delay. [13:55:45] 10Toolforge (Toolforge iteration 24), 13Patch-For-Review: [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11210101 (10MBH) I'm sorry, but what measurement unit is m? Also, little offtopic: is this true that default mem limit now is 1G, maximal limit is 4G,... [13:57:45] 10Toolforge (Toolforge iteration 24): [jobs-api] pod cpu request greater than limitrange in lima-kilo, broken - https://phabricator.wikimedia.org/T405463#11210121 (10dcaro) 05Open→03Invalid This is not an issue, you should use the toolforge-deploy unmerged MR for that component if there's any. 
[14:07:38] PROBLEM - Host cloudcephmon1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:09:08] RECOVERY - Host cloudcephmon1005 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [14:14:16] 06cloud-services-team, 07Epic: Action items and work for retro 20190403 - https://phabricator.wikimedia.org/T220020#11210195 (10taavi) 05Open→03Invalid This does not seem like a useful tracking task after 6 years :-) [14:14:32] 06cloud-services-team, 10Data-Services, 06Community-Tech, 06Data-Engineering, 10Multiblocks: Unexpected error "Subquery returns more than 1 row" on wiki replicas - https://phabricator.wikimedia.org/T404473#11210197 (10fnegri) @SD0001 apologies, I think I did something wrong yesterday and the change was n... [14:15:44] 06cloud-services-team, 10Toolforge, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Project: [toolforge-cli.build] Implement a --json flag to output machine-readable output - https://phabricator.wikimedia.org/T334589#11210206 (10taavi) [14:17:01] (03open) 10damian: restart deployment via template hash change [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/218 [14:29:56] (03update) 10damian: restart deployment via template hash change [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/218 [14:39:45] (03open) 10damian: restart deployment via template hash change [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/219 [14:40:02] (03close) 10damian: restart deployment via template hash change [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/219 [14:47:34] (03open) 10damian: update deployment via template hash change [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/220 [14:51:09] RESOLVED: CephClusterInWarning: Ceph 
cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [14:54:22] (03update) 10raymond-ndibe: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) [14:54:25] (03approved) 10raymond-ndibe: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) [14:54:50] (03update) 10damian: update deployment via template hash change [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/220 (https://phabricator.wikimedia.org/T403321) [14:55:34] (03update) 10damian: restart deployment via template hash change [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/218 (https://phabricator.wikimedia.org/T403321) [14:57:45] (03update) 10damian: update deployment via template hash change [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/220 (https://phabricator.wikimedia.org/T403321) [14:57:51] (03update) 10raymond-ndibe: [toolforge_deploy_mr.py] support deploy of MRs from external contributors [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/273 (https://phabricator.wikimedia.org/T394595) [14:59:38] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 06DC-Ops, 06SRE: 
Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478 (10fgiunchedi) 03NEW [14:59:48] 06cloud-services-team, 10Cloud-VPS, 10Ceph: [ceph,eqiad1] upgrade from quincy->reef (and bookworm) - https://phabricator.wikimedia.org/T404249#11210392 (10Andrew) 05Open→03Resolved ` root@cloudcephmon1004:~# ceph versions { "mon": { "ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b5810... [15:01:41] (03open) 10damian: allow restarting job during patch [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/221 (https://phabricator.wikimedia.org/T403321) [15:02:17] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [15:02:21] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) [15:03:34] FIRING: DiskSpace: Disk space cloudbackup1004:9100:/srv 6.99% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [15:03:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [15:03:38] andrew@cloudcumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[15:05:01] (03update) 10damian: update deployment via template hash change [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/220 (https://phabricator.wikimedia.org/T403321) [15:05:59] (03update) 10damian: allow restarting job during patch [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/221 (https://phabricator.wikimedia.org/T403321) [15:06:30] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 06DC-Ops, 06SRE: cloudcephosd10[48-52] service implementation - https://phabricator.wikimedia.org/T395910#11210428 (10Andrew) [15:11:49] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 06DC-Ops, 06SRE: cloudcephosd10[48-52] service implementation - https://phabricator.wikimedia.org/T395910#11210449 (10Andrew) 1050 and 1051 won't be pooled immediately, they're being reserved for T405478 [15:13:34] RESOLVED: DiskSpace: Disk space cloudbackup1004:9100:/srv 6.976% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [15:15:45] 06cloud-services-team, 10Openstack-Magnum: ssh to cloud-vps 'utility' nodes (magnum, trove, octavia) - https://phabricator.wikimedia.org/T402317#11210466 (10Andrew) > - Install Trove root keys on cloudcontrols, as octavia root keys are already > This is now done for octavia, trove, and PAWs. Is there dema... 
[15:32:52] (03open) 10damian: offload job restart logic to jobs-api [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/133 (https://phabricator.wikimedia.org/T403321) [15:33:13] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: [components-api] restart rather than delete/create continuous jobs - https://phabricator.wikimedia.org/T403321#11210549 (10DamianZaremba) Act 1: https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/218 (doesn't solve this, but so... [15:35:17] 10Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11210572 (10dcaro) >>! In T404726#11210101, @MBH wrote: > I'm sorry, but what measurement unit is m? That's 'milli-cpu', as in 1m = 0.001cpu (details here https://kubernetes... [15:56:28] (03update) 10damian: allow restarting job during patch [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/221 (https://phabricator.wikimedia.org/T403321) [16:02:46] 06cloud-services-team, 10Data-Services, 06Community-Tech, 06Data-Platform-SRE, 10Multiblocks: Unexpected error "Subquery returns more than 1 row" on wiki replicas - https://phabricator.wikimedia.org/T404473#11210697 (10Ottomata) [16:08:25] 10Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11210721 (10DamianZaremba) I see the quota was bumped (https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/maintain-kubeusers/values/to... [16:11:58] 06cloud-services-team, 10Toolforge: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957#11210749 (10Sakretsu) The job got stuck again around 2025-09-22T21:20:59Z. I can't say if it's related to the NFS issue. 
` tools.itwiki@tools-bastion-15:~/draftbot$ kubectl get po... [16:19:12] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 324 bytes in 60.014 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [16:21:10] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 55.024 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [16:22:50] (03update) 10damian: update deployment via template hash change [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/220 [16:23:11] (03update) 10damian: update deployment via template hash change [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/220 [16:23:59] (03update) 10damian: allow restarting job during patch [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/221 (https://phabricator.wikimedia.org/T403321) [16:24:25] (03update) 10damian: update deployment via template hash change [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/220 [16:25:28] (03update) 10raymond-ndibe: [deploy_task, tool_handlers] queue deployments to allow creation of multiple deployments at once [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/131 (https://phabricator.wikimedia.org/T402568) [16:32:14] 06cloud-services-team, 10Toolforge: toolforge buildservice based tool is not starting - https://phabricator.wikimedia.org/T405319#11210896 (10dcaro) @santhosh can you retry now? 
Already changed the default resources for the cluster, things should go way smoother. [16:34:07] 10Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11210924 (10dcaro) @DamianZaremba looking [16:35:28] 06cloud-services-team, 10Toolforge: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957#11210937 (10DamianZaremba) https://k8s-status.toolforge.org/namespaces/tool-itwiki/pods/itwiki-draftbot-continuous-76fcff44b5-5q6wc/ shows this on tools-k8s-worker-nfs-73 which doe... [16:46:21] 10Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11211017 (10dcaro) >>! In T404726#11210924, @dcaro wrote: > @DamianZaremba looking Done, I think it probably has updated it already, but if not it will take a minute. [16:46:58] 10Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11211033 (10DamianZaremba) > Done, I think it probably has updated it already, but if not it will take a minute. Looks good, thanks [16:49:11] 06cloud-services-team, 10Toolforge: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957#11211054 (10dcaro) Yep, the process got stuck on NFS, restarting, will move the pod to a different node [16:49:15] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: [components-api] restart rather than delete/create continuous jobs - https://phabricator.wikimedia.org/T403321#11211055 (10DamianZaremba) @dcaro could you tag those as needing review when you have a min... sorry for picking you, but since this is ultima... 
[16:50:40] !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-73 (T400957) [16:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:50:46] T400957: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957 [16:53:40] (03update) 10raymond-ndibe: [deploy_task, tool_handlers] queue deployments to allow creation of multiple deployments at once [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/131 (https://phabricator.wikimedia.org/T402568) [16:53:40] 10Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11211111 (10dcaro) [16:55:13] !log tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17983740752 (https://github.com/cluebotng/component-configs/commits/refs/heads/main) [16:55:14] !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17983737370 (https://github.com/cluebotng/component-configs/commits/refs/heads/main) [16:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-review/SAL [16:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL [16:55:27] !log tools.cluebotng-monitoring Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17983743046 (https://github.com/cluebotng/component-configs/commits/refs/heads/main) [16:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-monitoring/SAL [16:56:12] !log tools.cluebotng-staging Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17983738823 (https://github.com/cluebotng/component-configs/commits/refs/heads/main) [16:56:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-staging/SAL [16:57:20] !log dcaro@acme tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-73 (T400957) [16:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:57:28] T400957: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957 [16:59:49] (03update) 10dcaro: cli: ignore replicas if not sent back from API [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/129 [17:00:12] (03approved) 10dcaro: cli: ignore replicas if not sent back from API [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/129 [17:03:39] (03update) 10dcaro: cli: ignore replicas if not sent back from API [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/129 [17:03:47] (03merge) 10dcaro: cli: ignore replicas if not sent back from API [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/129 [17:04:37] (03open) 10dcaro: d/changelog: bump to 16.1.22 [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/132 (https://phabricator.wikimedia.org/T390136) [17:05:04] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [17:06:50] !log tools.cluebotng-monitoring Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17984009139 (https://github.com/cluebotng/component-configs/commits/refs/heads/main) [17:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-monitoring/SAL [17:06:52] !log tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17984009157 
(https://github.com/cluebotng/component-configs/commits/refs/heads/main) [17:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-review/SAL [17:07:45] !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers for tools-k8s-worker-nfs-43 [17:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:08:12] !log tools.cluebotng-staging Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17984009170 (https://github.com/cluebotng/component-configs/commits/refs/heads/main) [17:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-staging/SAL [17:10:16] !log dcaro@acme tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers (exit_code=0) for tools-k8s-worker-nfs-43 [17:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:13:16] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli [17:23:56] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [17:28:50] !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers for tools-k8s-worker-nfs-43 [17:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:30:46] !log dcaro@acme tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers (exit_code=0) for tools-k8s-worker-nfs-43 [17:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:32:43] !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli [17:32:46] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [17:38:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-43 has at least 12 procs 
in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [17:40:02] !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17984820831 (https://github.com/cluebotng/component-configs/commits/6f47ae931d95d85e2c3c1d6b42f1eabc6d3b1960) [17:40:05] !log tools.cluebotng-editsets Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17984820847 (https://github.com/cluebotng/component-configs/commits/6f47ae931d95d85e2c3c1d6b42f1eabc6d3b1960) [17:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL [17:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-editsets/SAL [17:40:23] !log tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17984820837 (https://github.com/cluebotng/component-configs/commits/6f47ae931d95d85e2c3c1d6b42f1eabc6d3b1960) [17:40:24] !log tools.cluebotng-staging Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17984820843 (https://github.com/cluebotng/component-configs/commits/6f47ae931d95d85e2c3c1d6b42f1eabc6d3b1960) [17:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-review/SAL [17:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-staging/SAL [17:40:26] !log tools.cluebotng-monitoring Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17984820840 (https://github.com/cluebotng/component-configs/commits/6f47ae931d95d85e2c3c1d6b42f1eabc6d3b1960) [17:40:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-monitoring/SAL [17:41:23] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli [17:48:45] !log tools.cluebotng-editsets Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985011099 (https://github.com/cluebotng/component-configs/commits/f38bf48e73ce94da03cee36fb6cccbe483786e19) [17:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-editsets/SAL [17:52:57] !log tools.cluebotng-editsets Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985077157 (https://github.com/cluebotng/component-configs/commits/955cd61533af834d56d36117a81643d8ab9ba81f) [17:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-editsets/SAL [17:52:58] !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985077117 (https://github.com/cluebotng/component-configs/commits/955cd61533af834d56d36117a81643d8ab9ba81f) [17:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL [17:57:29] !log tools.cluebotng-monitoring Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985198507 (https://github.com/cluebotng/component-configs/commits/cfa2541734b05a9da326bbeab2e82cc21d6e91e4) [17:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-monitoring/SAL [17:57:36] !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985198515 (https://github.com/cluebotng/component-configs/commits/cfa2541734b05a9da326bbeab2e82cc21d6e91e4) [17:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL [17:57:38] !log tools.cluebotng-editsets Deployment 
completed: https://github.com/cluebotng/component-configs/actions/runs/17985198509 (https://github.com/cluebotng/component-configs/commits/cfa2541734b05a9da326bbeab2e82cc21d6e91e4) [17:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-editsets/SAL [17:58:35] !log tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985198575 (https://github.com/cluebotng/component-configs/commits/cfa2541734b05a9da326bbeab2e82cc21d6e91e4) [17:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-review/SAL [17:59:51] !log tools.cluebotng-staging Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985198519 (https://github.com/cluebotng/component-configs/commits/cfa2541734b05a9da326bbeab2e82cc21d6e91e4) [17:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-staging/SAL [18:03:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-43 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [18:05:00] !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985333827 (https://github.com/cluebotng/component-configs/commits/24f56031bca40f808e05920eb128fd892a3d9b92) [18:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL [18:05:18] !log tools.cluebotng-editsets Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985333851 (https://github.com/cluebotng/component-configs/commits/24f56031bca40f808e05920eb128fd892a3d9b92) 
[18:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-editsets/SAL [18:07:27] (03update) 10raymond-ndibe: loki.alloy: decrease frequency for fetching logs [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/962 (owner: 10dcaro) [18:08:00] !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985386789 (https://github.com/cluebotng/component-configs/commits/a1633fc7ecf1bd4a310fcb03d02d0da81212427a) [18:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL [18:11:13] !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985446508 (https://github.com/cluebotng/component-configs/commits/34320f16093c0d4160ce346108bc14bfb95245f8) [18:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL [18:14:47] 06cloud-services-team, 10Toolforge: [builds-api] return image digest - https://phabricator.wikimedia.org/T403322#11211409 (10DamianZaremba) Pretty sure I just observed image caching in the wild ` tools.cluebotng-trainer@tools-bastion-15:~$ toolforge build show Build ID: cluebotng-trainer-buildpacks-pipelinerun... [18:15:27] 06cloud-services-team, 10Toolforge: [builds-api] return image digest - https://phabricator.wikimedia.org/T403322#11211410 (10DamianZaremba) I'll try and have a look at getting this done once T403321 is merged. 
[18:19:54] !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985728221 (https://github.com/cluebotng/component-configs/commits/34320f16093c0d4160ce346108bc14bfb95245f8) [18:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL [18:23:06] 06cloud-services-team, 10Toolforge: [builds-api] return image digest - https://phabricator.wikimedia.org/T403322#11211459 (10DamianZaremba) ` tools.cluebotng-trainer@tools-bastion-15:~$ toolforge components deployment show Warning: You are using a beta feature of Toolforge. Deployment ID: 20250924-181937-2kh9i... [18:25:20] 06cloud-services-team, 10Toolforge: [builds-api] return image digest - https://phabricator.wikimedia.org/T403322#11211463 (10DamianZaremba) Ah, so this is because the component changed from `trainer` to `coordinator` but the previous `CronJob` was not deleted... so false alarm on caching [18:26:24] 06cloud-services-team, 10Toolforge: [builds-api] return image digest - https://phabricator.wikimedia.org/T403322#11211467 (10DamianZaremba) Just for any casual observer, it is actually working; ` tools.cluebotng-trainer@tools-bastion-15:~$ kubectl create job --from=cronjob/coordinator coordinator job.batch/coo... 
[18:31:28] !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985989347 (https://github.com/cluebotng/component-configs/commits/38b2f2235d143f0522d560a93c04e50e8fa38c77)
[18:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL
[18:34:27] !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17986015781 (https://github.com/cluebotng/component-configs/commits/b0f088ec49a2156a18ad7d78d18640f6e8fe943c)
[18:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL
[18:36:30] !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17986145339 (https://github.com/cluebotng/component-configs/commits/820291b11edc6256cfa07e9a4a73677df86e52d8)
[18:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL
[18:40:16] (update) raymond-ndibe: loki.alloy: decrease frequency for fetching logs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/962 (owner: dcaro)
[18:40:18] (approved) raymond-ndibe: loki.alloy: decrease frequency for fetching logs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/962 (owner: dcaro)
[18:48:52] (approved) raymond-ndibe: jobs: lower the default cpu limit to 1cpu [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/217 (owner: dcaro)
[18:51:37] (update) raymond-ndibe: logs: use logs-api for logs [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/121 (owner: dcaro)
[18:55:03] (update) raymond-ndibe: logs-api: add new component [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/911 (owner: dcaro)
[19:06:14] cloud-services-team (FY2025/26-Q1), Cloud-VPS, DC-Ops, SRE, Patch-For-Review: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478#11211573 (cmooney) I can confirm the switch is already set to accept tagged traffic for t...
[19:07:09] cloud-services-team, Cloud-VPS, Ceph: Review RAM allocation for cloudceph OSDs - https://phabricator.wikimedia.org/T404747#11211578 (Andrew) Here's another example: Cloudcephosd1016 is using around 29G of RAM according to grafana. ` -7 13.97253 host cloudcephosd1016...
[19:31:30] cloud-services-team, Cloud-VPS, Ceph: Review RAM allocation for cloudceph OSDs - https://phabricator.wikimedia.org/T404747#11211697 (Andrew) So... it doesn't look like ceph will make use of more RAM even if we offer it up. If it did, changing the cluster-wide setting from 6 to 8 would run the risk of...
[19:48:59] cloud-services-team, Cloud-VPS, Ceph: Review RAM allocation for cloudceph OSDs - https://phabricator.wikimedia.org/T404747#11211733 (Andrew) For now I'm leaving codfw1 with osd_memory_target_autotune=true and eqiad1 with osd_memory_target_autotune=false and osd_memory_target=6442450944 -- I'm not con...
[20:03:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[20:13:09] cloud-services-team (FY2025/26-Q1), Cloud-VPS, DC-Ops, SRE, Patch-For-Review: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478#11211785 (Andrew) @fgiunchedi, 1050 and 1051 should already be fully puppetized with Ceph...
[20:14:32] (PS1) Andrew Bogott: inventory: update expected ceph version to 18 [cloud/wmcs-cookbooks] - https:
[23:08:58] cloud-services-team, Toolforge (Toolforge iteration 24): Update maintain_kubeusers to use the toolstate database - https://phabricator.wikimedia.org/T334629#11212421 (Raymond_Ndibe)
[23:18:44] (PS1) Stevemunene: idp: Add dummy data for airflow-wikidata [labs/private] - https://gerrit.wikimedia.org/r/1191190 (https://phabricator.wikimedia.org/T404073)
[23:30:18] (open) damian: getBuild - add digest to image [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/144 (https://phabricator.wikimedia.org/T403322)