[00:59:09] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[00:59:18] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[00:59:35] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[00:59:44] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[01:00:13] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[01:00:22] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[01:01:03] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[01:01:11] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[01:01:16] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[01:01:24] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[01:04:32] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[01:04:41] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[01:09:27] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[01:09:36] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[01:11:07] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[01:11:15] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[01:13:46] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[01:15:02] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[01:15:33] <wikibugs>	 (PS1) Andrew Bogott: ceph.py: fix detection of BOSS cards (os-hw-raid) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1190805
[01:51:12] <wikibugs>	 (CR) Andrew Bogott: [C:+2] ceph.py: fix detection of BOSS cards (os-hw-raid) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1190805 (owner: Andrew Bogott)
[01:54:51] <wikibugs>	 (Merged) jenkins-bot: ceph.py: fix detection of BOSS cards (os-hw-raid) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1190805 (owner: Andrew Bogott)
[01:58:05] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[01:58:39] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[02:00:48] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[03:30:03] <wmcs-alerts>	 FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[03:43:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:43:58] <wikibugs>	 cloud-services-team: PuppetFailure Puppet has failed on cloudcumin1001:9100 - https://phabricator.wikimedia.org/T405434 (phaultfinder) NEW
[03:52:53] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-12
[03:54:16] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-12
[04:07:18] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:50:03] <wmcs-alerts>	 RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[06:26:08] <wm-bot2>	 !log godog@r5 toolsbeta START - Cookbook wmcs.nfs.add_server
[06:26:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[06:34:20] <wm-bot2>	 !log godog@r5 toolsbeta END (PASS) - Cookbook wmcs.nfs.add_server (exit_code=0)
[06:34:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[06:37:29] <wikibugs>	 (open) filippo: toolsbeta: flip NFS to Trixie VM [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/81 (https://phabricator.wikimedia.org/T404584)
[06:46:11] <wm-bot2>	 !log component-configs@tools-bastion tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17968643293
[06:46:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-review/SAL
[07:12:41] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[07:29:29] <wikibugs>	 cloud-services-team, Toolforge, Tools, Patch-For-Review: Update toolserver.org redirects to use toolforge.org - https://phabricator.wikimedia.org/T271862#11209032 (taavi) Open→Resolved
[07:40:59] <wikibugs>	 cloud-services-team: PuppetFailure Puppet has failed on cloudcumin1001:9100 - https://phabricator.wikimedia.org/T405434#11209053 (Volans) Open→Resolved p:Triage→Medium a:Volans Transient failure of git pull for the `cloud/wmcs-cookbooks` repository, self-resolved at the next puppet run.
[07:46:07] <wikibugs>	 (merge) dcaro: quota: adapt the quota to the new default cpu [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/75
[07:48:29] <wikibugs>	 (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: maintain-kubeusers: bump to 0.0.182-20250924074622-dac7fb25 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/978 (https://phabricator.wikimedia.org/T404726)
[07:52:47] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers
[07:58:31] <wikibugs>	 (open) dcaro: default_quota: bump version [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/76
[07:58:46] <wikibugs>	 (approved) dcaro: default_quota: bump version [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/76
[07:59:52] <wikibugs>	 (merge) dcaro: default_quota: bump version [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/76
[08:00:46] <wikibugs>	 (merge) filippo: toolsbeta: flip NFS to Trixie VM [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/81 (https://phabricator.wikimedia.org/T404584)
[08:02:18] <wikibugs>	 (update) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: maintain-kubeusers: bump to 0.0.183-20250924080007-93ad9a3f [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/978 (https://phabricator.wikimedia.org/T404726)
[08:02:21] <wikibugs>	 (update) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: maintain-kubeusers: bump to 0.0.183-20250924080007-93ad9a3f [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/978 (https://phabricator.wikimedia.org/T404726)
[08:02:33] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 toolsbeta END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component maintain-kubeusers
[08:02:35] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers
[08:03:27] <wm-bot2>	 !log godog@r5 toolsbeta START - Cookbook wmcs.nfs.migrate_service
[08:03:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[08:05:21] <wm-bot2>	 !log godog@r5 toolsbeta END (FAIL) - Cookbook wmcs.nfs.migrate_service (exit_code=99)
[08:05:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[08:08:37] <wmcs-alerts>	 FIRING: [2x] ProbeDown: Service toolsbeta-test-k8s-haproxy-5:443 has failed probes (http_admin_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:09:20] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 toolsbeta END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component maintain-kubeusers
[08:11:27] <wmcs-alerts>	 FIRING: ToolsbetaNFSDown: No toolsbeta nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsbetaNFSDown
[08:13:15] <logmsgbot_cloud>	 !log volans@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1060.eqiad.wmnet}'
[08:19:14] <wm-bot2>	 !log godog@r5 toolsbeta START - Cookbook wmcs.nfs.migrate_service
[08:19:14] <wm-bot2>	 !log godog@r5 toolsbeta END (FAIL) - Cookbook wmcs.nfs.migrate_service (exit_code=99)
[08:19:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[08:19:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[08:27:21] <wmcs-alerts>	 FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[08:35:07] <logmsgbot_cloud>	 !log volans@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1060.eqiad.wmnet}'
[08:35:49] <logmsgbot_cloud>	 !log volans@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1061.eqiad.wmnet}'
[08:41:17] <wikibugs>	 cloud-services-team, Data-Services, Community-Tech, Data-Engineering, Multiblocks: Unexpected error "Subquery returns more than 1 row" on wiki replicas - https://phabricator.wikimedia.org/T404473#11209201 (BTullis) All done for `an-redacteddb1001`. Thanks, all.
[08:41:30] <logmsgbot_cloud>	 !log filippo@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for toolsbeta-bastion-7
[08:41:58] <wikibugs>	 cloud-services-team, Data-Services, Community-Tech, Data-Engineering, Multiblocks: Unexpected error "Subquery returns more than 1 row" on wiki replicas - https://phabricator.wikimedia.org/T404473#11209202 (BTullis) a:BTullis→SD0001
[08:42:05] <logmsgbot_cloud>	 !log filippo@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for toolsbeta-bastion-7
[08:46:12] <wikibugs>	 cloud-services-team, Toolforge (Toolforge iteration 24): [infra,haproxy,ingress] 2025-09-23 Ingress hitting the backend session limit and started replying with 5xxs - https://phabricator.wikimedia.org/T405280#11209230 (dcaro) Last night we did pass the 1k sessions per backend, and the haproxies were...
[08:49:57] <wmcs-alerts>	 RESOLVED: ToolsbetaNFSDown: No toolsbeta nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsbetaNFSDown
[08:51:30] <icinga-wm>	 PROBLEM - Host cloudvirt1061 is DOWN: PING CRITICAL - Packet loss = 100%
[08:51:58] <volans>	 this is me running the cookbook, shouldn't the host be downtimed?
[08:53:06] <logmsgbot_cloud>	 !log volans@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1061.eqiad.wmnet}'
[08:53:49] <jinxer-wm>	 FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1061 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[08:53:57] <icinga-wm>	 RECOVERY - Host cloudvirt1061 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[08:54:35] <volans>	 dcaro: do you know if this is expected? ^^^
[08:54:41] <volans>	 the first host I rebooted didn't alert
[08:58:49] <jinxer-wm>	 RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1061 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[09:03:32] <logmsgbot_cloud>	 !log filippo@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-14
[09:03:34] <logmsgbot_cloud>	 !log filippo@cloudcumin1001 toolsbeta END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for tools-k8s-worker-nfs-14
[09:04:07] <logmsgbot_cloud>	 !log filippo@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for ￼toolsbeta-test-k8s-worker-nfs-9, toolsbeta-test-k8s-worker-nfs-7
[09:04:08] <logmsgbot_cloud>	 !log filippo@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for ￼toolsbeta-test-k8s-worker-nfs-9, toolsbeta-test-k8s-worker-nfs-7
[09:04:21] <logmsgbot_cloud>	 !log filippo@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for ￼toolsbeta-test-k8s-worker-nfs-9,toolsbeta-test-k8s-worker-nfs-7
[09:04:22] <logmsgbot_cloud>	 !log filippo@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for ￼toolsbeta-test-k8s-worker-nfs-9,toolsbeta-test-k8s-worker-nfs-7
[09:04:48] <logmsgbot_cloud>	 !log filippo@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for toolsbeta-test-k8s-worker-nfs-9
[09:07:02] <logmsgbot_cloud>	 !log volans@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1062.eqiad.wmnet}'
[09:08:37] <wikibugs>	 cloud-services-team, Cloud-VPS, Ceph, DC-Ops, and 2 others: cloudcephosd1025 won't reimage - https://phabricator.wikimedia.org/T405258#11209262 (Jclark-ctr) Open→Resolved a:dcaro→Jclark-ctr Server completed Reimage by andrew
[09:10:44] <logmsgbot_cloud>	 !log filippo@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for toolsbeta-test-k8s-worker-nfs-9
[09:11:38] <logmsgbot_cloud>	 !log filippo@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for toolsbeta-test-k8s-worker-nfs-7
[09:17:27] <logmsgbot_cloud>	 !log filippo@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for toolsbeta-test-k8s-worker-nfs-7
[09:19:14] <logmsgbot_cloud>	 !log filippo@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for toolsbeta-test-k8s-worker-nfs-10, toolsbeta-test-k8s-worker-nfs-8, toolsbeta-test-k8s-worker-nfs-11
[09:26:26] <icinga-wm>	 PROBLEM - Host cloudvirt1062 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:49] <logmsgbot_cloud>	 !log volans@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1062.eqiad.wmnet}'
[09:27:56] <icinga-wm>	 RECOVERY - Host cloudvirt1062 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[09:28:51] <wmcs-alerts>	 RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[09:29:49] <jinxer-wm>	 FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1062 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[09:31:12] <logmsgbot_cloud>	 !log filippo@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for toolsbeta-test-k8s-worker-nfs-10, toolsbeta-test-k8s-worker-nfs-8, toolsbeta-test-k8s-worker-nfs-11
[09:32:28] <wmcs-alerts>	 FIRING: PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-bastion-7 in project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[09:33:07] <wmcs-alerts>	 RESOLVED: [2x] ProbeDown: Service toolsbeta-test-k8s-haproxy-5:443 has failed probes (http_admin_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:34:49] <jinxer-wm>	 RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1062 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[09:43:58] <wmcs-alerts>	 FIRING: JobsEmailerNoEmails: No emails sent in the last hour - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsEmailerNoEmails  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsEmailerNoEmails
[09:44:45] <wikibugs>	 (open) dcaro: volumes: delete unused volume [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/82
[09:46:24] <wikibugs>	 cloud-services-team, Cloud-VPS, SRE-OnFire, Cloud-Services-Origin-Team, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681#11209323 (fgiunchedi) I can confirm this is still the case, `Profil...
[09:46:28] <wmcs-alerts>	 FIRING: WidespreadPuppetAgentFailure: Widespread puppet agent failures in project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[09:46:33] <wikibugs>	 (approved) filippo: volumes: delete unused volume [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/82 (owner: dcaro)
[09:47:06] <wikibugs>	 (approved) fnegri: volumes: delete unused volume [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/82 (owner: dcaro)
[09:49:51] <wikibugs>	 (merge) dcaro: volumes: delete unused volume [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/82
[09:52:28] <wmcs-alerts>	 RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-bastion-7 in project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[09:57:58] <wmcs-alerts>	 FIRING: [2x] PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-bastion-7 in project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[10:01:45] <wikibugs>	 cloud-services-team, Cloud-VPS, SRE-OnFire, Cloud-Services-Origin-Team, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681#11209379 (taavi) My understanding is that the reason for this is th...
[10:02:58] <wmcs-alerts>	 FIRING: [3x] PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-bastion-7 in project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[10:06:41] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers
[10:07:58] <wmcs-alerts>	 FIRING: [4x] PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-bastion-7 in project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[10:09:02] <wikibugs>	 (update) dcaro: Expand `source` support in `ToolConfig` [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/132 (https://phabricator.wikimedia.org/T402764) (owner: damian)
[10:12:58] <wmcs-alerts>	 FIRING: [4x] PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-mail-2 in project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[10:14:59] <wikibugs>	 cloud-services-team, Toolforge: [envvars] ease revealing a secret - https://phabricator.wikimedia.org/T405024#11209402 (dcaro) > Intuitively toolforge envvars show TOOL_DB_HOST would just give me the value as I already specified the key, so telling me the key I just told the tool is a bit redundant... ho...
[10:17:28] <wmcs-alerts>	 RESOLVED: JobsEmailerNoEmails: No emails sent in the last hour - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsEmailerNoEmails  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsEmailerNoEmails
[10:17:58] <wmcs-alerts>	 RESOLVED: WidespreadPuppetAgentFailure: Widespread puppet agent failures in project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[10:17:58] <wmcs-alerts>	 FIRING: [5x] PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-mail-2 in project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[10:22:57] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component maintain-kubeusers
[10:22:58] <wmcs-alerts>	 FIRING: [5x] PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-mail-2 in project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[10:26:13] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers
[10:32:36] <wikibugs>	 (update) damian: Expand `source` support in `ToolConfig` [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/132 (https://phabricator.wikimedia.org/T402764)
[10:32:58] <wmcs-alerts>	 RESOLVED: [5x] PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-mail-2 in project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[10:41:13] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-kubeusers
[10:41:51] <wikibugs>	 (update) damian: Expand `source` support in `ToolConfig` [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/132 (https://phabricator.wikimedia.org/T402764)
[10:46:31] <wikibugs>	 (update) damian: Expand `source` support in `ToolConfig` [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/132 (https://phabricator.wikimedia.org/T402764)
[11:01:26] <wikibugs>	 (update) raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213
[11:17:55] <wikibugs>	 cloud-services-team, Cloud-VPS, Patch-For-Review: tf-infra-test misbehavior in codfw1dev - https://phabricator.wikimedia.org/T391718#11209553 (taavi) Open→Resolved Seems fixed?
[11:21:08] <wikibugs>	 cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24): [tools,nfs,infra] Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11209569 (fgiunchedi) `toolsbeta` NFS server upgrade happened today, not without issue, below the notes I took as...
[11:22:37] <wikibugs>	 (update) raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213
[11:27:35] <wikibugs>	 cloud-services-team, Data-Services, Community-Tech, Data-Engineering, Multiblocks: Unexpected error "Subquery returns more than 1 row" on wiki replicas - https://phabricator.wikimedia.org/T404473#11209584 (SD0001) >>! In T404473#11209201, @BTullis wrote: > All done for `an-redacteddb1001`. Th...
[11:27:37] <wikibugs>	 cloud-services-team, Horizon: Keystone auth endpoint should use a standard HTTPS port - https://phabricator.wikimedia.org/T377055#11209585 (taavi) This was done for Horizon at some point, but the Keystone service catalog and various hard-coded references still refer to the :25000 endpoint. @andrew do yo...
[11:28:31] <wikibugs>	 cloud-services-team, Cloud-VPS: nova-api can get the listen queue of socket full - https://phabricator.wikimedia.org/T362956#11209589 (taavi) Open→Resolved This has not happened recently.
[11:30:56] <wikibugs>	 cloud-services-team: Replace 'download' cloud-vps project after we support per-tool object storage - https://phabricator.wikimedia.org/T367593#11209601 (taavi)
[11:31:06] <wikibugs>	 cloud-services-team, Toolforge, Patch-For-Review: [toolforge,storage] Provide per-tool access to cloud-vps object storage - https://phabricator.wikimedia.org/T358496#11209602 (taavi)
[11:31:10] <wikibugs>	 cloud-services-team: Replace 'download' cloud-vps project after we support per-tool object storage - https://phabricator.wikimedia.org/T367593#11209603 (taavi) Stalled→Open
[11:34:32] <wikibugs>	 cloud-services-team, Cloud-VPS, SRE-OnFire, Cloud-Services-Origin-Team, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681#11209619 (fgiunchedi) Ok if we have systemd-networkd everywhere the...
[11:44:36] <wikibugs>	 Cloud-VPS, tools-infrastructure-team: Allow customizing which mounts to enable per VM - https://phabricator.wikimedia.org/T405462 (taavi) NEW p:Triage→Medium
[11:53:30] <wikibugs>	 (update) raymond-ndibe: jobs-api: bump to 0.0.416-20250923131926-45b7b4f9 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/977 (https://phabricator.wikimedia.org/T404726) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[11:56:18] <wikibugs>	 Toolforge (Toolforge iteration 24): [jobs-api] pod cpu request greater than limitrange in lima-kilo, broken - https://phabricator.wikimedia.org/T405463 (Raymond_Ndibe) NEW
[12:03:14] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers
[12:06:59] <wm-bot2>	 !log component-configs@tools-bastion tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17976062312
[12:07:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-review/SAL
[12:11:25] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component maintain-kubeusers
[12:11:48] <wm-bot2>	 !log component-configs@tools-bastion tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17976182663
[12:11:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-review/SAL
[12:14:02] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers
[12:28:38] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component maintain-kubeusers
[12:29:20] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers
[12:34:10] <wikibugs>	 Cloud-VPS, tools-infrastructure-team, Patch-For-Review: Allow customizing which mounts to enable per VM - https://phabricator.wikimedia.org/T405462#11209800 (taavi) Open→Declined Filippo convinced me that this is probably not that useful.
[12:34:25] <wikibugs>	 (open) damian: Change `show` command to output raw value [repos/cloud/toolforge/envvars-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/92 (https://phabricator.wikimedia.org/T405024)
[12:35:32] <wikibugs>	 cloud-services-team, Toolforge, Patch-For-Review: [envvars] ease revealing a secret - https://phabricator.wikimedia.org/T405024#11209810 (DamianZaremba) > I think it should be ok to only show the value with toolforge envvars show MYVAR https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/...
[12:36:08] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component maintain-kubeusers
[12:37:45] <wikibugs>	 (open) dcaro: jobs: lower the default cpu limit to 1cpu [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/217
[12:40:17] <wikibugs>	 (approved) dcaro: maintain-kubeusers: bump to 0.0.183-20250924080007-93ad9a3f [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/978 (https://phabricator.wikimedia.org/T404726) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[12:40:22] <wikibugs>	 (merge) dcaro: maintain-kubeusers: bump to 0.0.183-20250924080007-93ad9a3f [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/978 (https://phabricator.wikimedia.org/T404726) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[12:40:32] <wikibugs>	 (update) dcaro: jobs-api: bump to 0.0.416-20250923131926-45b7b4f9 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/977 (https://phabricator.wikimedia.org/T404726) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[12:46:11] <wm-bot2>	 !log component-configs tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17977065589 (https://github.com/cluebotng/component-configs/commits/refs/heads/main)
[12:46:13] <stashbot>	 wm-bot2: Unknown project "component-configs"
[12:51:58] <wm-bot2>	 !log tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17977211158 (https://github.com/cluebotng/component-configs/commits/refs/heads/main)
[12:52:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-review/SAL
[12:54:23] <wikibugs>	 (update) damian: Expand `source` support in `ToolConfig` [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/132 (https://phabricator.wikimedia.org/T402764)
[12:54:35] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[13:04:25] <wikibugs>	 (update) raymond-ndibe: jobs-api: bump to 0.0.416-20250923131926-45b7b4f9 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/977 (https://phabricator.wikimedia.org/T404726) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:06:17] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[13:06:53] <wikibugs>	 cloud-services-team, PAWS, Commons, OpenRefine: New upstream release for Wikimedia Commons Extension for OpenRefine - https://phabricator.wikimedia.org/T403780#11209921 (A_smart_kitten) (if this is UBN then presumably WMCS should be aware of this task)
[13:07:07] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli
[13:08:27] <wikibugs>	 (update) dcaro: jobs-api: bump to 0.0.416-20250923131926-45b7b4f9 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/977 (https://phabricator.wikimedia.org/T404726) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:08:47] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component jobs-cli
[13:21:36] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[13:26:25] <wikibugs>	 (update) dcaro: Change `show` command to output raw value [repos/cloud/toolforge/envvars-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/92 (https://phabricator.wikimedia.org/T405024) (owner: damian)
[13:30:28] <wikibugs>	 cloud-services-team, Toolforge, Patch-For-Review: [envvars] ease revealing a secret - https://phabricator.wikimedia.org/T405024#11209977 (DamianZaremba) > (needs needs review label, needing someone with label access) thank you dcaro for doing the needful
[13:33:09] <jinxer-wm>	 FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[13:34:18] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[13:43:33] <wikibugs>	 (approved) dcaro: jobs-api: bump to 0.0.416-20250923131926-45b7b4f9 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/977 (https://phabricator.wikimedia.org/T404726) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:44:14] <wikibugs>	 (update) dcaro: jobs-api: bump to 0.0.416-20250923131926-45b7b4f9 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/977 (https://phabricator.wikimedia.org/T404726) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:44:21] <wikibugs>	 (merge) dcaro: jobs-api: bump to 0.0.416-20250923131926-45b7b4f9 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/977 (https://phabricator.wikimedia.org/T404726) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:47:38] <wikibugs>	 (approved) dcaro: d/changelog: bump to 16.1.21 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/131 (https://phabricator.wikimedia.org/T404726)
[13:47:43] <wikibugs>	 (merge) dcaro: d/changelog: bump to 16.1.21 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/131 (https://phabricator.wikimedia.org/T404726)
[13:47:47] <jinxer-wm>	 FIRING: NodeDown: Node cloudcephmon1005 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephmon1005 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[13:52:47] <jinxer-wm>	 RESOLVED: NodeDown: Node cloudcephmon1005 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephmon1005 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[13:52:59] <wikibugs>	 Toolforge (Toolforge iteration 24), Patch-For-Review: [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11210085 (dcaro) Default cpu requests reduced to 100m, and 1cpu limit, we are currently down to ~65% cpu request allocation (from >80%), should still...
[13:54:49] <wikibugs>	 cloud-services-team, Toolforge: Toolforge jobs for milhistbot not running - https://phabricator.wikimedia.org/T405310#11210095 (dcaro) @Hawkeye7 The changes in the default resources were applied today, tonight you should see them back to the usual triggering delay.
[13:55:45] <wikibugs>	 Toolforge (Toolforge iteration 24), Patch-For-Review: [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11210101 (MBH) I'm sorry, but what measurement unit is m?  Also, little offtopic: is this true that default mem limit now is 1G, maximal limit is 4G,...
[13:57:45] <wikibugs>	 Toolforge (Toolforge iteration 24): [jobs-api] pod cpu request greater than limitrange in lima-kilo, broken - https://phabricator.wikimedia.org/T405463#11210121 (dcaro) Open→Invalid This is not an issue, you should use the toolforge-deploy unmerged MR for that component if there's any.
[14:07:38] <icinga-wm>	 PROBLEM - Host cloudcephmon1005 is DOWN: PING CRITICAL - Packet loss = 100%
[14:09:08] <icinga-wm>	 RECOVERY - Host cloudcephmon1005 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[14:14:16] <wikibugs>	 cloud-services-team, Epic: Action items and work for retro 20190403 - https://phabricator.wikimedia.org/T220020#11210195 (taavi) Open→Invalid This does not seem like a useful tracking task after 6 years :-)
[14:14:32] <wikibugs>	 cloud-services-team, Data-Services, Community-Tech, Data-Engineering, Multiblocks: Unexpected error "Subquery returns more than 1 row" on wiki replicas - https://phabricator.wikimedia.org/T404473#11210197 (fnegri) @SD0001 apologies, I think I did something wrong yesterday and the change was n...
[14:15:44] <wikibugs>	 cloud-services-team, Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [toolforge-cli.build] Implement a --json flag to output machine-readable output - https://phabricator.wikimedia.org/T334589#11210206 (taavi)
[14:17:01] <wikibugs>	 (open) damian: restart deployment via template hash change [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/218
[14:29:56] <wikibugs>	 (update) damian: restart deployment via template hash change [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/218
[14:39:45] <wikibugs>	 (open) damian: restart deployment via template hash change [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/219
[14:40:02] <wikibugs>	 (close) damian: restart deployment via template hash change [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/219
[14:47:34] <wikibugs>	 (open) damian: update deployment via template hash change [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/220
[14:51:09] <jinxer-wm>	 RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:54:22] <wikibugs>	 (update) raymond-ndibe: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136)
[14:54:25] <wikibugs>	 (approved) raymond-ndibe: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136)
[14:54:50] <wikibugs>	 (update) damian: update deployment via template hash change [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/220 (https://phabricator.wikimedia.org/T403321)
[14:55:34] <wikibugs>	 (update) damian: restart deployment via template hash change [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/218 (https://phabricator.wikimedia.org/T403321)
[14:57:45] <wikibugs>	 (update) damian: update deployment via template hash change [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/220 (https://phabricator.wikimedia.org/T403321)
[14:57:51] <wikibugs>	 (update) raymond-ndibe: [toolforge_deploy_mr.py] support deploy of MRs from external contributors [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/273 (https://phabricator.wikimedia.org/T394595)
[14:59:38] <wikibugs>	 cloud-services-team (FY2025/26-Q1), Cloud-VPS, DC-Ops, SRE: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478 (fgiunchedi) NEW
[14:59:48] <wikibugs>	 cloud-services-team, Cloud-VPS, Ceph: [ceph,eqiad1] upgrade from quincy->reef (and bookworm) - https://phabricator.wikimedia.org/T404249#11210392 (Andrew) Open→Resolved ` root@cloudcephmon1004:~# ceph versions {     "mon": {         "ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b5810...
[15:01:41] <wikibugs>	 (open) damian: allow restarting job during patch [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/221 (https://phabricator.wikimedia.org/T403321)
[15:02:17] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[15:02:21] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)
[15:03:34] <jinxer-wm>	 FIRING: DiskSpace: Disk space cloudbackup1004:9100:/srv 6.99% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:03:36] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[15:03:38] <stashbot>	 andrew@cloudcumin1001: Failed to log message to wiki. Somebody should check the error logs.
[15:05:01] <wikibugs>	 (update) damian: update deployment via template hash change [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/220 (https://phabricator.wikimedia.org/T403321)
[15:05:59] <wikibugs>	 (update) damian: allow restarting job during patch [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/221 (https://phabricator.wikimedia.org/T403321)
[15:06:30] <wikibugs>	 cloud-services-team (FY2025/26-Q1), Cloud-VPS, DC-Ops, SRE: cloudcephosd10[48-52] service implementation - https://phabricator.wikimedia.org/T395910#11210428 (Andrew)
[15:11:49] <wikibugs>	 cloud-services-team (FY2025/26-Q1), Cloud-VPS, DC-Ops, SRE: cloudcephosd10[48-52] service implementation - https://phabricator.wikimedia.org/T395910#11210449 (Andrew) 1050 and 1051 won't be pooled immediately, they're being reserved for T405478
[15:13:34] <jinxer-wm>	 RESOLVED: DiskSpace: Disk space cloudbackup1004:9100:/srv 6.976% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:15:45] <wikibugs>	 cloud-services-team, Openstack-Magnum: ssh to cloud-vps 'utility' nodes (magnum, trove, octavia) - https://phabricator.wikimedia.org/T402317#11210466 (Andrew) >   - Install Trove root keys on cloudcontrols, as octavia root keys are already >   This is now done for octavia, trove, and PAWs.  Is there dema...
[15:32:52] <wikibugs>	 (open) damian: offload job restart logic to jobs-api [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/133 (https://phabricator.wikimedia.org/T403321)
[15:33:13] <wikibugs>	 cloud-services-team, Toolforge, Patch-For-Review: [components-api] restart rather than delete/create continuous jobs - https://phabricator.wikimedia.org/T403321#11210549 (DamianZaremba) Act 1: https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/218 (doesn't solve this, but so...
[15:35:17] <wikibugs>	 Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11210572 (dcaro) >>! In T404726#11210101, @MBH wrote: > I'm sorry, but what measurement unit is m?  That's 'milli-cpu', as in 1m = 0.001cpu (details here https://kubernetes...
[15:56:28] <wikibugs>	 (update) damian: allow restarting job during patch [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/221 (https://phabricator.wikimedia.org/T403321)
[16:02:46] <wikibugs>	 cloud-services-team, Data-Services, Community-Tech, Data-Platform-SRE, Multiblocks: Unexpected error "Subquery returns more than 1 row" on wiki replicas - https://phabricator.wikimedia.org/T404473#11210697 (Ottomata)
[16:08:25] <wikibugs>	 Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11210721 (DamianZaremba) I see the quota was bumped (https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/maintain-kubeusers/values/to...
[16:11:58] <wikibugs>	 cloud-services-team, Toolforge: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957#11210749 (Sakretsu) The job got stuck again around 2025-09-22T21:20:59Z. I can't say if it's related to the NFS issue.  ` tools.itwiki@tools-bastion-15:~/draftbot$ kubectl get po...
[16:19:12] <icinga-wm>	 PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 324 bytes in 60.014 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[16:21:10] <icinga-wm>	 RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 55.024 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[16:22:50] <wikibugs>	 (update) damian: update deployment via template hash change [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/220
[16:23:11] <wikibugs>	 (update) damian: update deployment via template hash change [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/220
[16:23:59] <wikibugs>	 (update) damian: allow restarting job during patch [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/221 (https://phabricator.wikimedia.org/T403321)
[16:24:25] <wikibugs>	 (update) damian: update deployment via template hash change [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/220
[16:25:28] <wikibugs>	 (update) raymond-ndibe: [deploy_task, tool_handlers] queue deployments to allow creation of multiple deployments at once [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/131 (https://phabricator.wikimedia.org/T402568)
[16:32:14] <wikibugs>	 cloud-services-team, Toolforge: toolforge buildservice based tool is not starting - https://phabricator.wikimedia.org/T405319#11210896 (dcaro) @santhosh can you retry now? Already changed the default resources for the cluster, things should go way smoother.
[16:34:07] <wikibugs>	 Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11210924 (dcaro) @DamianZaremba looking
[16:35:28] <wikibugs>	 cloud-services-team, Toolforge: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957#11210937 (DamianZaremba) https://k8s-status.toolforge.org/namespaces/tool-itwiki/pods/itwiki-draftbot-continuous-76fcff44b5-5q6wc/ shows this on tools-k8s-worker-nfs-73 which doe...
[16:46:21] <wikibugs>	 Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11211017 (dcaro) >>! In T404726#11210924, @dcaro wrote: > @DamianZaremba looking  Done, I think it probably has updated it already, but if not it will take a minute.
[16:46:58] <wikibugs>	 Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11211033 (DamianZaremba) > Done, I think it probably has updated it already, but if not it will take a minute. Looks good, thanks
[16:49:11] <wikibugs>	 cloud-services-team, Toolforge: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957#11211054 (dcaro) Yep, the process got stuck on NFS, restarting, will move the pod to a different node
[16:49:15] <wikibugs>	 cloud-services-team, Toolforge, Patch-For-Review: [components-api] restart rather than delete/create continuous jobs - https://phabricator.wikimedia.org/T403321#11211055 (DamianZaremba) @dcaro could you tag those as needing review when you have a min... sorry for picking you, but since this is ultima...
[16:50:40] <wm-bot2>	 !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-73 (T400957)
[16:50:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:50:46] <stashbot>	 T400957: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957
[16:53:40] <wikibugs>	 (update) raymond-ndibe: [deploy_task, tool_handlers] queue deployments to allow creation of multiple deployments at once [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/131 (https://phabricator.wikimedia.org/T402568)
[16:53:40] <wikibugs>	 Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11211111 (dcaro)
[16:55:13] <wm-bot2>	 !log tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17983740752 (https://github.com/cluebotng/component-configs/commits/refs/heads/main)
[16:55:14] <wm-bot2>	 !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17983737370 (https://github.com/cluebotng/component-configs/commits/refs/heads/main)
[16:55:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-review/SAL
[16:55:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL
[16:55:27] <wm-bot2>	 !log tools.cluebotng-monitoring Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17983743046 (https://github.com/cluebotng/component-configs/commits/refs/heads/main)
[16:55:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-monitoring/SAL
[16:56:12] <wm-bot2>	 !log tools.cluebotng-staging Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17983738823 (https://github.com/cluebotng/component-configs/commits/refs/heads/main)
[16:56:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-staging/SAL
[16:57:20] <wm-bot2>	 !log dcaro@acme tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-73 (T400957)
[16:57:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:57:28] <stashbot>	 T400957: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957
[16:59:49] <wikibugs>	 (update) dcaro: cli: ignore replicas if not sent back from API [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/129
[17:00:12] <wikibugs>	 (approved) dcaro: cli: ignore replicas if not sent back from API [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/129
[17:03:39] <wikibugs>	 (update) dcaro: cli: ignore replicas if not sent back from API [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/129
[17:03:47] <wikibugs>	 (merge) dcaro: cli: ignore replicas if not sent back from API [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/129
[17:04:37] <wikibugs>	 (open) dcaro: d/changelog: bump to 16.1.22 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/132 (https://phabricator.wikimedia.org/T390136)
[17:05:04] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli
[17:06:50] <wm-bot2>	 !log tools.cluebotng-monitoring Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17984009139 (https://github.com/cluebotng/component-configs/commits/refs/heads/main)
[17:06:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-monitoring/SAL
[17:06:52] <wm-bot2>	 !log tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17984009157 (https://github.com/cluebotng/component-configs/commits/refs/heads/main)
[17:06:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-review/SAL
[17:07:45] <wm-bot2>	 !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers for tools-k8s-worker-nfs-43
[17:07:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:08:12] <wm-bot2>	 !log tools.cluebotng-staging Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17984009170 (https://github.com/cluebotng/component-configs/commits/refs/heads/main)
[17:08:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-staging/SAL
[17:10:16] <wm-bot2>	 !log dcaro@acme tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers (exit_code=0) for tools-k8s-worker-nfs-43
[17:10:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:13:16] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli
[17:23:56] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli
[17:28:50] <wm-bot2>	 !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers for tools-k8s-worker-nfs-43
[17:28:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:30:46] <wm-bot2>	 !log dcaro@acme tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers (exit_code=0) for tools-k8s-worker-nfs-43
[17:30:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:32:43] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli
[17:32:46] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli
[17:38:03] <wmcs-alerts>	 FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-43 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[17:40:02] <wm-bot2>	 !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17984820831 (https://github.com/cluebotng/component-configs/commits/6f47ae931d95d85e2c3c1d6b42f1eabc6d3b1960)
[17:40:05] <wm-bot2>	 !log tools.cluebotng-editsets Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17984820847 (https://github.com/cluebotng/component-configs/commits/6f47ae931d95d85e2c3c1d6b42f1eabc6d3b1960)
[17:40:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL
[17:40:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-editsets/SAL
[17:40:23] <wm-bot2>	 !log tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17984820837 (https://github.com/cluebotng/component-configs/commits/6f47ae931d95d85e2c3c1d6b42f1eabc6d3b1960)
[17:40:24] <wm-bot2>	 !log tools.cluebotng-staging Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17984820843 (https://github.com/cluebotng/component-configs/commits/6f47ae931d95d85e2c3c1d6b42f1eabc6d3b1960)
[17:40:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-review/SAL
[17:40:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-staging/SAL
[17:40:26] <wm-bot2>	 !log tools.cluebotng-monitoring Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17984820840 (https://github.com/cluebotng/component-configs/commits/6f47ae931d95d85e2c3c1d6b42f1eabc6d3b1960)
[17:40:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-monitoring/SAL
[17:41:23] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli
[17:48:45] <wm-bot2>	 !log tools.cluebotng-editsets Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985011099 (https://github.com/cluebotng/component-configs/commits/f38bf48e73ce94da03cee36fb6cccbe483786e19)
[17:48:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-editsets/SAL
[17:52:57] <wm-bot2>	 !log tools.cluebotng-editsets Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985077157 (https://github.com/cluebotng/component-configs/commits/955cd61533af834d56d36117a81643d8ab9ba81f)
[17:52:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-editsets/SAL
[17:52:58] <wm-bot2>	 !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985077117 (https://github.com/cluebotng/component-configs/commits/955cd61533af834d56d36117a81643d8ab9ba81f)
[17:52:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL
[17:57:29] <wm-bot2>	 !log tools.cluebotng-monitoring Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985198507 (https://github.com/cluebotng/component-configs/commits/cfa2541734b05a9da326bbeab2e82cc21d6e91e4)
[17:57:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-monitoring/SAL
[17:57:36] <wm-bot2>	 !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985198515 (https://github.com/cluebotng/component-configs/commits/cfa2541734b05a9da326bbeab2e82cc21d6e91e4)
[17:57:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL
[17:57:38] <wm-bot2>	 !log tools.cluebotng-editsets Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985198509 (https://github.com/cluebotng/component-configs/commits/cfa2541734b05a9da326bbeab2e82cc21d6e91e4)
[17:57:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-editsets/SAL
[17:58:35] <wm-bot2>	 !log tools.cluebotng-review Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985198575 (https://github.com/cluebotng/component-configs/commits/cfa2541734b05a9da326bbeab2e82cc21d6e91e4)
[17:58:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-review/SAL
[17:59:51] <wm-bot2>	 !log tools.cluebotng-staging Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985198519 (https://github.com/cluebotng/component-configs/commits/cfa2541734b05a9da326bbeab2e82cc21d6e91e4)
[17:59:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-staging/SAL
[18:03:03] <wmcs-alerts>	 RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-43 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[18:05:00] <wm-bot2>	 !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985333827 (https://github.com/cluebotng/component-configs/commits/24f56031bca40f808e05920eb128fd892a3d9b92)
[18:05:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL
[18:05:18] <wm-bot2>	 !log tools.cluebotng-editsets Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985333851 (https://github.com/cluebotng/component-configs/commits/24f56031bca40f808e05920eb128fd892a3d9b92)
[18:05:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-editsets/SAL
[18:07:27] <wikibugs>	 (update) raymond-ndibe: loki.alloy: decrease frequency for fetching logs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/962 (owner: dcaro)
[18:08:00] <wm-bot2>	 !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985386789 (https://github.com/cluebotng/component-configs/commits/a1633fc7ecf1bd4a310fcb03d02d0da81212427a)
[18:08:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL
[18:11:13] <wm-bot2>	 !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985446508 (https://github.com/cluebotng/component-configs/commits/34320f16093c0d4160ce346108bc14bfb95245f8)
[18:11:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL
[18:14:47] <wikibugs>	 cloud-services-team, Toolforge: [builds-api] return image digest - https://phabricator.wikimedia.org/T403322#11211409 (DamianZaremba) Pretty sure I just observed image caching in the wild ` tools.cluebotng-trainer@tools-bastion-15:~$ toolforge build show Build ID: cluebotng-trainer-buildpacks-pipelinerun...
[18:15:27] <wikibugs>	 cloud-services-team, Toolforge: [builds-api] return image digest - https://phabricator.wikimedia.org/T403322#11211410 (DamianZaremba) I'll try and have a look at getting this done once T403321 is merged.
[18:19:54] <wm-bot2>	 !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985728221 (https://github.com/cluebotng/component-configs/commits/34320f16093c0d4160ce346108bc14bfb95245f8)
[18:19:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL
[18:23:06] <wikibugs>	 cloud-services-team, Toolforge: [builds-api] return image digest - https://phabricator.wikimedia.org/T403322#11211459 (DamianZaremba) ` tools.cluebotng-trainer@tools-bastion-15:~$ toolforge components deployment show Warning: You are using a beta feature of Toolforge. Deployment ID: 20250924-181937-2kh9i...
[18:25:20] <wikibugs>	 cloud-services-team, Toolforge: [builds-api] return image digest - https://phabricator.wikimedia.org/T403322#11211463 (DamianZaremba) Ah, so this is because the component changed from `trainer` to `coordinator` but the previous `CronJob` was not deleted... so false alarm on caching
[18:26:24] <wikibugs>	 cloud-services-team, Toolforge: [builds-api] return image digest - https://phabricator.wikimedia.org/T403322#11211467 (DamianZaremba) Just for any casual observer, it is actually working; ` tools.cluebotng-trainer@tools-bastion-15:~$ kubectl create job --from=cronjob/coordinator coordinator job.batch/coo...
[18:31:28] <wm-bot2>	 !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17985989347 (https://github.com/cluebotng/component-configs/commits/38b2f2235d143f0522d560a93c04e50e8fa38c77)
[18:31:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL
[18:34:27] <wm-bot2>	 !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17986015781 (https://github.com/cluebotng/component-configs/commits/b0f088ec49a2156a18ad7d78d18640f6e8fe943c)
[18:34:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL
[18:36:30] <wm-bot2>	 !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/17986145339 (https://github.com/cluebotng/component-configs/commits/820291b11edc6256cfa07e9a4a73677df86e52d8)
[18:36:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL
[18:40:16] <wikibugs>	 (update) raymond-ndibe: loki.alloy: decrease frequency for fetching logs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/962 (owner: dcaro)
[18:40:18] <wikibugs>	 (approved) raymond-ndibe: loki.alloy: decrease frequency for fetching logs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/962 (owner: dcaro)
[18:48:52] <wikibugs>	 (approved) raymond-ndibe: jobs: lower the default cpu limit to 1cpu [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/217 (owner: dcaro)
[18:51:37] <wikibugs>	 (update) raymond-ndibe: logs: use logs-api for logs [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/121 (owner: dcaro)
[18:55:03] <wikibugs>	 (update) raymond-ndibe: logs-api: add new component [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/911 (owner: dcaro)
[19:06:14] <wikibugs>	 cloud-services-team (FY2025/26-Q1), Cloud-VPS, DC-Ops, SRE, Patch-For-Review: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478#11211573 (cmooney) I can confirm the switch is already set to accept tagged traffic for t...
[19:07:09] <wikibugs>	 cloud-services-team, Cloud-VPS, Ceph: Review RAM allocation for cloudceph OSDs - https://phabricator.wikimedia.org/T404747#11211578 (Andrew) Here's another example: Cloudcephosd1016 is using around 29G of RAM according to grafana.   `  -7          13.97253          host cloudcephosd1016...
[19:31:30] <wikibugs>	 cloud-services-team, Cloud-VPS, Ceph: Review RAM allocation for cloudceph OSDs - https://phabricator.wikimedia.org/T404747#11211697 (Andrew) So... it doesn't look like ceph will make use of more RAM even if we offer it up. If it did, changing the cluster-wide setting from 6 to 8 would run the risk of...
[19:48:59] <wikibugs>	 cloud-services-team, Cloud-VPS, Ceph: Review RAM allocation for cloudceph OSDs - https://phabricator.wikimedia.org/T404747#11211733 (Andrew) For now I'm leaving codfw1 with osd_memory_target_autotune=true and eqiad1 with osd_memory_target_autotune=false and osd_memory_target=6442450944 -- I'm not con...
[20:03:03] <wmcs-alerts>	 FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[20:13:09] <wikibugs>	 cloud-services-team (FY2025/26-Q1), Cloud-VPS, DC-Ops, SRE, Patch-For-Review: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478#11211785 (Andrew) @fgiunchedi, 1050 and 1051 should already be fully puppetized with Ceph...
[20:14:32] <wikibugs>	 (PS1) Andrew Bogott: inventory: update expected ceph version to 18 [cloud/wmcs-cookbooks] - https:
[23:08:58] <wikibugs>	 cloud-services-team, Toolforge (Toolforge iteration 24): Update maintain_kubeusers to use the toolstate database - https://phabricator.wikimedia.org/T334629#11212421 (Raymond_Ndibe)
[23:18:44] <wikibugs>	 (PS1) Stevemunene: idp: Add dummy data for airflow-wikidata [labs/private] - https://gerrit.wikimedia.org/r/1191190 (https://phabricator.wikimedia.org/T404073)
[23:30:18] <wikibugs>	 (open) damian: getBuild - add digest to image [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/144 (https://phabricator.wikimedia.org/T403322)