[06:09:03] (CR) Ayounsi: [V:+2 C:+2] "Real password added according to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180856/comments/6d1cac5c_3c56b286" [labs/private] - https://gerrit.wikimedia.org/r/1180855 (https://phabricator.wikimedia.org/T402511) (owner: Ayounsi)
[08:10:18] (CR) Essa237: [C:+1] add seamless multiple searches without needing to refresh or remount the component [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1180997 (https://phabricator.wikimedia.org/T397019) (owner: Jacob4code)
[08:28:10] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0)
[10:03:21] !log taavi@cloudcumin1001 bastion START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'bastion-eqiad1'
[10:04:53] cloud-services-team: SystemdUnitDown The systemd unit wmf_auto_restart_systemd-timesyncd.service on node cloudnet1005 has been failing for more than two hours. - https://phabricator.wikimedia.org/T402575#11110040 (taavi) Open→Resolved a:Andrew
[10:12:04] !log taavi@cloudcumin1001 bastion END (ERROR) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=97) with prefix 'bastion-eqiad1'
[10:12:45] !log taavi@cloudcumin1001 bastion START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'bastion-eqiad1'
[10:12:48] cloud-services-team, Cloud-VPS (Debian Bullseye Deprecation), IPv6: Refresh Cloud VPS bastions to run on Trixie and enable IPv6 - https://phabricator.wikimedia.org/T392689#11110085 (taavi) a:taavi
[10:25:32] (open) samwilson: Update repo URL in toolinfo.json [toolforge-repos/wsexport] - https://gitlab.wikimedia.org/toolforge-repos/wsexport/-/merge_requests/3
[10:25:32] !log taavi@cloudcumin1001 bastion END (PASS) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=0) with prefix 'bastion-eqiad1'
[10:31:47] !log taavi@cloudcumin1001 bastion START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'bastion-eqiad1'
[10:45:36] !log taavi@cloudcumin1001 bastion END (PASS) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=0) with prefix 'bastion-eqiad1'
[11:24:50] (update) samwilson: Update repo URL in toolinfo.json [toolforge-repos/wsexport] - https://gitlab.wikimedia.org/toolforge-repos/wsexport/-/merge_requests/3
[12:26:30] cloud-services-team, Toolforge (Toolforge iteration 23): [jobs-api] make job status an enum, with clearly defined states - https://phabricator.wikimedia.org/T401172#11110418 (DamianZaremba) This sounds like a good improvement. Just a question regarding `inconsistent`/`up_to_date` - I can't quite parse "...
[13:07:27] cloud-services-team, Toolforge: [Build service] latest builder has old PHP - https://phabricator.wikimedia.org/T401875#11110508 (DamianZaremba) ` [step-build] 2025-08-22T12:49:38.860499727Z [Installing platform packages] [step-build] 2025-08-22T12:49:39.203935919Z No composer.lock file present. Updating...
[13:14:18] FIRING: KernelErrors: Server cloudcephosd1048 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1048 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[13:14:29] cloud-services-team: KernelErrors Server cloudcephosd1048 logged kernel errors - https://phabricator.wikimedia.org/T402646 (phaultfinder) NEW
[13:18:48] cloud-services-team, Toolforge: `toolforge build start` returns success status on build failure - https://phabricator.wikimedia.org/T402648 (DamianZaremba) NEW
[13:20:55] (update) damian: Allow re-using builds across components [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/118 (https://phabricator.wikimedia.org/T401893)
[13:21:43] (update) damian: Allow re-using builds across components [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/118 (https://phabricator.wikimedia.org/T401893)
[13:27:19] (update) damian: Allow re-using builds across components [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/118 (https://phabricator.wikimedia.org/T401893)
[13:44:06] (update) damian: Allow re-using builds across components [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/118 (https://phabricator.wikimedia.org/T401893)
[13:45:20] Toolforge (Toolforge iteration 23), Patch-For-Review: [components-api] Allow reusing another component build - https://phabricator.wikimedia.org/T401893#11110676 (DamianZaremba) I think https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/118 is good for review. Will contin...
[13:48:49] (update) damian: Allow re-using builds across components [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/118 (https://phabricator.wikimedia.org/T401893)
[14:15:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T395910)
[14:15:23] T395910: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910
[14:18:18] PROBLEM - Host cloudcephosd1048 is DOWN: PING CRITICAL - Packet loss = 100%
[14:19:42] RECOVERY - Host cloudcephosd1048 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[14:20:17] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T395910)
[14:20:24] T395910: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910
[14:22:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:24:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T395910)
[14:26:53] (open) damian: Add basic instructions for deploying into lima-kilo [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/120
[14:27:44] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) (T395910)
[14:28:09] T395910: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910
[14:29:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node
[14:29:54] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97)
[14:29:59] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node
[14:31:06] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0)
[14:31:09] FIRING: CephSlowOps: Ceph cluster in eqiad has 931 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[14:31:18] cloud-services-team: CephSlowOps Ceph cluster in eqiad has 931 slow ops - https://phabricator.wikimedia.org/T402656 (phaultfinder) NEW
[14:32:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:35:30] PROBLEM - Host cloudcephosd1049 is DOWN: PING CRITICAL - Packet loss = 100%
[14:36:09] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 931 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[14:36:12] RECOVERY - Host cloudcephosd1049 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[14:39:45] FIRING: Primary cloud switch port utilisation over 80%: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Primary cloud switch port utilisation over 80% - https://alerts.wikimedia.org/?q=alertname%3DPrimary+cloud+switch+port+utilisation+over+80%25
[14:39:46] FIRING: Primary cloud switch inbound port utilisation over 80%: Alert for device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet - Primary cloud switch inbound port utilisation over 80% - https://alerts.wikimedia.org/?q=alertname%3DPrimary+cloud+switch+inbound+port+utilisation+over+80%25
[14:39:50] cloud-services-team: Primary cloud switch inbound port utilisation over 80% Alert for device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet - Primary cloud switch inbound port utilisation over 80% - https://phabricator.wikimedia.org/T402658 (phaultfinder) NEW
[14:39:51] cloud-services-team: Primary cloud switch port utilisation over 80% Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Primary cloud switch port utilisation over 80% - https://phabricator.wikimedia.org/T402657 (phaultfinder) NEW
[14:40:39] (update) damian: Allow re-using builds across components [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/118 (https://phabricator.wikimedia.org/T401893)
[14:44:45] RESOLVED: Primary cloud switch port utilisation over 80%: Device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet recovered from Primary cloud switch port utilisation over 80% - https://alerts.wikimedia.org/?q=alertname%3DPrimary+cloud+switch+port+utilisation+over+80%25
[14:44:46] RESOLVED: Primary cloud switch inbound port utilisation over 80%: Device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet recovered from Primary cloud switch inbound port utilisation over 80% - https://alerts.wikimedia.org/?q=alertname%3DPrimary+cloud+switch+inbound+port+utilisation+over+80%25
[15:01:32] cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T402475#11110921 (Andrew) p:Medium→Low a:Andrew→fnegri @fnegri investigated this and determined that it's not important. I'll let him summarize here.
[15:03:52] cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T402475#11110928 (fnegri) Open→Resolved I think the errors were a race condition while Puppet was installing the `lvm2` package: Puppet started installing things at 02:13:17: ` Aug 21 02:13:17 cloudcephosd1048 puppet-agent[34...
[15:04:44] (update) damian: Allow re-using builds across components [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/118 (https://phabricator.wikimedia.org/T401893)
[15:14:59] cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T402475#11110964 (fnegri) Ok I have a smoking gun for the race condition. The script file (`lvm2-activation-generator`) and the config file (`lvm.conf`) are both installed by the `lvm2` package, but they were written a few seconds apar...
[15:26:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[15:34:13] cloud-services-team, Cloud-VPS (Quota-requests), Catalyst: Quota increase request for catalyst-dev - https://phabricator.wikimedia.org/T402521#11111029 (thcipriani) The idea behind the quota increase is to use our real workloads within our staging. The thinking behind parity is to pull over our exis...
[15:49:41] cloud-services-team: KernelErrors Server cloudcephosd1048 logged kernel errors - https://phabricator.wikimedia.org/T402646#11111064 (fnegri) Open→Resolved a:fnegri I triggered these errors while debugging {T402475}. I tried to run the script manually but I got the arguments wrong. ` Aug 22 1...
[17:05:03] (open) damian: kubectl alias - use blockinfile [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/262
[17:22:00] (update) damian: kubectl alias - use blockinfile [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/262
[17:22:16] (update) damian: kubectl alias - use blockinfile [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/262
[17:23:27] (open) damian: install-binary-from-url - add checksums for dest [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/263
[17:28:31] (update) damian: install-binary-from-url - add checksums for dest [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/263
[17:30:20] cloud-services-team, Openstack-Magnum: ssh to cloud-vps 'utility' nodes (magnum, trove, octavia) - https://phabricator.wikimedia.org/T402317#11111293 (Andrew) I documented the VM types here: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/VM_access#Trove_database_instances There are a few ch...
[17:35:44] cloud-services-team, Toolforge: [lima-kilo] Improve convergence - https://phabricator.wikimedia.org/T402672 (DamianZaremba) NEW
[17:42:22] (open) damian: docker - move restart to handler [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/264
[17:44:14] Cloud-VPS (Project-requests): Request creation of eseap VPS project - https://phabricator.wikimedia.org/T401957#11111358 (Robertsky) I am aware of the extra effort and responsibilities required to self manage and willing to take it up.
[17:53:49] (open) damian: harbor - only download and setup once [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/265
[17:54:39] (update) damian: harbor - only download and setup once [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/265
[17:55:55] (update) damian: harbor - only download and setup once [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/265
[18:03:26] (open) damian: deploy components - don't report as changed [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/266
[18:18:09] (open) damian: harbor - move restart to handler [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/267
[18:19:10] (update) damian: docker - move restart to handler [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/264
[18:26:16] (open) damian: tool home dir - update permissions [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/268
[18:48:24] cloud-services-team, Toolforge: [lima-kilo] Improve convergence - https://phabricator.wikimedia.org/T402672#11111582 (DamianZaremba) Initial changes: * https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/262 - prevent duplicate bashrc entries * https://gitlab.wikimedia.org/repos...
[18:48:46] cloud-services-team, Toolforge: [lima-kilo] Improve convergence - https://phabricator.wikimedia.org/T402672#11111584 (DamianZaremba) Creating sub-tasks for review
[18:51:29] cloud-services-team, Toolforge: kubectl alias and auto-complete is duplicated - https://phabricator.wikimedia.org/T402683 (DamianZaremba) NEW
[18:53:55] cloud-services-team, Toolforge: Only download artefacts if target binary checksum does not match - https://phabricator.wikimedia.org/T402684 (DamianZaremba) NEW
[18:55:28] cloud-services-team, Toolforge: Only download & setup harbor once - https://phabricator.wikimedia.org/T402685 (DamianZaremba) NEW
[18:57:19] cloud-services-team, Toolforge: Only restart docker if the config has changed - https://phabricator.wikimedia.org/T402687 (DamianZaremba) NEW
[19:01:27] cloud-services-team, Toolforge: tool dirs are created with different permissions than maintain-kubeusers - https://phabricator.wikimedia.org/T402688 (DamianZaremba) NEW
[19:04:12] cloud-services-team, Toolforge: toolforge components are reported as changed on every run - https://phabricator.wikimedia.org/T402689 (DamianZaremba) NEW
[19:05:46] cloud-services-team, Toolforge: Only download & setup harbor once - https://phabricator.wikimedia.org/T402685#11111695 (DamianZaremba) I imagine this is deployed from a container image in production, so perhaps it should move into a component like setup similar to foxtrot-ldap, rather than the current sc...
[19:06:24] cloud-services-team, Toolforge: [lima-kilo] Improve convergence - https://phabricator.wikimedia.org/T402672#11111697 (DamianZaremba) Creating sub-tasks for review, let me know if you want to discuss any of this in more detail.
[19:13:40] Toolforge (Toolforge iteration 23), Patch-For-Review: [components-api] Allow reusing another component build - https://phabricator.wikimedia.org/T401893#11111725 (DamianZaremba) Managed to get lima-kilo built. Both manual tests and the functional tests seem to be good: * https://gitlab.wikimedia.org/rep...
[19:34:04] (PS1) Andrew Bogott: Add new dummy ssh keys for trove VMs [labs/private] - https://gerrit.wikimedia.org/r/1181184 (https://phabricator.wikimedia.org/T402317)
[19:35:14] (CR) Andrew Bogott: [V:+2 C:+2] Add new dummy ssh keys for trove VMs [labs/private] - https://gerrit.wikimedia.org/r/1181184 (https://phabricator.wikimedia.org/T402317) (owner: Andrew Bogott)
[19:57:30] cloud-services-team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11111861 (Andrew)
[20:27:14] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99)
[20:34:18] FIRING: KernelErrors: Server cloudcephosd1048 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1048 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[20:34:30] cloud-services-team: KernelErrors Server cloudcephosd1048 logged kernel errors - https://phabricator.wikimedia.org/T402699 (phaultfinder) NEW
[20:44:16] Cloud-VPS (Debian Bullseye Deprecation), The-Wikipedia-Library, Epic, Moderator-Tools-Team (Kanban): hashtags: Replace deprecated Bullseye VM in Cloud VPS - https://phabricator.wikimedia.org/T402056#11111982 (jsn.sherman) WIP progress patch that you can follow: https://github.com/WikipediaLibrary...
[23:43:34] FIRING: DiskSpace: Disk space cloudbackup1004:9100:/srv 6.905% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace