[01:51:41] FIRING: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:51:42] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:02:45] Data-Services, DBA: Prepare and check storage layer for nrwiki - https://phabricator.wikimedia.org/T375101#10159503 (ABran-WMF) a:ABran-WMF
[07:02:50] Data-Services, DBA: Prepare and check storage layer for gorwikiquote - https://phabricator.wikimedia.org/T375094#10159504 (ABran-WMF) a:ABran-WMF
[07:02:57] Data-Services, DBA: Prepare and check storage layer for madwiktionary - https://phabricator.wikimedia.org/T375023#10159505 (ABran-WMF) a:ABran-WMF
[07:03:11] Data-Services, DBA: Prepare and check storage layer for rskwiki - https://phabricator.wikimedia.org/T375016#10159506 (ABran-WMF) a:ABran-WMF
[07:03:18] Data-Services, DBA: Prepare and check storage layer for kgewiki - https://phabricator.wikimedia.org/T374814#10159507 (ABran-WMF) a:ABran-WMF
[07:05:26] cloud-services-team, Cloud-VPS: sssd permanent failure on integration-agent-docker-1029 - https://phabricator.wikimedia.org/T324934#10159508 (hashar) Open→Resolved I went on `integration-cumin.integration.eqiad1.wikimedia.cloud` to find an instance that had the same error using: ` $ sudo cumi...
[07:40:34] VPS-project-Codesearch: Codesearch should index mw-node-qunit on Github - https://phabricator.wikimedia.org/T375079#10159544 (Aklapper) Declined→Open Looks like I totally misremembered `T321402#8546665` (manual option available) thus reopening. Sorry!
[07:52:00] Toolforge (Toolforge iteration 14): kyverno prevents deletion of pods that violates it's policies - https://phabricator.wikimedia.org/T375157 (Raymond_Ndibe) NEW
[07:52:09] Toolforge (Toolforge iteration 14): kyverno prevents deletion of pods that violates it's policies - https://phabricator.wikimedia.org/T375157#10159593 (Raymond_Ndibe) a:Raymond_Ndibe
[07:52:34] Toolforge (Toolforge iteration 14): kyverno prevents deletion of pods that violates its policies - https://phabricator.wikimedia.org/T375157#10159596 (Raymond_Ndibe)
[07:53:03] Toolforge (Toolforge iteration 14): kyverno prevents deletion of pods that violates its policies - https://phabricator.wikimedia.org/T375157#10159597 (Raymond_Ndibe)
[07:53:41] Toolforge (Toolforge iteration 14): kyverno prevents deletion of pods that violates its policies - https://phabricator.wikimedia.org/T375157#10159598 (Raymond_Ndibe)
[08:00:26] Toolforge (Toolforge iteration 14): kyverno prevents deletion of pods that violates its policies - https://phabricator.wikimedia.org/T375157#10159604 (aborrero)
[08:07:39] Toolforge (Toolforge iteration 14): add --force to wmcs.toolforge.remove_k8s_node cookbook - https://phabricator.wikimedia.org/T375158 (Raymond_Ndibe) NEW
[08:09:05] Toolforge (Toolforge iteration 14): add --force to wmcs.toolforge.remove_k8s_node cookbook - https://phabricator.wikimedia.org/T375158#10159626 (Raymond_Ndibe)
[08:09:27] Toolforge (Toolforge iteration 14): add --force to wmcs.toolforge.remove_k8s_node cookbook - https://phabricator.wikimedia.org/T375158#10159627 (Raymond_Ndibe) a:Raymond_Ndibe
[08:13:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[08:13:30] cloud-services-team, Cloud-VPS: openstack: clarify default security group semantics - https://phabricator.wikimedia.org/T375111#10159628 (aborrero) Open→In progress p:Triage→Medium
[08:39:46] Toolforge (Toolforge iteration 14): add --force to wmcs.toolforge.remove_k8s_node cookbook - https://phabricator.wikimedia.org/T375158#10159694 (Raymond_Ndibe)
[08:47:11] RESOLVED: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:52:49] VPS-project-Codesearch: Codesearch should index mw-node-qunit on Github - https://phabricator.wikimedia.org/T375079#10159735 (Ebrahim)
[08:52:59] Toolforge (Toolforge iteration 14): lima-kilo installation giving inconsistent result. Sometimes it works, sometimes it doesn't - https://phabricator.wikimedia.org/T375163 (Raymond_Ndibe) NEW
[08:59:06] (update) raymond-ndibe: Draft: [maintain-kubeusers] kyverno do not validate DELETE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[08:59:31] (update) raymond-ndibe: Draft: [maintain-kubeusers] kyverno do not validate DELETE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[08:59:51] (PS1) Ebrahim: Add mw-node-qunit project [labs/codesearch] - https://gerrit.wikimedia.org/r/1074107 (https://phabricator.wikimedia.org/T375079)
[09:07:08] (open) aborrero: secgroups: add optional default security group [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50 (https://phabricator.wikimedia.org/T375111)
[09:07:16] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50
[09:07:25] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50
[09:07:47] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50
[09:07:47] (update) aborrero: secgroups: add optional default security group [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50 (https://phabricator.wikimedia.org/T375111)
[09:08:02] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50
[09:10:44] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50
[09:10:45] (update) aborrero: secgroups: add optional default security group [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50 (https://phabricator.wikimedia.org/T375111)
[09:11:19] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50
[09:16:44] (update) aborrero: secgroups: add optional default security group [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50 (https://phabricator.wikimedia.org/T375111)
[09:17:54] (update) aborrero: secgroups: add optional default security group [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50 (https://phabricator.wikimedia.org/T375111)
[09:18:23] (update) aborrero: secgroups: add optional default security group [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50 (https://phabricator.wikimedia.org/T375111)
[09:21:01] (update) aborrero: secgroups: add optional default security group [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50 (https://phabricator.wikimedia.org/T375111)
[09:24:00] (update) aborrero: secgroups: add optional default security group [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50 (https://phabricator.wikimedia.org/T375111)
[09:25:18] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50
[09:25:51] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50
[09:27:06] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50
[09:27:07] (update) aborrero: secgroups: add optional default security group [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50 (https://phabricator.wikimedia.org/T375111)
[09:27:35] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50
[09:35:29] cloud-services-team, Cloud-VPS, Patch-For-Review: openstack: clarify default security group semantics - https://phabricator.wikimedia.org/T375111#10159868 (aborrero) >>! In T375111#10159813, @CodeReviewBot wrote: > aborrero opened https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_...
[09:48:55] cloud-services-team, Cloud-VPS, Patch-For-Review: openstack: clarify default security group semantics - https://phabricator.wikimedia.org/T375111#10159895 (aborrero) So I think the strategy could be: * Assume neutron will: always create a default SG, always self manage this group, with the very basi...
[09:50:19] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 14): [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641#10159899 (taavi) Anything left to do here? Or can this task be closed?
[09:55:41] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node (T374043)
[09:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[09:55:46] T374043: Drain C8 rack - https://phabricator.wikimedia.org/T374043
[09:56:31] !log dcaro@urcuchillay admin END (ERROR) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=97) (T374043)
[09:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:03:00] (open) aborrero: secgroups: enable delete_default_rules [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/51 (https://phabricator.wikimedia.org/T375111)
[10:03:01] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/51
[10:03:32] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/51
[10:05:48] (update) aborrero: secgroups: add optional default security group [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50 (https://phabricator.wikimedia.org/T375111)
[10:06:12] Cloud-VPS, observability, Observability-Logging, SRE, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710#10159964 (fgiunchedi)
[10:06:25] cloud-services-team, Cloud-VPS, observability, SRE, Patch-For-Review: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623#10159965 (fgiunchedi)
[10:07:01] Cloud-VPS, observability, Observability-Logging, SRE, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710#10159968 (fgiunchedi)
[10:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:21:24] cloud-services-team (FY2024/2025-Q1-Q2): cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10160017 (cmooney) Open→Resolved Things seem stable with this now so I will close the task.
[10:21:55] cloud-services-team, Cloud-VPS, Patch-For-Review: openstack: clarify default security group semantics - https://phabricator.wikimedia.org/T375111#10160020 (aborrero) mmm not that easy. When creating a VM, they will be automatically added to the default secgroup. If we don't have the SSH rules in thi...
[10:22:06] (update) aborrero: secgroups: add optional default security group [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50 (https://phabricator.wikimedia.org/T375111)
[10:51:40] Toolforge-standards-committee, WMF-NDA-Requests: Volunteer NDA for SD0001 - https://phabricator.wikimedia.org/T374998#10160109 (Lferreira) I support the nomination of @SD0001
[10:52:35] Toolforge-standards-committee, WMF-NDA-Requests: Volunteer NDA for Lucas Werkmeister - https://phabricator.wikimedia.org/T375001#10160111 (Lferreira)
[10:55:51] Toolforge-standards-committee, WMF-NDA-Requests: Volunteer NDA for SD0001 - https://phabricator.wikimedia.org/T374998#10160130 (Lferreira) I support the nomination of @SD0001
[10:56:14] Toolforge-standards-committee, WMF-NDA-Requests: Volunteer NDA for Lucas Werkmeister - https://phabricator.wikimedia.org/T375001#10160132 (Lferreira) I support the nomination of @LucasWerkmeister
[10:56:46] Toolforge-standards-committee, WMF-NDA-Requests: Volunteer NDA for TheProtonade - https://phabricator.wikimedia.org/T375007#10160137 (Lferreira) I support the nomination of @theprotonade
[10:57:08] Toolforge-standards-committee, WMF-NDA-Requests: Volunteer NDA for JJMC89 - https://phabricator.wikimedia.org/T375041#10160140 (Lferreira) I support the nomination of @JJMC89
[10:57:46] Toolforge-standards-committee, WMF-NDA-Requests: Volunteer NDA for Waldir Pimenta (Waldyrious) - https://phabricator.wikimedia.org/T375110#10160141 (Lferreira) I support the nomination of @waldyrious
[10:58:00] Toolforge-standards-committee, WMF-NDA-Requests: Volunteer NDA for Antonin Delpeuch (Pintoch) - https://phabricator.wikimedia.org/T374995#10160144 (Lferreira) I support the nomination of @Pintoch
[11:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:04:09] (PS1) Btullis: Add dummy secrets for the rclone backup on db1208 [labs/private] - https://gerrit.wikimedia.org/r/1074143 (https://phabricator.wikimedia.org/T372908)
[11:21:38] (CR) Btullis: [V:+2 C:+2] Add dummy secrets for the rclone backup on db1208 [labs/private] - https://gerrit.wikimedia.org/r/1074143 (https://phabricator.wikimedia.org/T372908) (owner: Btullis)
[12:00:41] Toolforge-standards-committee (Maintainer needed), Tools, Wikidata: Bring Bene's sparql tool back to life - https://phabricator.wikimedia.org/T223858#10160385 (Pintoch) @waldyrious hello! I'm looking at the backlog of tools needing maintainers in https://phabricator.wikimedia.org/project/view/2952/ a...
[12:02:00] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 14): [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641#10160404 (dcaro) >>! In T359641#10159899, @taavi wrote: > Anything left to do here? Or can this task be closed? We were waitin...
[12:17:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-db04 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:22:28] FIRING: [4x] PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-acme-chief-02 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:23:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance runner-1029 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:24:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance proxy-04 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:26:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-etcd-22 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:28:00] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-redis-5 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:31:28] FIRING: [3x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-4 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:33:00] FIRING: [5x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:33:28] FIRING: [3x] PuppetAgentNoResources: No Puppet resources found on instance runner-1027 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:34:28] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance project-proxy-acme-chief-02 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:36:28] FIRING: [5x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-4 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:37:28] FIRING: [5x] PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-acme-chief-02 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:38:00] FIRING: [6x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:38:28] FIRING: [5x] PuppetAgentNoResources: No Puppet resources found on instance runner-1025 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:39:28] FIRING: [3x] PuppetAgentNoResources: No Puppet resources found on instance project-proxy-acme-chief-02 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:41:28] FIRING: [8x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-4 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:42:28] FIRING: [7x] PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-acme-chief-02 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:43:00] FIRING: [7x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:43:28] FIRING: [6x] PuppetAgentNoResources: No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:45:24] cloud-services-team, Cloud-VPS, Patch-For-Review: openstack: clarify default security group semantics - https://phabricator.wikimedia.org/T375111#10160484 (aborrero) Let me try to put things a bit more clear on what the problem is, and what we would like to achieve. === Use cases === Use case 1: 1....
[12:46:28] FIRING: [11x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-4 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:47:28] FIRING: [10x] PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-acme-chief-02 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:48:00] FIRING: [8x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:48:28] FIRING: [10x] PuppetAgentNoResources: No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:51:28] FIRING: [12x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-3 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:52:28] FIRING: [10x] PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-acme-chief-02 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:53:00] FIRING: [10x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:53:28] FIRING: [10x] PuppetAgentNoResources: No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:53:58] cloud-services-team, DC-Ops, ops-eqiad, SRE: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740#10160510 (VRiley-WMF) Is there an acceptable time to swap out the DIMM? We can proceed at any time.
[12:54:28] FIRING: [3x] PuppetAgentNoResources: No Puppet resources found on instance project-proxy-acme-chief-02 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:56:28] FIRING: [14x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-3 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:57:28] FIRING: [10x] PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-acme-chief-02 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:58:00] FIRING: [11x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:58:28] FIRING: [10x] PuppetAgentNoResources: No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:00:41] Toolforge-standards-committee (Maintainer needed): wikistream.toolforge.org needs new maintainers - https://phabricator.wikimedia.org/T251555#10160564 (Pintoch) p:Triage→Low I'm tagging this as low priority given that there are alternatives such as https://tools.wmflabs.org/event-streams as mentioned...
[13:01:28] FIRING: [12x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-3 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:02:28] FIRING: [10x] PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-acme-chief-02 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:03:00] FIRING: [9x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:03:28] FIRING: [10x] PuppetAgentNoResources: No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:06:28] FIRING: [11x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-3 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:07:28] RESOLVED: [10x] PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-acme-chief-02 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:08:00] FIRING: [6x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-legacy-redirector-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:08:28] FIRING: [8x] PuppetAgentNoResources: No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:09:28] RESOLVED: [2x] PuppetAgentNoResources: No Puppet resources found on instance proxy-03 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:11:28] RESOLVED: [9x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-3 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:13:00] RESOLVED: [5x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-legacy-redirector-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:18:28] RESOLVED: [5x] PuppetAgentNoResources: No Puppet resources found on instance runner-1022 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:28:47] wikitech.wikimedia.org: Add.svg image shows up as “Add patch” redlink on Wikitech deployment calendar - https://phabricator.wikimedia.org/T375193 (Lucas_Werkmeister_WMDE) NEW
[13:32:56] wikitech.wikimedia.org: Add.svg image shows up as “Add patch” redlink on Wikitech deployment calendar - https://phabricator.wikimedia.org/T375193#10160765 (Lucas_Werkmeister_WMDE) Hm, after a purge it went away again. Transient issue? Will close later if nothing else comes up.
[13:41:26] (approved) raymond-ndibe: [jobs-cli] update autocomplete and man files [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/66
[13:41:42] (merge) raymond-ndibe: [jobs-cli] update autocomplete and man files [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/66
[13:41:45] (update) raymond-ndibe: [jobs-cli] remove unknown keys from dump [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/64 (https://phabricator.wikimedia.org/T341066)
[13:45:04] (approved) raymond-ndibe: [jobs-cli] remove unknown keys from dump [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/64 (https://phabricator.wikimedia.org/T341066)
[13:51:31] (merge) raymond-ndibe: [jobs-cli] remove unknown keys from dump [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/64 (https://phabricator.wikimedia.org/T341066)
[13:51:32] (update) raymond-ndibe: [jobs-cli] multi-replica support for continuous jobs [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/63 (https://phabricator.wikimedia.org/T341066)
[13:52:50] wikitech.wikimedia.org: Add.svg image shows up as “Add patch” redlink on Wikitech deployment calendar - https://phabricator.wikimedia.org/T375193#10160866 (Lucas_Werkmeister_WMDE) Open→Invalid 🤷
[13:54:25] (approved) raymond-ndibe: [jobs-api] multi-replica support for continuous jobs [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/115 (https://phabricator.wikimedia.org/T341066)
[13:57:40] (merge) raymond-ndibe: [jobs-api] multi-replica support for continuous jobs [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/115 (https://phabricator.wikimedia.org/T341066)
[14:00:15] (open) project_1317_bot_df3177307bed93c3f34e421e26c86e38: jobs-api: bump to 0.0.336-20240919135748-c8ffa589 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/523 (https://phabricator.wikimedia.org/T341066)
[14:04:02] Data-Services, Data-Engineering, SRE, Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10160898 (fnegri) This requires a change to the wiki replicas view definition...
[14:04:10] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Epic: [Hypothesis] WE6.3.4 By building an "orchestrator" toolforge component (components-api) we will be able to automate most manually-triggered deployments - https://phabricator.wikimedia.org/T375199 (dcaro) NEW
[14:05:06] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Epic: [Hypothesis] WE6.3.4 By building an "orchestrator" toolforge component (components-api) we will be able to automate most manually-triggered deployments - https://phabricator.wikimedia.org/T375199#10160926 (dcaro)
[14:05:06] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Epic: [components-api] First iteration of the component API - https://phabricator.wikimedia.org/T362051#10160927 (dcaro)
[14:05:41] !log raymondndibe@wmf3402 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api (T341066)
[14:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[14:05:46] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066
[14:10:00] !log raymondndibe@wmf3402 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-api (T341066)
[14:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[14:10:10] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Epic: [Hypothesis] WE6.3.4 By building an "orchestrator" toolforge component (components-api) we will be able to automate most manually-triggered deployments - https://phabricator.wikimedia.org/T375199#10160942 (dcaro)
[14:15:15] Data-Services, Data-Engineering, SRE, Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10160987 (Dreamy_Jazz) >>! In T371486#10160897, @fnegri wrote: > @Ladsgroup c...
[14:17:16] 10cloud-services-team (FY2024/2025-Q1-Q2): [cloud] Drain B row from cloud* services - https://phabricator.wikimedia.org/T374463#10160996 (10dcaro) 05Open→03Resolved [14:30:15] 10cloud-services-team (FY2024/2025-Q1-Q2): [cloudceph] Improve downtime when a switch goes down - https://phabricator.wikimedia.org/T375204 (10dcaro) 03NEW [14:31:00] 10cloud-services-team (FY2024/2025-Q1-Q2): [cloudceph] Improve downtime when a switch goes down - https://phabricator.wikimedia.org/T375204#10161108 (10dcaro) 05Open→03In progress p:05Triage→03High [14:31:21] (03open) 10raymond-ndibe: d/changelog: bump to 16.1.2 [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/68 (https://phabricator.wikimedia.org/T341066) [14:31:30] (03approved) 10raymond-ndibe: d/changelog: bump to 16.1.2 [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/68 (https://phabricator.wikimedia.org/T341066) [14:32:08] (03merge) 10raymond-ndibe: d/changelog: bump to 16.1.2 [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/68 (https://phabricator.wikimedia.org/T341066) [14:36:57] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge, 07Epic: [components-api] First iteration of the component API - https://phabricator.wikimedia.org/T362051#10161176 (10dcaro) [14:37:15] (03PS1) 10Elukey: requestctl: modify comment for post_docroot.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/1074194 (https://phabricator.wikimedia.org/T374443) [14:37:25] (03CR) 10Elukey: [V:03+2 C:03+2] requestctl: modify comment for post_docroot.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/1074194 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [14:42:50] !log raymondndibe@wmf3402 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api (T341066) [14:42:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [14:42:54] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 [14:47:36] !log raymondndibe@wmf3402 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api (T341066) [14:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [14:50:41] !log raymondndibe@wmf3402 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api (T341066) [14:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [14:50:45] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 [14:52:58] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review: openstack: clarify default security group semantics - https://phabricator.wikimedia.org/T375111#10161340 (10aborrero) some additional links, this is the neutron code that creates the default sg: * https://github.com/openstack/neutron/blob/08fff4087dc34... [14:55:59] !log raymondndibe@wmf3402 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-api (T341066) [14:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [14:56:03] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 [14:57:08] !log raymondndibe@wmf3402 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api (T341066) [14:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:01:05] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review: openstack: clarify default security group semantics - https://phabricator.wikimedia.org/T375111#10161385 (10fnegri) @aborrero thanks for all your investigation! 
I dug through a few docs and internet pages and maybe I found a potential approach (to be ve... [15:01:22] !log raymondndibe@wmf3402 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-api (T341066) [15:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:01:29] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 [15:07:32] !log raymondndibe@wmf3402 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api (T341066) [15:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:07:36] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 [15:08:40] !log raymondndibe@wmf3402 tools END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component jobs-api (T341066) [15:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:18:16] !log raymondndibe@wmf3402 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api (T341066) [15:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:18:20] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 [15:19:34] !log raymondndibe@wmf3402 tools END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component jobs-api (T341066) [15:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:27:11] !log raymondndibe@wmf3402 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api (T341066) [15:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:27:15] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 [15:28:20] !log 
raymondndibe@wmf3402 tools END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component jobs-api (T341066) [15:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:31:28] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.set_maintenance (T373740) [15:31:33] T373740: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740 [15:32:05] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.set_maintenance (exit_code=0) (T373740) [15:33:12] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1048.eqiad.wmnet' (T373740) [15:43:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-24 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:44:33] (03open) 10pwangai: Gerrit message improvements [toolforge-repos/sonarqubebot-experimental] - 10https://gitlab.wikimedia.org/toolforge-repos/sonarqubebot-experimental/-/merge_requests/2 (https://phabricator.wikimedia.org/T373109) [15:44:47] (03merge) 10pwangai: Gerrit message improvements [toolforge-repos/sonarqubebot-experimental] - 10https://gitlab.wikimedia.org/toolforge-repos/sonarqubebot-experimental/-/merge_requests/2 (https://phabricator.wikimedia.org/T373109) [15:46:48] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1048.eqiad.wmnet' (T373740) [15:46:54] T373740: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740 [15:48:23] FIRING: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-24 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady [15:57:22] (03open) 10pwangai: Gerrit message improvements [toolforge-repos/sonarqubebot-experimental] - 10https://gitlab.wikimedia.org/toolforge-repos/sonarqubebot-experimental/-/merge_requests/3 (https://phabricator.wikimedia.org/T373109) [15:57:32] (03merge) 10pwangai: Gerrit message improvements [toolforge-repos/sonarqubebot-experimental] - 10https://gitlab.wikimedia.org/toolforge-repos/sonarqubebot-experimental/-/merge_requests/3 (https://phabricator.wikimedia.org/T373109) [16:00:48] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740#10161647 (10aborrero) the server has been drained, it should be ready to go at any time @VRiley-WMF thanks! [16:07:27] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console [16:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:07:34] !log dcaro@urcuchillay tools END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255) [16:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:07:59] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console [16:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:08:04] !log dcaro@urcuchillay tools END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255) [16:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:08:27] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console [16:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:08:45] 06cloud-services-team: Complete upgrading WMCS bare metal hosts from Bullseye to Bookworm - https://phabricator.wikimedia.org/T375217 (10fnegri) 03NEW [16:08:52] 
06cloud-services-team: Complete upgrading WMCS bare metal hosts from Bullseye to Bookworm - https://phabricator.wikimedia.org/T375217#10161692 (10fnegri) p:05Triage→03Low [16:09:27] 06cloud-services-team: Complete upgrading WMCS bare metal hosts from Bullseye to Bookworm - https://phabricator.wikimedia.org/T375217#10161697 (10fnegri) [16:09:28] 06cloud-services-team, 10Cloud-VPS: Upgrade cloudlb hosts to bookworm - https://phabricator.wikimedia.org/T375082#10161696 (10fnegri) [16:12:39] 06cloud-services-team: Complete upgrading WMCS bare metal hosts from Bullseye to Bookworm - https://phabricator.wikimedia.org/T375217#10161706 (10fnegri) [16:12:59] 06cloud-services-team: Complete upgrading WMCS bare metal hosts from Bullseye to Bookworm - https://phabricator.wikimedia.org/T375217#10161712 (10fnegri) [16:26:44] !log raymondndibe@wmf3402 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api (T341066) [16:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:26:50] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 [16:38:04] !log raymondndibe@wmf3402 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api (T341066) [16:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:38:09] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 [16:43:53] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [16:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:45:24] !log dcaro@urcuchillay tools END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component jobs-api [16:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:46:34] !log raymondndibe@wmf3402 tools START - Cookbook 
wmcs.toolforge.component.deploy for component jobs-api (T341066) [16:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:46:37] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 [16:48:20] !log raymondndibe@wmf3402 tools END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component jobs-api (T341066) [16:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:06:39] FIRING: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [17:06:54] !log raymondndibe@wmf3402 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api (T341066) [17:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:06:58] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 [17:11:39] RESOLVED: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [17:13:20] !log raymondndibe@wmf3402 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api (T341066) [17:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:13:24] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 
[17:15:13] (03approved) 10raymond-ndibe: jobs-api: bump to 0.0.336-20240919135748-c8ffa589 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/523 (https://phabricator.wikimedia.org/T341066) (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [17:15:20] (03merge) 10raymond-ndibe: jobs-api: bump to 0.0.336-20240919135748-c8ffa589 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/523 (https://phabricator.wikimedia.org/T341066) (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [17:16:20] (03update) 10raymond-ndibe: [jobs-cli] multi-replica support for continuous jobs [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/63 (https://phabricator.wikimedia.org/T341066) [17:17:27] (03merge) 10raymond-ndibe: [jobs-cli] multi-replica support for continuous jobs [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/63 (https://phabricator.wikimedia.org/T341066) [17:20:37] (03update) 10raymond-ndibe: [toolforge-deploy] test multi-replica support for continuous jobs [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/521 (https://phabricator.wikimedia.org/T341066) [17:26:04] (03open) 10raymond-ndibe: d/changelog: bump to 16.1.3 [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/69 (https://phabricator.wikimedia.org/T341066) [17:26:17] !log raymondndibe@wmf3402 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli (T341066) [17:26:20] !log raymondndibe@wmf3402 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli (T341066) [17:26:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [17:26:21] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 [17:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [17:27:05] !log raymondndibe@wmf3402 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli (T341066) [17:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [17:27:28] !log raymondndibe@wmf3402 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli (T341066) [17:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [17:28:52] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli (T341066) [17:29:27] PROBLEM - Host cloudvirt1048 is DOWN: PING CRITICAL - Packet loss = 100% [17:33:47] FIRING: NodeDown: Cloudvirt node cloudvirt1048 is down. 
#page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1048 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown [17:33:55] 06cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T375223 (10phaultfinder) 03NEW [17:34:04] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli (T341066) [17:34:30] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 [17:35:01] (03update) 10raymond-ndibe: [toolforge-deploy] test multi-replica support for continuous jobs [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/521 (https://phabricator.wikimedia.org/T341066) [17:35:16] (03approved) 10raymond-ndibe: [toolforge-deploy] test multi-replica support for continuous jobs [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/521 (https://phabricator.wikimedia.org/T341066) [17:35:21] (03merge) 10raymond-ndibe: [toolforge-deploy] test multi-replica support for continuous jobs [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/521 (https://phabricator.wikimedia.org/T341066) [17:35:57] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli (T341066) [17:38:57] RECOVERY - Host cloudvirt1048 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [17:41:08] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli (T341066) [17:41:09] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - 
https://phabricator.wikimedia.org/T373740#10162201 (10VRiley-WMF) 05Open→03Resolved This DIMM (B2) has been swapped out. Please let us know if any other issue crops up. [17:41:12] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 [17:41:44] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli (T341066) [17:44:26] 10Toolforge (Toolforge iteration 14): [maintain-dbusers] When it stops working (ex. nfs got stuck), it still replies as ok to prometheus - https://phabricator.wikimedia.org/T375224 (10dcaro) 03NEW [17:44:27] 10Toolforge (Toolforge iteration 14): [maintain-dbusers] When it stops working (ex. nfs got stuck), it still replies as ok to prometheus - https://phabricator.wikimedia.org/T375224#10162218 (10dcaro) p:05Triage→03Medium [17:44:33] 10Toolforge (Toolforge iteration 14): lima-kilo installation giving inconsistent result. Sometimes it works, sometimes it doesn't - https://phabricator.wikimedia.org/T375163#10162219 (10dcaro) p:05Triage→03Medium [17:44:39] 10Toolforge (Toolforge iteration 14): add --force to wmcs.toolforge.remove_k8s_node cookbook - https://phabricator.wikimedia.org/T375158#10162220 (10dcaro) p:05Triage→03Medium [17:44:46] 10Toolforge (Toolforge iteration 14): kyverno prevents deletion of pods that violates its policies - https://phabricator.wikimedia.org/T375157#10162221 (10dcaro) p:05Triage→03High [17:44:56] 10Toolforge (Toolforge iteration 14), 13Patch-For-Review: [lima-kilo] allow for the creation of a multi-node high availability cluster - https://phabricator.wikimedia.org/T374585#10162222 (10dcaro) p:05Triage→03High [17:44:58] 10Toolforge (Toolforge iteration 14): [jobs-api] prepend date and pod name to filelog lines - https://phabricator.wikimedia.org/T372025#10162224 (10dcaro) p:05Triage→03Medium [17:45:05] 10Toolforge (Toolforge iteration 14): Support HTTP health checks in jobs framework - 
https://phabricator.wikimedia.org/T362621#10162225 (10dcaro) p:05Triage→03High [17:45:35] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14), 07Epic: [Hypothesis] WE6.3.4 By building an "orchestrator" toolforge component (components-api) we will be able to automate most manually-triggered deployments - https://phabricator.wikimedia.org/T375199#10162226 (10dcaro) p:05... [17:47:11] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14), 07Epic: [Hypothesis] WE6.3.4 By building an "orchestrator" toolforge component (components-api) we will be able to automate most manually-triggered deployments - https://phabricator.wikimedia.org/T375199#10162229 (10dcaro) [17:47:30] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli (T341066) [17:47:34] T341066: [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066 [17:50:17] RESOLVED: NodeDown: Cloudvirt node cloudvirt1048 is down. 
#page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1048 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown [17:55:52] (03approved) 10raymond-ndibe: d/changelog: bump to 16.1.3 [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/69 (https://phabricator.wikimedia.org/T341066) [17:55:57] (03merge) 10raymond-ndibe: d/changelog: bump to 16.1.3 [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/69 (https://phabricator.wikimedia.org/T341066) [17:58:47] 10Tool-quickcategories, 10Toolforge: Relax restrictions on toolforge envvar names - https://phabricator.wikimedia.org/T374780#10162297 (10dcaro) Lowercase might be ok to include yep, though `.` and `-` are not valid bash variable characters, so that would not be possible (even though they are valid k8s secrets... [18:20:36] (03update) 10raymond-ndibe: [toolforge-deploy] upgrade metrics-server [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/520 (https://phabricator.wikimedia.org/T359641) [19:50:07] 10Tool-quickcategories, 10Toolforge: Relax restrictions on toolforge envvar names - https://phabricator.wikimedia.org/T374780#10162547 (10LucasWerkmeister) Do we have to be bound by Bash’s syntax limitations? `.` and `-` work just fine between `env` and Flask / Python, even if there is a Bash in between: `lan... 
[19:56:17] 10Tool-video-answer-tool, 06Future-Audiences, 07Spike: Investigate different options for animation of images - https://phabricator.wikimedia.org/T374367#10162555 (10etz) a:03etz [22:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:35:37] !log raymondndibe@wmf3402 tools START - Cookbook wmcs.toolforge.k8s.image.copy_to_registry (T359641) [22:35:37] !log raymondndibe@wmf3402 tools Updating container image docker-registry.tools.wmflabs.org/docker-registry.tools.wmflabs.org/metrics-server:v0.7.1 (T359641) [22:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [22:35:43] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [22:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [22:35:47] !log raymondndibe@wmf3402 tools END (ERROR) - Cookbook wmcs.toolforge.k8s.image.copy_to_registry (exit_code=97) (T359641) [22:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [22:36:08] !log raymondndibe@wmf3402 tools START - Cookbook wmcs.toolforge.k8s.image.copy_to_registry (T359641) [22:36:09] !log raymondndibe@wmf3402 tools Updating container image docker-registry.tools.wmflabs.org/metrics-server:v0.7.1 (T359641) [22:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [22:36:13] Logged the 
message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [22:36:19] !log raymondndibe@wmf3402 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.image.copy_to_registry (exit_code=99) (T359641) [22:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [22:37:30] !log raymondndibe@wmf3402 tools START - Cookbook wmcs.toolforge.k8s.image.copy_to_registry (T359641) [22:37:30] !log raymondndibe@wmf3402 tools Updating container image docker-registry.tools.wmflabs.org/metrics-server:v0.7.1 (T359641) [22:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [22:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [22:38:08] !log raymondndibe@wmf3402 tools END (PASS) - Cookbook wmcs.toolforge.k8s.image.copy_to_registry (exit_code=0) (T359641) [22:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [23:03:29] 10Tool-paulina: Public domain calculator module - https://phabricator.wikimedia.org/T375247 (10Pepe_piton) 03NEW [23:03:45] 10Tool-paulina: Public domain calculator module - https://phabricator.wikimedia.org/T375247#10163040 (10Pepe_piton) p:05Triage→03Medium [23:04:25] 10Tool-paulina: Public domain calculator module - https://phabricator.wikimedia.org/T375247#10163041 (10Pepe_piton) a:03Pepe_piton [23:11:49] !log raymondndibe@wmf3402 tools START - Cookbook wmcs.toolforge.k8s.image.copy_to_registry (T359641) [23:11:49] !log raymondndibe@wmf3402 tools Updating container image docker-registry.tools.wmflabs.org/kube-state-metrics:v2.10.1 (T359641) [23:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [23:11:53] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [23:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [23:12:25] !log raymondndibe@wmf3402 tools END (PASS) - Cookbook 
wmcs.toolforge.k8s.image.copy_to_registry (exit_code=0) (T359641) [23:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [23:17:02] !log raymondndibe@wmf3402 tools START - Cookbook wmcs.toolforge.k8s.image.copy_to_registry (T359641) [23:17:02] !log raymondndibe@wmf3402 tools Updating container image docker-registry.tools.wmflabs.org/metrics-server:v0.7.10 (T359641) [23:17:04] !log raymondndibe@wmf3402 tools END (ERROR) - Cookbook wmcs.toolforge.k8s.image.copy_to_registry (exit_code=97) (T359641) [23:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [23:17:06] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [23:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [23:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [23:57:50] (03update) 10raymond-ndibe: [toolforge-deploy] upgrade metrics-server [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/520 (https://phabricator.wikimedia.org/T359641) [23:59:37] (03update) 10raymond-ndibe: [toolforge-deploy] upgrade wmcs-k8s-metrics [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/520 (https://phabricator.wikimedia.org/T359641) [23:59:47] (03update) 10raymond-ndibe: [toolforge-deploy] upgrade wmcs-k8s-metrics [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/520 (https://phabricator.wikimedia.org/T359641)