[00:15:41] (CloudVPSDesignateLeaks) firing: (5) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [00:25:41] (CloudVPSDesignateLeaks) firing: (5) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [00:50:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [00:57:56] (SystemdUnitDown) firing: The service unit wmf_auto_restart_virtlogd.service is in failed status on host cloudvirt1036. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1036 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [01:56:38] (03CR) 10Krinkle: [C:03+2] frontend: Server-side rendering [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1016075 (owner: 10Krinkle) [01:56:48] (03PS1) 10Krinkle: frontend: Flush headers before rendering the View [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1016474 [01:57:22] (03CR) 10Krinkle: [C:03+2] frontend: Flush headers before rendering the View [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1016474 (owner: 10Krinkle) [01:57:27] (03Merged) 10jenkins-bot: frontend: Server-side rendering [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1016075 (owner: 10Krinkle) [01:58:10] (03Merged) 10jenkins-bot: frontend: Flush headers before rendering the View [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1016474 (owner: 10Krinkle) [02:31:48] (03PS1) 10Krinkle: frontend: Add optional CODESEARCH_HOUND_BASE for local Hound API [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1016479 [02:37:14] (03PS1) 10Krinkle: Revert "frontend: Server-side rendering" [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1016045 [02:41:37] (03PS2) 10Krinkle: Revert "frontend: Server-side rendering" [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1016045 [02:42:22] (03PS3) 10Krinkle: Revert "frontend: Server-side rendering" [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1016045 [02:42:54] (03CR) 10Krinkle: [C:03+2] Revert "frontend: Server-side rendering" [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1016045 (owner: 10Krinkle) [02:43:44] (03Merged) 10jenkins-bot: Revert "frontend: Server-side rendering" [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1016045 (owner: 10Krinkle) [02:52:56] (SystemdUnitDown) firing: The systemd unit wmf_auto_restart_virtlogd.service on node cloudvirt1036 has 
been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1036 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [02:53:00] 06cloud-services-team: SystemdUnitDown Unit wmf_auto_restart_virtlogd.service on node cloudvirt1036 has been down for long. - https://phabricator.wikimedia.org/T361662 (10phaultfinder) 03NEW [03:04:53] 06cloud-services-team, 10Cloud-VPS, 10Puppet (Puppet 7.0): 14Migrate Cloud VPS central puppet server to Puppet 7 - 14https://phabricator.wikimedia.org/T351451#9682747 (10Andrew) 05Open→03Resolved a:03Andrew [03:09:33] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 05Goal, 13Patch-For-Review, 10Puppet (Puppet 7.0): Migrate Cloud VPS puppet infrastructure to Puppet 7 - https://phabricator.wikimedia.org/T351450#9682752 (10Andrew) Hm... the puppet servers themselves are upgraded but I'm not sure when to actually... 
[04:05:50] (03CR) 10VolkerE: [C:03+2] releases: Bump Codex to 1.3.6 [labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/1016454 (https://phabricator.wikimedia.org/T361472) (owner: 10Catrope) [04:15:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [04:15:44] (03Merged) 10jenkins-bot: releases: Bump Codex to 1.3.6 [labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/1016454 (https://phabricator.wikimedia.org/T361472) (owner: 10Catrope) [05:08:59] 06cloud-services-team: update labtestwiki user and password - https://phabricator.wikimedia.org/T328289#9682860 (10Marostegui) I have updated the grants for ``wikiadmin2023`@`10.64.16.77` to unblock T361631 [05:25:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [05:31:42] 06cloud-services-team: Upgrade labtestwiki to MariaDB 10.6 - https://phabricator.wikimedia.org/T361666 (10Marostegui) 03NEW [06:21:18] 06cloud-services-team, 10wikitech.wikimedia.org, 07Epic: Set up a bitu instance for codfw1dev - https://phabricator.wikimedia.org/T360795#9682955 (10SLyngshede-WMF) Bitu is currently installed as a .deb package on two Ganeti hosts. [06:53:11] (SystemdUnitDown) firing: The systemd unit wmf_auto_restart_virtlogd.service on node cloudvirt1036 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1036 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [07:54:30] 06cloud-services-team: Upgrade labtestwiki to MariaDB 10.6 - https://phabricator.wikimedia.org/T361666#9683126 (10taavi) a:03taavi [08:01:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 1 deleted instances on toolsbeta-puppetserver-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [08:04:32] 06cloud-services-team, 10Cloud-VPS: Linting problems found for NovafullstackSustainedFailures - https://phabricator.wikimedia.org/T351698#9683156 (10fgiunchedi) Indeed, I don't see the alert at https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DAlertLintProblem [08:11:28] (PuppetStaleCertificates) resolved: Found non-revoked Puppet certificates for 1 deleted instances on toolsbeta-puppetserver-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [08:21:06] 06cloud-services-team, 10Toolforge (Toolforge iteration 07), 13Patch-For-Review: Toolforge: Introduce grid-less bookworm based bastion hosts - https://phabricator.wikimedia.org/T314665#9683242 (10dcaro) [08:30:55] 10Toolforge (Toolforge iteration 07): [harbor] upgrade to 2.10.1 - https://phabricator.wikimedia.org/T354507#9683297 (10Slst2020) 05Open→03In progress [08:31:18] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Toolforge, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance: [wmcs-cookbooks,toolforge,nfs] automate cleanup of D state webservices by deleting the stuck pod - https://phabricator.wikimedia.org/T348662#9683304 (10dcaro) >>! 
In T348662#9680382,... [08:34:42] 06cloud-services-team, 13Patch-For-Review: 14Upgrade labtestwiki to MariaDB 10.6 - 14https://phabricator.wikimedia.org/T361666#9683309 (10taavi) 05Open→03Resolved 14This is complete. [08:41:06] 10Toolforge (Toolforge iteration 07), 13Patch-For-Review: [harbor] upgrade to 2.10.1 - https://phabricator.wikimedia.org/T354507#9683328 (10Slst2020) [08:41:27] 10Toolforge (Toolforge iteration 07), 13Patch-For-Review: [harbor] upgrade to 2.10.1 - https://phabricator.wikimedia.org/T354507#9683321 (10CodeReviewBot) sstefanova opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/36 harbor: upgrade to 2.10.1 [08:50:31] !log aborrero@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1036'] [08:50:39] !log aborrero@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1036'] [08:52:42] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [08:52:48] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [08:56:50] RECOVERY - ensure kvm processes are running on cloudvirt1036 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:41] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1037.eqiad.wmnet' (T319184) [09:09:46] T319184: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 [09:10:41] (CloudVPSDesignateLeaks) firing: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - 
https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [09:15:41] (CloudVPSDesignateLeaks) firing: (5) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [09:20:41] (CloudVPSDesignateLeaks) firing: (5) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [09:26:28] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1037.eqiad.wmnet' (T319184) [09:26:33] T319184: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 [09:27:25] 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 06SRE, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683460 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1037.eqiad.wmnet... 
[09:44:05] (03CR) 10Majavah: [C:03+2] php82-sssd: add php-yaml [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1015690 (https://phabricator.wikimedia.org/T361457) (owner: 10Krinkle) [09:44:43] (03Merged) 10jenkins-bot: php82-sssd: add php-yaml [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1015690 (https://phabricator.wikimedia.org/T361457) (owner: 10Krinkle) [09:49:27] (03PS1) 10Muehlenhoff: Remove dummy cert for debmonitor [labs/private] - 10https://gerrit.wikimedia.org/r/1016726 (https://phabricator.wikimedia.org/T357750) [10:17:34] 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 06SRE, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683638 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1037.eqiad.wmnet with... [10:21:23] !log aborrero@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1037'] [10:21:30] !log aborrero@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1037'] [10:21:41] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [10:21:48] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [10:25:21] 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 06SRE, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683651 (10aborrero) [10:25:36] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1038.eqiad.wmnet' (T319184) [10:25:39] T319184: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 [10:25:41] (CloudVPSDesignateLeaks) resolved: (3) 
Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [10:32:28] (InstanceDown) firing: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [10:34:02] 10Toolforge (Software install/update), 13Patch-For-Review: 14Install php-yaml in Toolforge images - 14https://phabricator.wikimedia.org/T361457#9683656 (10taavi) 05Open→03Resolved a:03Krinkle [10:34:57] 10Toolforge (Toolforge iteration 07), 13Patch-For-Review: [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#9683659 (10dcaro) [10:37:06] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1038.eqiad.wmnet' (T319184) [10:37:11] T319184: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 [10:37:28] (InstanceDown) resolved: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [10:53:11] (SystemdUnitDown) firing: The systemd unit wmf_auto_restart_virtlogd.service on node cloudvirt1036 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1036 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [10:58:59] 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 06SRE, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683722 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1038.eqiad.wmnet... [11:06:41] 10Toolforge (Toolforge iteration 07), 13Patch-For-Review: [harbor] upgrade to 2.10.1 - https://phabricator.wikimedia.org/T354507#9683813 (10CodeReviewBot) sstefanova merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/36 harbor: upgrade to 2.10.1 [11:08:02] 10Toolforge (Toolforge iteration 07), 13Patch-For-Review: [harbor] upgrade to 2.10.1 - https://phabricator.wikimedia.org/T354507#9683830 (10CodeReviewBot) project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/228 builds-bu... 
[11:21:29] !log taavi@runko tools START - Cookbook wmcs.vps.remove_instance for instance tools-proxy-06 [11:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [11:21:48] !log taavi@runko tools END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-proxy-06 [11:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [11:23:48] !log taavi@runko tools START - Cookbook wmcs.vps.remove_instance for instance tools-proxy-06 [11:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [11:23:53] !log taavi@runko tools END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-proxy-06 [11:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [11:24:41] !log taavi@runko tools START - Cookbook wmcs.vps.remove_instance for instance tools-proxy-06 [11:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [11:24:45] !log taavi@runko tools END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-proxy-06 [11:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [11:25:26] (SystemdUnitDown) resolved: The systemd unit wmf_auto_restart_virtlogd.service on node cloudvirt1036 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1036 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [11:27:15] !log taavi@runko cloudinfra START - Cookbook wmcs.vps.remove_instance for instance mx-out04 [11:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL [11:27:26] !log taavi@runko cloudinfra END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance mx-out04 [11:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL [11:28:08] !log taavi@runko cloudinfra START - Cookbook wmcs.vps.remove_instance for instance mx-out03 [11:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL [11:28:18] !log taavi@runko cloudinfra END (FAIL) - Cookbook wmcs.vps.remove_instance (exit_code=99) for instance mx-out03 [11:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL [11:28:39] !log taavi@runko cloudinfra START - Cookbook wmcs.vps.remove_instance for instance mx-out04 [11:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL [11:28:43] !log taavi@runko cloudinfra END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance mx-out04 [11:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL [11:28:47] !log taavi@runko cloudinfra START - Cookbook wmcs.vps.remove_instance for instance mx-out03 [11:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL [11:28:55] !log taavi@runko cloudinfra END (FAIL) - Cookbook wmcs.vps.remove_instance (exit_code=99) for instance mx-out03 [11:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL [11:29:47] !log taavi@runko cloudinfra START - Cookbook 
wmcs.vps.remove_instance for instance mx-out03 [11:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL [11:30:12] !log taavi@runko cloudinfra END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance mx-out03 [11:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL [11:31:28] !log aborrero@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1038'] [11:31:50] !log aborrero@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1038'] [11:33:02] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [11:33:08] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [11:34:24] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1039.eqiad.wmnet' (T319184) [11:34:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 2 deleted instances on cloudinfra-internal-puppetserver-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [11:34:28] T319184: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 [11:35:09] (03PS1) 10Majavah: vps: remove_instance: Improve Puppet certificate handling [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1016748 [11:35:13] 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 06SRE, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683890 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host 
cloudvirt1038.eqiad.wmnet with... [11:35:29] 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 06SRE, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683893 (10aborrero) [11:36:33] (03CR) 10CI reject: [V:04-1] vps: remove_instance: Improve Puppet certificate handling [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1016748 (owner: 10Majavah) [11:38:26] (03PS2) 10Majavah: vps: remove_instance: Improve Puppet certificate handling [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1016748 [11:43:59] 10cloud-services-team (FY2023/2024-Q3-Q4), 05Cloud-Services-Origin-Alert, 07Cloud-Services-Worktype-Maintenance: [cloudinfra] puppet CA cert expired - https://phabricator.wikimedia.org/T361563#9683942 (10taavi) 05Resolved→03Open [11:44:36] 10cloud-services-team (FY2023/2024-Q3-Q4), 05Cloud-Services-Origin-Alert, 07Cloud-Services-Worktype-Maintenance: [cloudinfra] puppet CA cert expired - https://phabricator.wikimedia.org/T361563#9683944 (10taavi) Something is still broken somewhere: ` taavi@cloudinfra-internal-puppetserver-1:~$ sudo puppet node... [11:48:20] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1039.eqiad.wmnet' (T319184) [11:48:25] T319184: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 [11:50:57] 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 06SRE, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1039.eqiad.wmnet... 
[11:55:54] 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 06SRE, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683981 (10aborrero) [12:02:09] 10cloud-services-team (FY2023/2024-Q3-Q4), 05Cloud-Services-Origin-Alert, 07Cloud-Services-Worktype-Maintenance: 14[cloudinfra] puppet CA cert expired - 14https://phabricator.wikimedia.org/T361563#9683992 (10taavi) 05Open→03Resolved 14`lang=shell-session root@cloudinfra-internal-puppetserver-1:/srv/... [12:03:58] 10Toolforge (Toolforge iteration 07), 13Patch-For-Review, 07Upstream: [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417#9684000 (10Slst2020) 05Stalled→03Open [12:04:28] (PuppetStaleCertificates) resolved: Found non-revoked Puppet certificates for 2 deleted instances on cloudinfra-internal-puppetserver-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [12:04:30] 10Toolforge (Toolforge iteration 07): [builds-cli,builds-api] `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701#9684006 (10Slst2020) 05Stalled→03Open [12:12:41] (CloudVPSDesignateLeaks) firing: Detected 8 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [12:17:41] (CloudVPSDesignateLeaks) firing: (3) Detected 8 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [12:31:04] 06cloud-services-team, 10Toolforge: Replace PodSecurityPolicy in Toolforge Kubernetes - 
https://phabricator.wikimedia.org/T279110#9684105 (10aborrero) It could be interesting to have {T357977} in place before this change, to ease in the migration. [12:34:28] 10Toolforge (Toolforge iteration 07): [maintain-harbor] Have maintain-harbor use a robot account - https://phabricator.wikimedia.org/T361698 (10Slst2020) 03NEW [12:34:50] 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 06SRE, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9684132 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1039.eqiad.wmnet with... [12:35:28] 10Toolforge (Toolforge iteration 07): [maintain-harbor] Have maintain-harbor use a robot account - https://phabricator.wikimedia.org/T361698#9684133 (10Slst2020) [12:35:29] 10Toolforge (Toolforge iteration 07), 13Patch-For-Review: [harbor] upgrade to 2.10.1 - https://phabricator.wikimedia.org/T354507#9684134 (10Slst2020) [12:36:20] 10Toolforge (Toolforge iteration 07): [maintain-harbor] Have maintain-harbor use a robot account - https://phabricator.wikimedia.org/T361698#9684147 (10Slst2020) [12:36:24] 10Toolforge (Toolforge iteration 07), 13Patch-For-Review, 07Upstream: [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417#9684146 (10Slst2020) [12:39:47] !log aborrero@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1039'] [12:39:49] 10Toolforge (Toolforge iteration 07): [maintain-harbor] Have maintain-harbor use a robot account - https://phabricator.wikimedia.org/T361698#9684168 (10Slst2020) [12:39:49] 10Toolforge (Toolforge iteration 07): [builds-cli,builds-api] `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701#9684167 (10Slst2020) [12:39:55] 14Toolforge Build Service: Toolforge build service (Rust) fails with "too many open 
files" - https://phabricator.wikimedia.org/T361700 (10Magnus) 03NEW [12:40:09] !log aborrero@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1039'] [12:40:24] !log aborrero@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1039'] [12:40:27] !log aborrero@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1039'] [12:40:38] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [12:40:44] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [12:41:28] 10Toolforge: [buildservice] "failed to create fsnotify watcher: too many open files" and "unable to open destination: open /tekton/home/.docker/config.json: permission denied" - https://phabricator.wikimedia.org/T361519#9684177 (10taavi) [12:42:22] 14Toolforge Build Service: 14Toolforge build service (Rust) fails with "too many open files" - 14https://phabricator.wikimedia.org/T361700#9684175 (10taavi) →14Duplicate dup:03T361519 [12:43:25] 10Toolforge (Toolforge iteration 07), 13Patch-For-Review: [harbor] upgrade to 2.10.1 - https://phabricator.wikimedia.org/T354507#9684186 (10CodeReviewBot) sstefanova merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/228 builds-builder: bump to 0.0.95-20240403110641-05... [12:44:26] (03CR) 10FNegri: [C:03+1] "Thanks, LGTM! I left a small comment inline." 
[cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1016748 (owner: 10Majavah) [12:45:21] 10Toolforge: [buildservice] "failed to create fsnotify watcher: too many open files" and "unable to open destination: open /tekton/home/.docker/config.json: permission denied" - https://phabricator.wikimedia.org/T361519#9684194 (10dcaro) Just note that the permission denied for the docker creds is an expected lo... [12:51:23] (03CR) 10Elukey: [V:03+2 C:03+2] Remove profile::pki::client's specific hiera config [labs/private] - 10https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey) [12:52:12] (03CR) 10Majavah: "No, this is needed for PCC runs for wikiproduction hosts..." [labs/private] - 10https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey) [12:53:14] (03CR) 10Elukey: [V:03+2 C:03+2] "There is already a value in common.yaml, it should be fine to just use that one, no? I think it is confusing to keep two values.." [labs/private] - 10https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey) [12:55:39] (03CR) 10Majavah: "I don't think namespaced keys are looked up from common.yaml in production, but I might be wrong?" [labs/private] - 10https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey) [12:58:35] (03CR) 10Elukey: [V:03+2 C:03+2] "I was convinced they were, but then I discovered https://phabricator.wikimedia.org/T209265. 
This task unveils horrible holes in my puppet " [labs/private] - 10https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey) [13:02:23] (03PS1) 10Elukey: profile::pki::client: re-introduce fake auth token [labs/private] - 10https://gerrit.wikimedia.org/r/1016764 (https://phabricator.wikimedia.org/T360595) [13:03:32] (03CR) 10Elukey: [V:03+2 C:03+2] profile::pki::client: re-introduce fake auth token [labs/private] - 10https://gerrit.wikimedia.org/r/1016764 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey) [13:12:31] 10Toolforge (Toolforge iteration 07), 13Patch-For-Review: [builds-api,jobs-api,envvars-api,api-gateway] FIgure out and document how to do non-backwards compatible changes - https://phabricator.wikimedia.org/T356974#9684323 (10CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforg... [13:27:38] 10Toolforge: [infra,builds-builder] "failed to create fsnotify watcher: too many open files" and "unable to open destination: open /tekton/home/.docker/config.json: permission denied" - https://phabricator.wikimedia.org/T361519#9684375 (10dcaro) [13:27:39] 10Toolforge: [infra,builds-builder] "failed to create fsnotify watcher: too many open files" and "unable to open destination: open /tekton/home/.docker/config.json: permission denied" - https://phabricator.wikimedia.org/T361519#9684373 (10dcaro) p:05Triage→03Medium [13:27:59] 06cloud-services-team, 10Toolforge: [infra] Replace PodSecurityPolicy in Toolforge Kubernetes - https://phabricator.wikimedia.org/T279110#9684376 (10dcaro) [13:28:13] 10Toolforge: [infra,puppet] Toolforge should not re-invent profile::mail::default_mail_relay - https://phabricator.wikimedia.org/T360651#9684382 (10dcaro) [13:28:32] 10Toolforge: [builds-builder] Cache .m2 folder (local maven repository) between builds - https://phabricator.wikimedia.org/T350307#9684385 (10dcaro) [13:30:31] 10Toolforge: [infra,builds-builder] "failed to create fsnotify 
watcher: too many open files" - https://phabricator.wikimedia.org/T361519#9684392 (dcaro)
[13:33:22] (HAProxyBackendUnavailable) firing: HAProxy service nova-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[13:53:22] (HAProxyBackendUnavailable) resolved: HAProxy service nova-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[13:58:24] Toolforge (Toolforge iteration 07), Patch-For-Review: [builds-api,jobs-api,envvars-api,api-gateway] FIgure out and document how to do non-backwards compatible changes - https://phabricator.wikimedia.org/T356974#9684474 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforg...
[14:09:19] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1040.eqiad.wmnet' (T319184)
[14:09:24] T319184: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184
[14:16:11] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1040.eqiad.wmnet' (T319184)
[14:16:16] T319184: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184
[14:17:02] cloud-services-team, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9684564 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1040.eqiad.wmnet...
[14:24:55] cloud-services-team, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9684602 (aborrero)
[14:29:06] !log raymond@ubuntu toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api
[14:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[14:30:02] !log raymond@ubuntu toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-api
[14:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[14:32:03] cloud-services-team, Toolforge: [infra] Replace PodSecurityPolicy in Toolforge Kubernetes - https://phabricator.wikimedia.org/T279110#9684634 (aborrero)
[14:37:48] !log raymond@ubuntu tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api
[14:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:37:52] !log raymond@ubuntu tools END (FAIL) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=99) for component builds-api
[14:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:47:28] (InstanceDown) firing: Project tools instance tools-puppetserver-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:47:28] Toolforge (Toolforge iteration 07): [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708 (Slst2020) NEW
[14:47:30] cloud-services-team, Cloud-VPS: Linting problems found for NovafullstackSustainedFailures - https://phabricator.wikimedia.org/T351698#9684711 (Andrew) Open→Resolved
[14:47:46] Toolforge (Toolforge iteration 07): [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708#9684713 (Slst2020)
[14:47:49] Toolforge (Toolforge iteration 07), 
Patch-For-Review: [harbor] upgrade to 2.10.1 - https://phabricator.wikimedia.org/T354507#9684712 (Slst2020)
[14:48:33] Toolforge (Toolforge iteration 07): [maintain-harbor] Have maintain-harbor use a robot account - https://phabricator.wikimedia.org/T361698#9684716 (Slst2020)
[14:48:33] Toolforge (Toolforge iteration 07): [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708#9684717 (Slst2020)
[14:49:26] !log raymond@ubuntu tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api
[14:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:49:32] !log raymond@ubuntu tools END (FAIL) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=99) for component builds-api
[14:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:50:47] Toolforge (Toolforge iteration 07), Documentation: [harbor,docs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092#9684722 (Slst2020) Stalled→In progress
[14:52:28] (InstanceDown) resolved: Project tools instance tools-puppetserver-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:57:42] !log raymond@ubuntu tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api
[14:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:57:47] !log raymond@ubuntu tools END (FAIL) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=99) for component builds-api
[14:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:58:17] !log raymond@ubuntu tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api
[14:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:58:23] !log raymond@ubuntu tools END (FAIL) - Cookbook wmcs.toolforge.k8s.component.deploy 
(exit_code=99) for component builds-api
[14:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:59:48] !log raymond@ubuntu tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api
[14:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:59:52] !log raymond@ubuntu tools END (FAIL) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=99) for component builds-api
[14:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:00:24] !log aborrero@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1040']
[15:00:28] !log raymond@ubuntu tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api
[15:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:00:47] !log aborrero@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1040']
[15:01:22] !log raymond@ubuntu tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-api
[15:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:01:32] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
[15:01:38] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0)
[15:01:47] cloud-services-team, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9684744 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1040.eqiad.wmnet with...
[15:14:31] Toolforge (Toolforge iteration 07), Patch-For-Review: [jobs-cli] Allow exporting jobs list in YAML format - https://phabricator.wikimedia.org/T320575#9684811 (CodeReviewBot) aborrero merged https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/17 tjf_cli: add dump operation to lis...
[15:14:35] Toolforge (Toolforge iteration 07), Software-Licensing: [builds-api] builds-api is missing a software license - https://phabricator.wikimedia.org/T361007#9684812 (Slst2020) To do this right, we need to add a license notice to every file in the repo in addition to a COPYING file. Is that correct?
[15:22:54] Toolforge (Toolforge iteration 07), Patch-For-Review: [jobs-cli] Allow exporting jobs list in YAML format - https://phabricator.wikimedia.org/T320575#9684842 (CodeReviewBot) aborrero opened https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/19 d/changelog: bump to 16.0.3
[15:22:55] Toolforge (Toolforge iteration 07), Patch-For-Review: [jobs-cli,jobs-api] Allow using file logs with build service images - https://phabricator.wikimedia.org/T353537#9684843 (CodeReviewBot) aborrero opened https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/19 d/chang...
[15:34:31] Toolforge (Toolforge iteration 07), Patch-For-Review: [jobs-cli] Allow exporting jobs list in YAML format - https://phabricator.wikimedia.org/T320575#9684910 (CodeReviewBot) aborrero merged https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/19 d/changelog: bump to 16.0.3
[15:36:48] Toolforge (Toolforge iteration 07), Software-Licensing: [builds-api] builds-api is missing a software license - https://phabricator.wikimedia.org/T361007#9684913 (aborrero) Ideally, a SPDX license header on every source file + a `LICENSE` file in the root of the repo.
[15:36:49] (PS3) Majavah: vps: remove_instance: Improve Puppet certificate handling [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1016748
[15:37:32] (CR) Majavah: vps: remove_instance: Improve Puppet certificate handling (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1016748 (owner: Majavah)
[15:40:19] Toolforge (Toolforge iteration 07), Patch-For-Review: [builds-api,jobs-api,envvars-api,api-gateway] FIgure out and document how to do non-backwards compatible changes - https://phabricator.wikimedia.org/T356974#9684917 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforg...
[15:43:50] Toolforge (Toolforge iteration 07), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9684930 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/63 [jobs-api] support job health checks
[15:46:13] Toolforge (Toolforge iteration 07), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9684951 (CodeReviewBot) project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requ...
[16:04:20] !log raymond@ubuntu toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component jobs-api
[16:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[16:05:17] !log raymond@ubuntu toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component jobs-api
[16:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[16:16:00] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api] Remove flask-restful - https://phabricator.wikimedia.org/T359806#9685025 (dcaro)
[16:16:01] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-cli] Allow exporting jobs list in YAML format - https://phabricator.wikimedia.org/T320575#9685023 (dcaro)
[16:16:05] Toolforge (Toolforge iteration 08), Patch-For-Review: [harbor] upgrade to 2.10.1 - https://phabricator.wikimedia.org/T354507#9685021 (dcaro)
[16:16:17] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support services in jobs - https://phabricator.wikimedia.org/T348758#9685029 (dcaro)
[16:16:38] Toolforge (Toolforge iteration 08), Patch-For-Review: [toolforge-cd] remove duplicated run on tag and push to master (just do one if possible) - https://phabricator.wikimedia.org/T353563#9685031 (dcaro)
[16:16:40] Toolforge (Toolforge iteration 08), Patch-For-Review: [builds-builder,builds-admission] Remove direct access to tekton from tools and remove the admission controller - https://phabricator.wikimedia.org/T360329#9685027 (dcaro)
[16:16:42] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9685033 (dcaro)
[16:17:35] Toolforge (Toolforge iteration 08), Patch-For-Review: [builds-api,jobs-api,envvars-api,api-gateway] FIgure out and document how to do non-backwards compatible changes - 
https://phabricator.wikimedia.org/T356974#9685037 (dcaro)
[16:17:41] (CloudVPSDesignateLeaks) firing: (3) Detected 8 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:17:48] Toolforge (Toolforge iteration 08): [builds-builder,jobs-api] Calling nontrivial Procfile commands with arguments results in confusing error (“no such file or directory”) - https://phabricator.wikimedia.org/T356016#9685039 (dcaro)
[16:18:04] Toolforge (Toolforge iteration 08), Patch-For-Review: [maintain-harbor] Improvements to subcommands and config validation - https://phabricator.wikimedia.org/T353059#9685035 (dcaro)
[16:18:04] Toolforge (Toolforge iteration 08), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [maintain-harbor,docs] Document current setup and admin procedures - https://phabricator.wikimedia.org/T329176#9685041 (dcaro)
[16:18:07] Toolforge (Toolforge iteration 08): Upgrade Toolforge front proxies to Bookworm - https://phabricator.wikimedia.org/T361223#9685045 (dcaro)
[16:18:23] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api] Split the API, business, and k8s models - https://phabricator.wikimedia.org/T359808#9685047 (dcaro)
[16:18:31] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 08), Goal, Patch-For-Review: [infra] Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664#9685053 (dcaro)
[16:19:01] cloud-services-team, Toolforge (Toolforge iteration 08), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [builds-api,orchestration] Automatically deploy the webservice when the image is built - https://phabricator.wikimedia.org/T341065#9685043 (dcaro)
[16:19:30] Toolforge (Toolforge iteration 08), Documentation: 
[harbor,docs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092#9685049 (dcaro)
[16:19:31] Toolforge (Toolforge iteration 08): [toolforge-cli,jobs-cli,builds-cli,envvars-cli] Explore OpenAPI SDK tooling for client consolidation - https://phabricator.wikimedia.org/T356261#9685055 (dcaro)
[16:19:33] Toolforge (Toolforge iteration 08): [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708#9685062 (dcaro)
[16:19:36] Toolforge (Toolforge iteration 08): [maintain-harbor] Have maintain-harbor use a robot account - https://phabricator.wikimedia.org/T361698#9685064 (dcaro)
[16:19:41] cloud-services-team, Toolforge (Toolforge iteration 08), Patch-For-Review: Toolforge: Introduce grid-less bookworm based bastion hosts - https://phabricator.wikimedia.org/T314665#9685051 (dcaro)
[16:19:43] Toolforge (Toolforge iteration 08): [builds-cli,builds-api] `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701#9685066 (dcaro)
[16:19:48] Toolforge (Toolforge iteration 08), Software-Licensing: [builds-api] builds-api is missing a software license - https://phabricator.wikimedia.org/T361007#9685068 (dcaro)
[16:19:50] Toolforge (Toolforge iteration 08), Patch-For-Review, Upstream: [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417#9685067 (dcaro)
[16:19:51] Toolforge (Toolforge iteration 08): I can't connect to Toolforge DB replicas from my PC using MySQL Workbench - https://phabricator.wikimedia.org/T360839#9685069 (dcaro)
[16:19:53] Toolforge (Toolforge iteration 08): Rust image build on toolforge fails - https://phabricator.wikimedia.org/T358552#9685070 (dcaro)
[16:19:55] Toolforge (Toolforge iteration 08): [builds-api,envvars-api] bump the version in the openapi definition when bumping the package version - https://phabricator.wikimedia.org/T356972#9685072 (dcaro)
[16:19:58] Toolforge (Toolforge iteration 08), Patch-For-Review: [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#9685071 (dcaro)
[16:20:02] Toolforge (Toolforge iteration 08): [toolforge] simplify calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377#9685073 (dcaro)
[16:20:07] Toolforge (Toolforge iteration 08), Epic: [jobs-cli,builds-cli,toolforge-cli,webservice] Consolidate the Toolforge CLIs - https://phabricator.wikimedia.org/T356262#9685074 (dcaro)
[16:20:10] cloud-services-team, Toolforge (Toolforge iteration 08): Harbor uploads sometimes fail due to tmpfs space on project-proxy - https://phabricator.wikimedia.org/T354116#9685075 (dcaro)
[16:20:15] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 08): [docs] Create a tutorial on how to deploy a Node.js app using Build Service - https://phabricator.wikimedia.org/T353313#9685076 (dcaro)
[16:20:19] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 08): [builds-api] Add dashboards with the new statistics - https://phabricator.wikimedia.org/T352764#9685077 (dcaro)
[16:20:23] Cloud Services Proposals, cloud-services-team (FY2023/2024-Q3-Q4): Decision Request - Incident Response Process - https://phabricator.wikimedia.org/T348887#9685078 (fnegri) I have created a draft document that is a WMCS version of [this page](https://wikitech.wikimedia.org/wiki/Incident_response/Runbook)...
[16:21:13] (CR) FNegri: [C:+1] vps: remove_instance: Improve Puppet certificate handling [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1016748 (owner: Majavah)
[16:21:21] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,buildservice-api,envvars-api] Investigate ways to present our multiple Openapi definitions to a future consolidated CLI client - https://phabricator.wikimedia.org/T354745#9685057 (dcaro)
[16:23:50] Toolforge (Toolforge iteration 08): [maintain-harbor] Have maintain-harbor use a robot account - https://phabricator.wikimedia.org/T361698#9685099 (dcaro) p:Triage→Medium
[16:23:56] Toolforge (Toolforge iteration 08): [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708#9685101 (dcaro) p:Triage→Medium
[16:28:28] cloud-services-team, VPS-Projects, collaboration-services, Puppet (Puppet 7.0): Update devtools project puppetmaster - https://phabricator.wikimedia.org/T360470#9685143 (Dzahn) ` Error 500 on SERVER: Server Error: Could not find class role::puppetserver::standalone for puppetmaster-1003.devtools....
[16:33:00] (CR) Majavah: [C:+2] vps: remove_instance: Improve Puppet certificate handling [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1016748 (owner: Majavah)
[16:33:56] (SystemdUnitDown) firing: The service unit postgresql@15-main.service is in failed status on host cloudbackup1001-dev.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1001-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:35:54] (Merged) jenkins-bot: vps: remove_instance: Improve Puppet certificate handling [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1016748 (owner: Majavah)
[16:37:30] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9685179 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/16 [jobs-cli] support job health checks
[16:51:01] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[16:56:41] (PrometheusRestarted) firing: Prometheus/cloud restarted: beware monitoring artifacts.
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[16:57:41] (CloudVPSDesignateLeaks) resolved: (3) Detected 8 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[17:06:15] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9685347 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/20 d/changelog: bump to 16.0.4
[17:13:50] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9685366 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/20 d/changelog: bump to 16.0.4
[17:16:41] (PrometheusRestarted) firing: (2) Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[17:21:41] (PrometheusRestarted) firing: (2) Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[17:21:56] (PrometheusRestarted) firing: (2) Prometheus/cloud restarted: beware monitoring artifacts.
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[17:26:41] (PrometheusRestarted) firing: (2) Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[17:36:28] (PuppetSyncFailure) firing: Failed to update Puppet repository /srv/git/operations/puppet on instance metricsinfra-puppetserver-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[17:41:32] Toolforge: "ftl" tool's perl5.32 webservice pod being frequently killed due to liveness probe failures - https://phabricator.wikimedia.org/T361652#9685461 (bd808) `lang=shell-session tools.ftl@tools-sgebastion-10:~$ kubectl get events LAST SEEN TYPE REASON OBJECT MESSAGE 57m...
[17:41:41] (PrometheusRestarted) resolved: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[17:52:51] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9685500 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/231 jobs-api: bump to 0.0.271-...
[18:08:01] Cloud-VPS (Project-requests): Reassign cloud VPS project "media-streaming" to bvibber - https://phabricator.wikimedia.org/T361730 (bvibber) NEW
[18:12:48] (PuppetFailure) firing: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:12:55] cloud-services-team: PuppetFailure Puppet failure on cloudbackup1001-dev:9100 - https://phabricator.wikimedia.org/T361731 (phaultfinder) NEW
[18:22:48] (PuppetFailure) firing: (2) Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:22:54] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T361732 (phaultfinder) NEW
[18:50:56] (SystemdUnitDown) firing: The systemd unit postgresql@15-main.service on node cloudbackup1001-dev has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1001-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:51:02] cloud-services-team: SystemdUnitDown Unit postgresql@15-main.service on node cloudbackup1001-dev has been down for long. - https://phabricator.wikimedia.org/T361733 (phaultfinder) NEW
[19:08:56] (SystemdUnitDown) firing: The service unit backup_cinder_volumes.service is in failed status on host cloudbackup1002-dev.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:12:41] (CloudVPSDesignateLeaks) firing: (2) Detected 16 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:15:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[19:16:58] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.restart_openstack (exit_code=99)
[19:17:41] (CloudVPSDesignateLeaks) firing: (3) Detected 16 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:17:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[19:19:00] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.restart_openstack (exit_code=99)
[19:22:19] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[19:24:33] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[19:59:38] Cloud-VPS (Project-requests): Reassign cloud VPS project "media-streaming" to bvibber - https://phabricator.wikimedia.org/T361730#9685907 (bd808) Open→Resolved a:bd808 {{Done}} I left your legacy account as a member of the project. Please feel free to remove it if there is no longer an...
[20:10:00] cloud-services-team, VPS-Projects, collaboration-services, Puppet (Puppet 7.0): Update devtools project puppetmaster - https://phabricator.wikimedia.org/T360470#9685971 (Andrew) I'm not sure this is the cause of the problem, but is there any reason to have your new puppetserver manage itself rath...
[20:22:41] (CloudVPSDesignateLeaks) firing: (3) Detected 16 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:27:41] (CloudVPSDesignateLeaks) resolved: (3) Detected 16 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:46:31] cloud-services-team, Cloud-VPS: cloud-init timeout too short on Bookworm - https://phabricator.wikimedia.org/T361749 (Andrew) NEW
[20:51:16] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[21:00:56] (SystemdUnitDown) firing: The systemd unit backup_cinder_volumes.service on node cloudbackup1001-dev has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1001-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[21:01:02] cloud-services-team: SystemdUnitDown Unit backup_cinder_volumes.service on node cloudbackup1001-dev has been down for long.
- https://phabricator.wikimedia.org/T361751 (phaultfinder) NEW
[21:05:56] (SystemdUnitDown) firing: (2) The systemd unit backup_cinder_volumes.service on node cloudbackup1001-dev has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[21:06:01] cloud-services-team: SystemdUnitDown - https://phabricator.wikimedia.org/T360279#9686146 (phaultfinder)
[21:15:39] Toolforge: "ftl" tool's perl5.32 webservice pod being frequently killed due to liveness probe failures - https://phabricator.wikimedia.org/T361652#9686179 (bd808) Zooming out to look at the last 30 days of http status data, I'm actually wondering now if the problem is just horizontal scaling of this Perl CGI...
[21:29:02] cloud-services-team, Cloud-VPS: cloud-init timeout too short on Bookworm - https://phabricator.wikimedia.org/T361749#9686209 (LucasWerkmeister) Apparently the Bookworm package no longer ships systemd unit files, so systemd is falling back to the SysV init scripts: `lang=shell-session,name=Buster lucaswe...
[21:34:04] cloud-services-team, Cloud-VPS: cloud-init timeout too short on Bookworm - https://phabricator.wikimedia.org/T361749#9686251 (taavi) `dpkg -L` only shows files from installed packages, and for some reason cloud-init is showing as not installed on Bookworm instances: `lang=shell-session taavi@tools-bastio...
[21:37:21] cloud-services-team, Cloud-VPS: cloud-init timeout too short on Bookworm - https://phabricator.wikimedia.org/T361749#9686258 (Dzahn) >>! In T361749#9686209, @LucasWerkmeister wrote: > Apparently the Bookworm package no longer ships systemd unit files, so systemd is falling back to the SysV init scripts:...
[21:48:15] (PS2) Dzahn: delete aphlict.discovery dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013417 (https://phabricator.wikimedia.org/T360413)
[21:48:30] (CR) Dzahn: [V:+2 C:+2] delete aphlict.discovery dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013417 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn)
[21:59:35] cloud-services-team (FY2023/2024-Q3-Q4), Infrastructure-Foundations, Spicerack, SRE-tools, Patch-For-Review: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647#9686403 (bking)
[22:42:28] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance cloudinfra-cloudvps-puppetserver-2 in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[22:47:28] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance cloudinfra-cloudvps-puppetserver-2 in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[23:10:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance metricsinfra-puppetserver-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[23:10:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance bastion on project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[23:14:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance project-proxy-puppetserver-1 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[23:17:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance clouddb-services-puppetserver-1 on project clouddb-services - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[23:20:28] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance bastion on project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[23:21:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance cloudinfra-internal-puppetserver-1 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[23:28:28] (PuppetAgentFailure) firing: Puppet agent failure detected on instance cloudinfra-cloudvps-puppetserver-2 in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[23:34:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance gitlab-runners-puppetserver-01 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[23:38:28] (PuppetAgentFailure) firing: (2) Puppet agent failure detected on instance cloudinfra-cloudvps-puppetserver-1 in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[23:47:16] (PS2) Krinkle: frontend: Add optional CODESEARCH_HOUND_BASE for local Hound API [labs/codesearch] - https://gerrit.wikimedia.org/r/1016479
[23:47:16] (PS1) Krinkle: frontend: Add ?debug=1 to access debug log messages [labs/codesearch] - https://gerrit.wikimedia.org/r/1016886
[23:47:17] (PS1) Krinkle: frontend: Change Dockerport to expose port 3003 instead of port 80 [labs/codesearch] - https://gerrit.wikimedia.org/r/1016887
[23:48:28] (PuppetAgentFailure) firing: (2) Puppet agent failure detected on instance cloudinfra-cloudvps-puppetserver-1 in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[23:48:30] (CR) Krinkle: [C:+2] frontend: Add ?debug=1 to access debug log messages [labs/codesearch] - https://gerrit.wikimedia.org/r/1016886 (owner: Krinkle)
[23:48:42] (CR) Krinkle: [C:+2] frontend: Add optional CODESEARCH_HOUND_BASE for local Hound API [labs/codesearch] - https://gerrit.wikimedia.org/r/1016479 (owner: Krinkle)
[23:49:20] (Merged) jenkins-bot: frontend: Add ?debug=1 to access debug log messages [labs/codesearch] - https://gerrit.wikimedia.org/r/1016886 (owner: Krinkle)
[23:49:29] (Merged) jenkins-bot: frontend: Add optional CODESEARCH_HOUND_BASE for local Hound API [labs/codesearch] - https://gerrit.wikimedia.org/r/1016479 (owner: Krinkle)