[00:01:58] Toolforge (Software install/update): Missing packages on dev.toolforge.org - https://phabricator.wikimedia.org/T360488#9644465 (JJMC89)
[00:08:28] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:10:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[00:11:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[00:13:28] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:13:47] Toolforge (Software install/update): Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#9644511 (bd808)
[00:13:56] Toolforge (Software install/update): Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#9644509 (bd808) @Anomie, can you run the scripts that need these Perl libraries from inside of a `webservice perl5.32 shell` container session? Or do they also n...
[00:16:49] (TfInfraTestDestroyFailed) resolved: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[00:24:17] Toolforge (Software install/update): Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#9644546 (bd808) Something that is probably under advertised related to my `webservice perl5.32 shell` question is that webservice passes extra cli args into the...
[00:26:01] Toolforge (Software install/update): Provide a Redis container for use within a tool's namespace - https://phabricator.wikimedia.org/T360378#9644548 (bd808) Open→In progress p:Triage→Medium a:bd808
[00:30:04] (PS1) BryanDavis: tox: Bump Python interpreter to 3.9 [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1012796
[00:30:06] (PS1) BryanDavis: Add redis image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1012797 (https://phabricator.wikimedia.org/T360378)
[00:33:41] (CR) BryanDavis: [C:+2] tox: Bump Python interpreter to 3.9 [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1012796 (owner: BryanDavis)
[00:34:15] (Merged) jenkins-bot: tox: Bump Python interpreter to 3.9 [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1012796 (owner: BryanDavis)
[01:17:39] (ProbeDown) firing: Service toolsbeta-test-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolsbeta-test-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[01:18:50] (ProbeDown) firing: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) -
https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[01:22:38] (ProbeDown) resolved: Service toolsbeta-test-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolsbeta-test-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[01:23:50] (ProbeDown) resolved: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:10:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[03:18:58] (CR) Andrew Bogott: [C:+1] vps: refresh_puppet_certs: Fix for Puppet 7 [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1012355 (https://phabricator.wikimedia.org/T351453) (owner: Majavah)
[04:00:05] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9644719 (Andrew)
[04:01:43] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9644721 (Andrew) a:Andrew→Jhancock.wm Sorry for the slow response! I hope I've now included all that you need.
[06:10:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[08:05:50] (ProbeDown) firing: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:10:50] (ProbeDown) resolved: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:10:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[09:17:45] (CR) Majavah: [C:+2] vps: refresh_puppet_certs: Fix for Puppet 7 [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1012355 (https://phabricator.wikimedia.org/T351453) (owner: Majavah)
[09:17:53] (CR) Majavah: [C:+2] vps: refresh_puppet_certs: Fix Puppet agent profile name [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1012359 (owner: Majavah)
[09:18:03] (CR) Majavah: [C:+2] vps: remove_instance: Use Puppet 7 for cert cleanup [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1012370 (https://phabricator.wikimedia.org/T351453) (owner: Majavah)
[09:21:29] (Merged) jenkins-bot: vps: refresh_puppet_certs: Fix for Puppet 7 [cloud/wmcs-cookbooks] -
https://gerrit.wikimedia.org/r/1012355 (https://phabricator.wikimedia.org/T351453) (owner: Majavah)
[09:21:30] (Merged) jenkins-bot: vps: refresh_puppet_certs: Fix Puppet agent profile name [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1012359 (owner: Majavah)
[09:21:31] (Merged) jenkins-bot: vps: remove_instance: Use Puppet 7 for cert cleanup [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1012370 (https://phabricator.wikimedia.org/T351453) (owner: Majavah)
[09:33:28] (InstanceDown) firing: Project toolsbeta instance toolsbeta-puppetdb-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:43:28] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-puppetdb-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:02:42] Toolforge: [harbor] Find a way to manage toolforge project policies with code - https://phabricator.wikimedia.org/T360509 (Slst2020) NEW
[10:05:27] Toolforge: [harbor] Find a way to manage toolforge project policies with code - https://phabricator.wikimedia.org/T360509#9645131 (dcaro) One issue of doing it at deploy/install time is that you can't change them after without redeploying.
[10:05:55] Toolforge: [harbor] Find a way to manage toolforge project policies with code - https://phabricator.wikimedia.org/T360509#9645133 (dcaro) Note also that maintain-harbor already manages the policies for all the other projects in harbor.
[10:10:11] Toolforge: php-cgi for dev.toolforge.org - https://phabricator.wikimedia.org/T360511 (Steenth) NEW
[10:11:47] Toolforge: php-cli for dev.toolforge.org - https://phabricator.wikimedia.org/T360511#9645162 (Steenth)
[10:21:54] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'tools-checker'
[10:22:09] !log taavi@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=99) with prefix 'tools-checker'
[10:31:23] (PS1) Majavah: vps: create_instance: do not assume k8s-specific security group [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013013
[10:32:40] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.openstack.quota_increase
[10:32:48] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0)
[10:33:23] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'tools-checker'
[10:33:42] !log taavi@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=99) with prefix 'tools-checker'
[10:34:11] (CR) CI reject: [V:-1] vps: create_instance: do not assume k8s-specific security group [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013013 (owner: Majavah)
[10:34:14] (PS1) Muehlenhoff: Remove labweb.discovery.wmnet dummy cert [labs/private] - https://gerrit.wikimedia.org/r/1013014
[10:34:22] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'tools-checker'
[10:34:25] (PS2) Majavah: vps: create_instance: do not assume k8s-specific security group [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013013
[10:35:51] Toolforge (Toolforge iteration 07), Epic: Upgrade toolschecker hosts to bookworm - https://phabricator.wikimedia.org/T360514 (taavi) NEW
[10:37:08] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.create_instance_with_prefix
(exit_code=0) with prefix 'tools-checker'
[10:37:17] (CR) CI reject: [V:-1] vps: create_instance: do not assume k8s-specific security group [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013013 (owner: Majavah)
[10:38:47] (PS3) Majavah: vps: create_instance: do not assume k8s-specific security group [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013013
[10:39:03] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.refresh_puppet_certs on tools-checker-5.tools.eqiad1.wikimedia.cloud
[10:40:49] !log taavi@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=99) on tools-checker-5.tools.eqiad1.wikimedia.cloud
[10:43:26] Toolforge: [harbor] Find a way to manage toolforge project policies with code - https://phabricator.wikimedia.org/T360509#9645235 (Slst2020) >>! In T360509#9645133, @dcaro wrote: > Note also that maintain-harbor already manages the policies for all the other projects in harbor. Then maybe add this to mainta...
[10:43:28] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance tools-checker-5 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[10:45:20] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.refresh_puppet_certs on tools-checker-5.tools.eqiad1.wikimedia.cloud
[10:49:24] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=0) on tools-checker-5.tools.eqiad1.wikimedia.cloud
[10:49:47] (CR) Majavah: [C:+1] "Thanks! Forgot that this existed too."
[labs/private] - https://gerrit.wikimedia.org/r/1013014 (owner: Muehlenhoff)
[10:52:51] (CR) Muehlenhoff: [V:+2 C:+2] Remove labweb.discovery.wmnet dummy cert [labs/private] - https://gerrit.wikimedia.org/r/1013014 (owner: Muehlenhoff)
[10:53:28] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance tools-checker-5 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[11:04:57] Toolforge: [harbor] Find a way to manage toolforge project policies with code - https://phabricator.wikimedia.org/T360509#9645274 (dcaro)
[11:16:43] !log dcaro@urcuchillay toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component jobs-api
[11:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:17:16] !log dcaro@urcuchillay toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component jobs-api
[11:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:18:15] Toolforge: [harbor,infra] Find a way to manage toolforge project policies with code - https://phabricator.wikimedia.org/T360509#9645314 (dcaro)
[11:18:37] Toolforge: [harbor,infra] Find a way to manage toolforge project policies with code - https://phabricator.wikimedia.org/T360509#9645311 (dcaro) p:Triage→Medium
[11:24:20] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component jobs-api
[11:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:24:54] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component jobs-api
[11:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:45:29] cloud-services-team, wikitech.wikimedia.org: Disable SSH key management on Wikitech - https://phabricator.wikimedia.org/T359544#9645391
(SLyngshede-WMF)
[12:10:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[13:10:41] (CloudVPSDesignateLeaks) firing: Detected 16 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:11:05] Toolforge: php-cli for dev.toolforge.org - https://phabricator.wikimedia.org/T360511#9645569 (dcaro) Open→Resolved a:dcaro You can get a shell with php and curl from your tool with access to all your scripts by running: ` tools.dcaro-test11@tools-sgebastion-10:~$ toolforge webservice p...
[13:15:41] (CloudVPSDesignateLeaks) firing: (5) Detected 16 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:16:21] cloud-services-team, VPS-Projects, collaboration-services, Puppet (Puppet 7.0): Update devtools project puppetmaster - https://phabricator.wikimedia.org/T360470#9645583 (Andrew)
[13:17:03] cloud-services-team, VPS-Projects, Continuous-Integration-Infrastructure, Release-Engineering-Team, Puppet (Puppet 7.0): Update Integration project puppetmaster - https://phabricator.wikimedia.org/T360461#9645584 (Andrew)
[13:17:05] Tools: 'hoiscript' tool uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T349913#9645585 (dcaro) @Hoi The files are still there, I'm guessing you did not have the time? Are you encountering any errors?
` droot@tools-nfs-2:~# du -hs /srv/tools/project/hoiscript/public_html/* | so...
[13:17:12] cloud-services-team, VPS-Projects, Puppet (Puppet 7.0): Update gitlab-runners project puppetmaster - https://phabricator.wikimedia.org/T360459#9645586 (Andrew)
[13:24:59] cloud-services-team, VPS-Projects, collaboration-services, Puppet (Puppet 7.0): Update devtools project puppetmaster - https://phabricator.wikimedia.org/T360470#9645598 (Andrew)
[13:28:53] cloud-services-team, VPS-Projects, collaboration-services, Puppet (Puppet 7.0): Update devtools project puppetmaster - https://phabricator.wikimedia.org/T360470#9645610 (Andrew)
[13:29:42] cloud-services-team, VPS-Projects, collaboration-services, Puppet (Puppet 7.0): Update devtools project puppetmaster - https://phabricator.wikimedia.org/T360470#9645612 (Andrew)
[13:32:17] cloud-services-team, VPS-Projects, collaboration-services, Puppet (Puppet 7.0): Update devtools project puppetmaster - https://phabricator.wikimedia.org/T360470#9645624 (Andrew)
[13:34:12] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.remove_instance for instance tools-checker-04
[13:35:05] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-checker-04
[13:35:14] cloud-services-team, VPS-Projects, Continuous-Integration-Infrastructure, Release-Engineering-Team, Puppet (Puppet 7.0): Update Integration project puppetmaster - https://phabricator.wikimedia.org/T360461#9645631 (Andrew)
[13:35:26] cloud-services-team, VPS-Projects, Puppet (Puppet 7.0): Update gitlab-runners project puppetmaster - https://phabricator.wikimedia.org/T360459#9645632 (Andrew)
[13:39:15] cloud-services-team, Toolforge: Upgrade Toolforge acme-chief hosts to Debian Bullseye or later - https://phabricator.wikimedia.org/T311907#9645644 (taavi) Open→Resolved
[13:39:17] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS (Debian
Buster Deprecation), Toolforge, Epic, Goal: Toolforge: migrate to Debian Bullseye or later - https://phabricator.wikimedia.org/T311897#9645645 (taavi)
[13:41:28] (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[13:56:28] (PuppetAgentFailure) firing: Puppet agent failure detected on instance cloudinfra-internal-puppetmaster-02 in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[14:00:28] (PuppetAgentFailure) firing: Puppet agent failure detected on instance project-proxy-puppetmaster-01 in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[14:11:28] (PuppetAgentFailure) firing: (3) Puppet agent failure detected on instance cloud-puppetmaster-03 in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[14:11:28] (WidespreadPuppetAgentFailure) resolved: Widespread puppet agent failures in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[14:15:28] (PuppetAgentFailure) resolved: Puppet agent failure detected on instance project-proxy-puppetmaster-01 in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[14:16:19] Grid-Engine-to-K8s-Migration: Migrate mbh from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319883#9645804 (dcaro) >>! In T319883#9640650, @MBH wrote: > Thank you very much, I will try to rewrite my tools to dotnet app in the coming weeks. But after you updated...
[14:21:42] Toolforge (Toolforge iteration 07), Epic: Upgrade toolschecker hosts to bookworm - https://phabricator.wikimedia.org/T360514#9645851 (dcaro) p:Triage→High
[14:22:10] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS (Debian Buster Deprecation), Toolforge, Epic, Goal: Toolforge: migrate to Debian Bullseye or later - https://phabricator.wikimedia.org/T311897#9645865 (taavi)
[14:22:22] Toolforge: [jobs-api] Split the API, business, and k8s models - https://phabricator.wikimedia.org/T359808#9645871 (dcaro) a:dcaro
[14:22:39] Toolforge (Toolforge iteration 07): Upgrade toolschecker hosts to bookworm - https://phabricator.wikimedia.org/T360514#9645872 (taavi)
[14:23:03] Toolforge (Toolforge iteration 07): [jobs-api] Split the API, business, and k8s models - https://phabricator.wikimedia.org/T359808#9645875 (dcaro)
[14:23:25] Toolforge (Toolforge iteration 07): Upgrade toolschecker hosts to bookworm - https://phabricator.wikimedia.org/T360514#9645864 (taavi) Open→Resolved
[14:26:28] (PuppetAgentFailure) resolved: (3) Puppet agent failure detected on instance cloud-puppetmaster-03 in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[14:31:04] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.drain_node (T348643)
[14:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:31:11] T348643: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643
[14:37:29] (PS1) David Caro: ceph.drain: fix draining only the first osd of the batch [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013061
[14:56:32] (CR) David Caro: [C:+2] ceph.drain: fix draining only the first osd of the batch [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013061 (owner: David Caro)
[14:56:50] Grid-Engine-to-K8s-Migration: Migrate mbh from Toolforge GridEngine to Toolforge
Kubernetes - https://phabricator.wikimedia.org/T319883#9645982 (MBH) Thanks, works now.
[14:57:33] Toolforge (Software install/update): php-cli for dev.toolforge.org - https://phabricator.wikimedia.org/T360511#9645983 (JJMC89)
[14:58:24] Toolforge (Software install/update): php-cli for dev.toolforge.org - https://phabricator.wikimedia.org/T360511#9645985 (JJMC89) Resolved→Declined
[15:00:42] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T348643)
[15:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:00:47] T348643: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643
[15:00:53] (Merged) jenkins-bot: ceph.drain: fix draining only the first osd of the batch [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013061 (owner: David Caro)
[15:02:55] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.drain_node (T348643)
[15:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:07:50] (PS5) David Caro: ceph: use timedelta instead of integers [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990975
[15:07:58] (PS5) David Caro: ceph.drain_osd_node: improve logs [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990976
[15:08:06] (PS6) David Caro: ceph.osd.drain_node: force passing the cluster name [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990977
[15:08:14] (PS6) David Caro: ceph.osd.undrain_node: fix help and default batch param [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990978
[15:08:23] (PS6) David Caro: ceph: add missing cumin params [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990979
[15:09:05] (CR) CI reject: [V:-1] ceph: add missing cumin params [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990979 (owner: David Caro)
[15:10:49]
(TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[15:12:25] (CR) CI reject: [V:-1] ceph.osd.drain_node: force passing the cluster name [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990977 (owner: David Caro)
[15:12:31] (PS6) David Caro: ceph: use timedelta instead of integers [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990975
[15:12:32] (PS6) David Caro: ceph.drain_osd_node: improve logs [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990976
[15:12:34] (PS7) David Caro: ceph.osd.drain_node: force passing the cluster name [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990977
[15:12:42] (PS7) David Caro: ceph.osd.undrain_node: fix help and default batch param [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990978
[15:12:50] (PS7) David Caro: ceph: add missing cumin params [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990979
[15:16:14] (CR) CI reject: [V:-1] ceph.osd.drain_node: force passing the cluster name [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990977 (owner: David Caro)
[15:16:30] (CR) CI reject: [V:-1] ceph: add missing cumin params [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990979 (owner: David Caro)
[15:16:40] (CR) CI reject: [V:-1] ceph.osd.undrain_node: fix help and default batch param [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990978 (owner: David Caro)
[15:20:41] (CloudVPSDesignateLeaks) firing: (5) Detected 20 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:25:40]
!log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T348643)
[15:25:42] (CloudVPSDesignateLeaks) resolved: (5) Detected 20 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:25:45] T348643: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643
[15:27:57] (PS4) Majavah: vps: create_instance: do not assume k8s-specific security group [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013013
[15:27:57] (PS1) Majavah: vps: create_instance: Add flag to sign Puppet certs [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013080
[15:27:59] (PS1) Majavah: wmcs_libs: openstack: improve Neutron port handling [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013081
[15:28:06] (PS1) Majavah: toolforge: Add cookbook to add new K8s HAProxy node [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013082 (https://phabricator.wikimedia.org/T349206)
[15:28:57] PROBLEM - Host cloudcephosd1030 is DOWN: PING CRITICAL - Packet loss = 100%
[15:29:29] cloud-services-team, Toolforge (Toolforge iteration 07), Kubernetes, Patch-For-Review: [infra] Upgrade Toolforge K8s haproxies to Bookworm - https://phabricator.wikimedia.org/T349206#9646074 (taavi) a:taavi
[15:31:19] (CR) CI reject: [V:-1] wmcs_libs: openstack: improve Neutron port handling [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013081 (owner: Majavah)
[15:31:26] (CR) CI reject: [V:-1] toolforge: Add cookbook to add new K8s HAProxy node [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013082 (https://phabricator.wikimedia.org/T349206) (owner:
Majavah)
[15:31:32] (CR) CI reject: [V:-1] vps: create_instance: Add flag to sign Puppet certs [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013080 (owner: Majavah)
[15:33:05] cloud-services-team (FY2023/2024-Q3-Q4), Infrastructure-Foundations, Spicerack, SRE-tools, Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9646108 (bking) Unfortunately, we are plus the likelihood that there wi...
[15:35:17] (PS2) Majavah: vps: create_instance: Add flag to sign Puppet certs [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013080
[15:35:18] (PS2) Majavah: wmcs_libs: openstack: improve Neutron port handling [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013081
[15:35:20] (PS2) Majavah: toolforge: Add cookbook to add new K8s HAProxy node [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013082 (https://phabricator.wikimedia.org/T349206)
[15:38:27] (CR) CI reject: [V:-1] toolforge: Add cookbook to add new K8s HAProxy node [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013082 (https://phabricator.wikimedia.org/T349206) (owner: Majavah)
[15:39:22] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.drain_node (T348643)
[15:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:39:27] T348643: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643
[15:41:16] (PS3) Majavah: toolforge: Add cookbook to add new K8s HAProxy node [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013082 (https://phabricator.wikimedia.org/T349206)
[15:44:45] (CR) CI reject: [V:-1] toolforge: Add cookbook to add new K8s HAProxy node [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013082 (https://phabricator.wikimedia.org/T349206) (owner: Majavah)
[15:49:19] !log taavi@cloudcumin1001 toolsbeta START -
Cookbook wmcs.toolforge.add_k8s_haproxy_node
[15:52:45] (CR) CI reject: [V:-1] toolforge: Add cookbook to add new K8s HAProxy node [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013082 (https://phabricator.wikimedia.org/T349206) (owner: Majavah)
[15:55:52] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.add_k8s_haproxy_node (exit_code=0)
[15:57:58] (PS5) Majavah: toolforge: Add cookbook to add new K8s HAProxy node [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013082 (https://phabricator.wikimedia.org/T349206)
[16:03:02] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 07), Goal, Patch-For-Review: [infra] Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664#9646234 (taavi)
[16:06:50] PAWS: Upgrade julia - https://phabricator.wikimedia.org/T360539 (rook) NEW
[16:09:04] PAWS: Upgrade julia - https://phabricator.wikimedia.org/T360539#9646287 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/391
[16:09:26] vivian-rook opened https://github.com/toolforge/paws/pull/391
[16:11:14] Tool-openstack-browser: List extra allowed service IPs on server detail - https://phabricator.wikimedia.org/T360541 (taavi) NEW
[16:11:29] Tool-openstack-browser: openstack-browser: List public object storage buckets - https://phabricator.wikimedia.org/T348884#9646323 (taavi)
[16:31:11] RECOVERY - Host cloudcephosd1030 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms
[17:02:12] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T348643)
[17:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[17:02:17] T348643: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643
[17:03:00] PAWS: Upgrade julia - https://phabricator.wikimedia.org/T360539#9646549 (github-toolforge-bot) vivian-rook closed
https://github.com/toolforge/paws/pull/391 [17:03:09] 10PAWS: 14Upgrade julia - 14https://phabricator.wikimedia.org/T360539#9646552 (10rook) 05Open→03Resolved [17:03:47] vivian-rook closed https://github.com/toolforge/paws/pull/391 [18:10:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [19:41:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 14 deleted instances on tools-puppetserver-01 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [19:48:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 12 deleted instances on toolsbeta-puppetserver-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [19:51:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 670 deleted instances on cloudinfra-cloudvps-puppetserver-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [19:56:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 1 deleted instances on metricsinfra-puppetserver-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [20:11:28] (PuppetStaleCertificates) firing: (2) Found non-revoked Puppet certificates for 672 deleted instances on cloudinfra-cloudvps-puppetserver-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates 
- https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [20:13:28] (InstanceDown) firing: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:23:28] (InstanceDown) resolved: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:26:28] (PuppetStaleCertificates) firing: (2) Found non-revoked Puppet certificates for 673 deleted instances on cloudinfra-cloudvps-puppetserver-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [21:10:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [22:41:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 14 deleted instances on tools-puppetserver-01 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [22:48:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 12 deleted instances on toolsbeta-puppetserver-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [22:56:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 1 deleted instances on metricsinfra-puppetserver-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [23:15:19] 10Quarry: Remove redis - 
https://phabricator.wikimedia.org/T360584 (10rook) 03NEW [23:26:28] (PuppetStaleCertificates) firing: (2) Found non-revoked Puppet certificates for 695 deleted instances on cloudinfra-cloudvps-puppetserver-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [23:28:50] (ProbeDown) firing: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [23:33:50] (ProbeDown) resolved: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown