[00:01:31] (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-21.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[00:44:14] (CR) Andrew Bogott: "recheck" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998556 (owner: Andrew Bogott)
[01:22:22] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[01:27:22] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[01:59:52] Tool-Pageviews: Improve instructions for Massviews to explain how to combine queries - https://phabricator.wikimedia.org/T356953 (John_Cummings)
[02:02:32] Toolforge (Toolforge iteration 05): [toolforge-cd] discuss the possibility of removing tests from merge request ci/cd pipelines - https://phabricator.wikimedia.org/T353740 (Raymond_Ndibe) >>! In T353740#9520557, @dcaro wrote: >>>! In T353740#9519630, @Raymond_Ndibe wrote: >>>>! In T353740#9429362, @dcaro wro...
[03:01:31] (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-21.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[03:37:22] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[03:42:22] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[04:15:57] Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, User-Raymond_Ndibe: [builds-api] refactor build start response type - https://phabricator.wikimedia.org/T356724 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge...
[04:17:04] Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, User-Raymond_Ndibe: alert users when they are about to exceed their harbor quota - https://phabricator.wikimedia.org/T353535 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/build...
[04:18:13] Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, User-Raymond_Ndibe: alert users when they are about to exceed their harbor quota - https://phabricator.wikimedia.org/T353535 (CodeReviewBot) raymond-ndibe updated https://gitlab.wikimedia.org/repos/cloud/toolforge/buil...
[06:01:31] (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-21.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[06:18:56] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[06:23:56] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[06:36:22] Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, User-Raymond_Ndibe: alert users when they are about to exceed their harbor quota - https://phabricator.wikimedia.org/T353535 (Raymond_Ndibe)
[06:37:12] Toolforge Build Service, Documentation: [tbs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092 (Raymond_Ndibe)
[06:37:35] Toolforge (Toolforge iteration 05), Toolforge Build Service, User-Raymond_Ndibe: [tbs] Give a meaningful error message when a user exceeds their Harbor quota - https://phabricator.wikimedia.org/T351178 (Raymond_Ndibe) Open→In progress
[08:56:42] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-42
[08:57:20] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-42
[08:57:31] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-43
[08:58:08] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-43
[08:58:14] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[08:59:01] !log taavi@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster
[09:00:19] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[09:00:48] Toolforge (Toolforge iteration 05): Support probes in kubernetes webservices - https://phabricator.wikimedia.org/T341919 (dcaro) p:Triage→Medium
[09:00:51] Grid-Engine-to-K8s-Migration: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905 (dcaro)
[09:01:13] Toolforge (Toolforge iteration 05): [webservice] Add health probes for port 8080 - https://phabricator.wikimedia.org/T356907 (dcaro) duplicate→Resolved
[09:01:31] (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-21.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[09:04:28] (InstanceDown) firing: Project tools instance tools-k8s-worker-43 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:07:41] cloud-services-team, wikitech.wikimedia.org: Upgrade cloudweb hosts to Bullseye - https://phabricator.wikimedia.org/T356966 (taavi)
[09:09:28] (InstanceDown) resolved: Project tools instance tools-k8s-worker-43 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:09:59] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-10.tools.eqiad1.wikimedia.cloud to the cluster
[09:09:59] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[09:10:16] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-44
[09:10:54] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-44
[09:11:21] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-45
[09:11:57] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-45
[09:13:05] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[09:16:58] (InstanceDown) firing: (2) Project tools instance tools-k8s-worker-43 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:21:03] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-11.tools.eqiad1.wikimedia.cloud to the cluster
[09:21:03] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[09:21:27] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-46
[09:21:58] (InstanceDown) resolved: (2) Project tools instance tools-k8s-worker-43 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:22:03] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-46
[09:22:12] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-47
[09:22:49] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-47
[09:23:38] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[09:29:28] (InstanceDown) firing: Project tools instance tools-k8s-worker-47 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:32:52] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-12.tools.eqiad1.wikimedia.cloud to the cluster
[09:32:52] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[09:33:05] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-48
[09:33:40] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-48
[09:33:45] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-49
[09:34:20] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-49
[09:34:28] (InstanceDown) resolved: (2) Project tools instance tools-k8s-worker-44 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:35:51] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[09:37:41] (CloudVPSDesignateLeaks) firing: Detected 10 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:42:41] (CloudVPSDesignateLeaks) firing: (2) Detected 10 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:46:10] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-13.tools.eqiad1.wikimedia.cloud to the cluster
[09:46:10] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[10:02:00] Toolforge: toolforge k8s control plane freezing and other stability issues - https://phabricator.wikimedia.org/T333922 (aborrero)
[10:02:36] Toolforge, cloud-services-team, User-aborrero: toolforge kubernetes: create roll-reboot cookbook - https://phabricator.wikimedia.org/T333379 (aborrero) Stalled→Resolved a:aborrero We have a working cookbook now `wmcs.toolforge.k8s.reboot`.
[10:05:48] Toolforge, cloud-services-team, User-aborrero: toolforge k8s automation: introduce option to force reboot each node directly - https://phabricator.wikimedia.org/T356969 (aborrero)
[10:06:38] (CR) David Caro: [C: -1] "I think there was some confusion on what the errors we were seeing were." [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 (owner: Andrew Bogott)
[10:07:15] Toolforge, cloud-services-team, User-aborrero: toolforge k8s automation: introduce option to force reboot each node directly - https://phabricator.wikimedia.org/T356969 (aborrero)
[10:08:25] Toolforge, cloud-services-team, User-aborrero: toolforge k8s automation: introduce option to reboot a node if the uptime is higher than XYZ - https://phabricator.wikimedia.org/T356970 (aborrero)
[10:11:06] Toolforge (Toolforge iteration 05): [bump-version.sh,builds-api,envvars-api] bump the version in the openapi definition when bumping the package version - https://phabricator.wikimedia.org/T356972 (dcaro)
[10:11:15] Toolforge (Toolforge iteration 05): [bump-version.sh,builds-api,envvars-api] bump the version in the openapi definition when bumping the package version - https://phabricator.wikimedia.org/T356972 (dcaro) p:Triage→Low
[10:12:16] (CR) Arturo Borrero Gonzalez: "What about this idea instead: https://phabricator.wikimedia.org/T356970" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 (owner: Andrew Bogott)
[10:13:01] (CR) Arturo Borrero Gonzalez: "this should allow us to operate with this cookbook in a more convenient way." [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 (owner: Andrew Bogott)
[10:17:02] Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, User-Raymond_Ndibe: [tbs] Give a meaningful error message when a user exceeds their Harbor quota - https://phabricator.wikimedia.org/T351178 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud...
[10:18:05] Toolforge Jobs framework, User-aborrero: Allow exporting jobs list in YAML format - https://phabricator.wikimedia.org/T320575 (aborrero)
[10:19:20] Toolforge Jobs framework, User-aborrero: Allow exporting jobs list in YAML format - https://phabricator.wikimedia.org/T320575 (aborrero) p:Triage→Medium a:aborrero I always had this in the radar. I will be happy to introduce support for this.
[10:27:24] Toolforge: [OpenAPI] FIgure out and document how to do non-backwards compatible changes - https://phabricator.wikimedia.org/T356974 (dcaro)
[10:30:06] Toolforge: [OpenAPI] FIgure out and document how to do non-backwards compatible changes - https://phabricator.wikimedia.org/T356974 (dcaro) p:Triage→High
[10:35:05] (CR) David Caro: "I think that's a good idea, though maybe instead of nodes with uptime > X, do something like "nodes that did not reboot since 'YYYY-mm-dd " [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 (owner: Andrew Bogott)
[10:39:37] Cloud Services Proposals, cloud-services-team (FY2023/2024-Q3-Q4): Decision Request - Incident Response Process - https://phabricator.wikimedia.org/T348887 (fnegri) p:Triage→Medium
[10:49:23] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_node for host toolsbeta-test-k8s-worker-10
[10:49:54] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host toolsbeta-test-k8s-worker-10
[10:53:54] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the toolsbeta cluster
[11:01:01] !log taavi@cloudcumin1001 toolsbeta Added a new k8s worker-nfs toolsbeta-test-k8s-worker-nfs-2.toolsbeta.eqiad1.wikimedia.cloud to the cluster
[11:01:01] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the toolsbeta cluster
[11:03:25] Toolforge (Toolforge iteration 05): [toolforge-cd] discuss the possibility of removing tests from merge request ci/cd pipelines - https://phabricator.wikimedia.org/T353740 (dcaro) > I think I got what you mean @dcaro. > however I searched for the our last decision request on gitlab workflow and this was what...
[11:05:55] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_node for host toolsbeat-test-k8s-worker-6
[11:05:55] !log taavi@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=99) for host toolsbeat-test-k8s-worker-6
[11:06:01] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_node for host toolsbeta-test-k8s-worker-6
[11:06:34] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host toolsbeta-test-k8s-worker-6
[11:09:38] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:22:36] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the toolsbeta cluster
[11:25:31] (CR) David Caro: [C: +1] "LGTM, though https://phabricator.wikimedia.org/T356970 might make it not necessary, as in we might want to have some graceful and non-gra" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998556 (owner: Andrew Bogott)
[11:30:12] !log taavi@cloudcumin1001 toolsbeta Added a new k8s worker-nfs toolsbeta-test-k8s-worker-nfs-3.toolsbeta.eqiad1.wikimedia.cloud to the cluster
[11:30:12] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the toolsbeta cluster
[11:36:22] Cloud-VPS, Toolforge: define a prebaked way to temporarily disable access for a user or Tool - https://phabricator.wikimedia.org/T147242 (dcaro) >>! In T147242#9522537, @Andrew wrote: > The topic of this doesn't quite fit with the initial description. I /think/ that T170355 is the same ask (and it's don...
[11:37:29] Cloud-VPS, cloud-services-team: Improve cloudgw filter between VM instances and cloud-private - https://phabricator.wikimedia.org/T356986 (cmooney) p:Triage→Low
[11:42:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance toolsbeta-test-k8s-worker-nfs-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[11:43:35] Grid-Engine-to-K8s-Migration: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905 (dcaro)
[11:43:38] Toolforge (Toolforge iteration 05): Support probes in kubernetes webservices - https://phabricator.wikimedia.org/T341919 (dcaro)
[11:56:28] Cloud-VPS, cloud-services-team: Improve cloudgw filter between VM instances and cloud-private - https://phabricator.wikimedia.org/T356986 (cmooney) One thing I hadn't considered in the above was traffic to the 10.x WMF ranges from VMs. That all get's NATed as is. My instinct is it's possibly simplest t...
[12:01:31] (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-21.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[12:07:06] Grid-Engine-to-K8s-Migration, User-dcaro: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905 (dcaro)
[12:12:28] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance toolsbeta-test-k8s-worker-nfs-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:14:33] Toolforge: Listeria bot sometimes gets stuck with 104 errors from Wikimedia APIs - https://phabricator.wikimedia.org/T356160 (dcaro) I'm still seeing >1k connections open from rustbot: ` root@tools-k8s-worker-101:~# lsof -p 10767 | grep TCP | wc 1009 10090 84756 `
[12:22:53] Toolforge: Listeria bot sometimes gets stuck with 104 errors from Wikimedia APIs - https://phabricator.wikimedia.org/T356160 (dcaro) Maybe the connections are not being closed properly? or there's some leak in the pool counter or something?
[12:25:02] Cloud-VPS, Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] [cinder] [ceph] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904 (dcaro) I think it's likely related though, it's quite possible that doing the snaptrim forces ceph to increase the IO on sectors of...
[12:58:52] cloud-services-team, wikitech.wikimedia.org: Upgrade cloudweb hosts to Bullseye - https://phabricator.wikimedia.org/T356966 (taavi) a:taavi
[13:03:01] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.grid.cleanup_queue_errors
[13:03:04] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.grid.cleanup_queue_errors (exit_code=0)
[13:06:31] (ToolsGridQueueProblem) resolved: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-21.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[13:13:18] cloud-services-team, wikitech.wikimedia.org: Upgrade cloudweb hosts to Bullseye - https://phabricator.wikimedia.org/T356966 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1002 for host cloudweb2002-dev.wikimedia.org with OS bullseye
[13:21:29] Grid-Engine-to-K8s-Migration: Migrate mbh from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319883 (MBH) @komla please, give me one more month to transfer, as you say it's possible on email on Feb 6.
[13:47:58] Cloud-VPS, cloud-services-team, Patch-For-Review, User-aborrero: Do not NAT traffic to cloud-private - https://phabricator.wikimedia.org/T356850 (taavi) Open→Resolved
[13:49:17] cloud-services-team, wikitech.wikimedia.org: Upgrade cloudweb hosts to Bullseye - https://phabricator.wikimedia.org/T356966 (taavi)
[13:51:28] Toolforge, Kubernetes: [Toolforge] Generic webservice not working on Kubernetes - https://phabricator.wikimedia.org/T277749 (fnegri) Open→Declined I'm boldly closing this as "Declined", the need for a "generic webservice image" is covered by the newer task {T355231}, which has a patch to create a...
[13:55:19] Cloud-VPS, Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] [cinder] [ceph] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904 (fnegri) 24 hours later, the snapshot has not been deleted yet. So either the snaptrim is taking a really long time, or it didn't ev...
[14:09:38] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:19:49] cloud-services-team, wikitech.wikimedia.org, Patch-For-Review: Upgrade cloudweb hosts to Bullseye - https://phabricator.wikimedia.org/T356966 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1002 for host cloudweb2002-dev.wikimedia.org with OS bullseye completed: - clou...
[14:24:41] Tool-refill: the PMC prefix for a value of pmc attribute should not be added - https://phabricator.wikimedia.org/T357000 (Maxim_Masiutin)
[14:39:01] Cloud-VPS, Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] [cinder] [ceph] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904 (fnegri) I tried reproducing this on a small empty volume, and I could create and delete a snapshot without issues. So I think the i...
[15:12:28] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance toolsbeta-test-k8s-worker-nfs-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:15:33] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.vps.refresh_puppet_certs on toolsbeta-test-k8s-worker-nfs-2.toolsbeta.eqiad1.wikimedia.cloud
[15:16:49] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=0) on toolsbeta-test-k8s-worker-nfs-2.toolsbeta.eqiad1.wikimedia.cloud
[15:18:20] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.vps.refresh_puppet_certs on toolsbeta-test-k8s-worker-nfs-3.toolsbeta.eqiad1.wikimedia.cloud
[15:19:34] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=0) on toolsbeta-test-k8s-worker-nfs-3.toolsbeta.eqiad1.wikimedia.cloud
[15:24:39] (ProbeDown) resolved: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:31:53] (PS2) Majavah: Provide a standalone bookworm-web container [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/991595 (https://phabricator.wikimedia.org/T355231)
[15:31:59] (CR) Majavah: [C: +2] Provide a standalone bookworm-web container [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/991595 (https://phabricator.wikimedia.org/T355231) (owner: Majavah)
[15:32:27] (Merged) jenkins-bot: Provide a standalone bookworm-web container [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/991595 (https://phabricator.wikimedia.org/T355231) (owner: Majavah)
[15:32:44] Toolforge, Composer, Patch-For-Review: Switch Toolforge installation of "composer" to use the Debian package - https://phabricator.wikimedia.org/T287900 (taavi) In progress→Resolved
[15:34:32] Toolforge, cloud-services-team (Kanban), Patch-For-Review: Toolforge grid deployment/management automation - https://phabricator.wikimedia.org/T298948 (taavi)
[15:35:22] Toolforge, cloud-services-team, Infrastructure-Foundations, SRE-tools: spicerack: introduce GridEngine controller - https://phabricator.wikimedia.org/T300032 (taavi) Stalled→Declined
[15:37:28] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance toolsbeta-test-k8s-worker-nfs-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:38:11] Toolforge (Toolforge iteration 05), cloud-services-team, Kubernetes, Patch-For-Review: Create Bookworm-based standalone webservice image - https://phabricator.wikimedia.org/T355231 (CodeReviewBot) taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/-/merge_requests/12 Ad...
[15:40:48] (PuppetFailure) firing: Puppet has failed on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:40:58] cloud-services-team: PuppetFailure Puppet failure on cloudweb2002-dev:9100 - https://phabricator.wikimedia.org/T357017 (phaultfinder)
[15:42:28] (PuppetAgentNoResources) resolved: (2) No Puppet resources found on instance toolsbeta-test-k8s-worker-nfs-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[16:01:18] (PuppetFailure) resolved: Puppet has failed on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:01:58] cloud-services-team: PuppetFailure Puppet failure on cloudweb2002-dev:9100 - https://phabricator.wikimedia.org/T357017 (taavi) Open→Resolved a:taavi
[16:22:01] (CR) Andrew Bogott: [C: +2] k8s.kubernetes.reboot: Wait at most one minute before doing a hard reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998556 (owner: Andrew Bogott)
[16:24:31] (CR) Andrew Bogott: "I like Arturo's suggestion, but regardless of the re-run application I would really like this (or, really any cookbook that operates on a " [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 (owner: Andrew Bogott)
[16:27:06] cloud-services-team, wikitech.wikimedia.org, Patch-For-Review: Upgrade cloudweb hosts to Bullseye - https://phabricator.wikimedia.org/T356966 (taavi)
[16:28:57] (Merged) jenkins-bot: k8s.kubernetes.reboot: Wait at most one minute before doing a hard reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998556 (owner: Andrew Bogott)
[16:40:16] Cloud-VPS, cloud-services-team, SRE: Restrict traffic from instances to private IPs on cloudgw level - https://phabricator.wikimedia.org/T350132 (cmooney) Open→Declined Closing this one, let's discuss on duplicate T356986 (sorry bout that!)
[16:44:10] Cloud-VPS, cloud-services-team: Improve cloudgw filter between VM instances and cloud-private - https://phabricator.wikimedia.org/T356986 (cmooney)
[16:49:04] Cloud-VPS, cloud-services-team, User-aborrero: Improve cloudgw filter between VM instances and cloud-private - https://phabricator.wikimedia.org/T356986 (aborrero)
[16:54:58] cloud-services-team, wikitech.wikimedia.org, LDAP, SecTeam-Processed, Security: Problem with logging into my developer account. - https://phabricator.wikimedia.org/T356958 (sbassett) p:Triage→Low
[17:05:11] PAWS: Upgrade Jupyterlab - https://phabricator.wikimedia.org/T357027 (rook)
[17:06:32] PAWS: Upgrade Jupyterlab - https://phabricator.wikimedia.org/T357027 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/372
[17:06:44] vivian-rook opened https://github.com/toolforge/paws/pull/372
[17:15:44] Cloud-VPS (Quota-requests): Floating IP request for project Openvas - https://phabricator.wikimedia.org/T356830 (KHurd-WMF) {F41816993} {F41817016} These are the current settings I have, I'll take any help.
[17:23:45] Cloud-VPS, cloud-services-team, User-aborrero: Improve cloudgw filter between VM instances and cloud-private - https://phabricator.wikimedia.org/T356986 (cmooney)
[17:30:21] Cloud-VPS, cloud-services-team, User-aborrero: Improve cloudgw filter between VM instances and cloud-private - https://phabricator.wikimedia.org/T356986 (cmooney)
[17:31:23] (PS1) Btullis: Move #data-platform-sre announcements to a dedicated channel [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/999007 (https://phabricator.wikimedia.org/T352783)
[17:45:01] Cloud-VPS (Quota-requests): Floating IP request for project Openvas - https://phabricator.wikimedia.org/T356830 (bd808) Per the diagram on https://hackertarget.com/openvas-tutorial-tips/ it looks like the web gui for OpenVAS runs on port 9392 by default. That port is active on openvasv1.openvas.eqiad1.wikime...
[17:45:34] (CR) Bking: Move #data-platform-sre announcements to a dedicated channel (1 comment) [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/999007 (https://phabricator.wikimedia.org/T352783) (owner: Btullis)
[17:55:47] (PS2) Btullis: Move #data-platform-sre announcements to a dedicated channel [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/999007 (https://phabricator.wikimedia.org/T352783)
[17:56:28] (CR) Btullis: Move #data-platform-sre announcements to a dedicated channel (1 comment) [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/999007 (https://phabricator.wikimedia.org/T352783) (owner: Btullis)
[18:14:56] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[18:15:31] !log andrew@cloudcumin1001 cloudvirt-canary END (ERROR) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=97)
[18:15:59] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[18:17:14] !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[18:23:30] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[18:26:18] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[18:32:10] (CephSlowOps) firing: Ceph cluster in eqiad has 6 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[18:37:09] (CephSlowOps) resolved: Ceph cluster in eqiad has 6 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[19:15:59] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[19:17:42] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[19:29:20] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[19:31:02] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[19:44:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[19:46:19] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[22:38:11] Toolforge, User-bd808: Is wikibugs alive? - https://phabricator.wikimedia.org/T357076 (bd808) blah blah blah
[22:40:43] Toolforge, User-bd808: Is wikibugs alive? - https://phabricator.wikimedia.org/T357076 (bd808) Open→Resolved a:bd808 yup
[22:51:07] (PS1) Majavah: Bump stylelint to 15.10.1 [labs/libraryupgrader/config] - https://gerrit.wikimedia.org/r/999115