[00:01:31] <wmcs-alerts>	 (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-21.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[00:44:14] <wikibugs>	 (CR) Andrew Bogott: "recheck" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998556 (owner: Andrew Bogott)
[01:22:22] <jinxer-wm>	 (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[01:27:22] <jinxer-wm>	 (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[01:59:52] <wikibugs>	 Tool-Pageviews: Improve instructions for Massviews to explain how to combine queries - https://phabricator.wikimedia.org/T356953 (John_Cummings)
[02:02:32] <wikibugs>	 Toolforge (Toolforge iteration 05): [toolforge-cd] discuss the possibility of removing tests from merge request ci/cd pipelines - https://phabricator.wikimedia.org/T353740 (Raymond_Ndibe) >>! In T353740#9520557, @dcaro wrote: >>>! In T353740#9519630, @Raymond_Ndibe wrote: >>>>! In T353740#9429362, @dcaro wro...
[03:01:31] <wmcs-alerts>	 (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-21.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[03:37:22] <jinxer-wm>	 (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[03:42:22] <jinxer-wm>	 (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[04:15:57] <wikibugs>	 Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, User-Raymond_Ndibe: [builds-api] refactor build start response type - https://phabricator.wikimedia.org/T356724 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge...
[04:17:04] <wikibugs>	 Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, User-Raymond_Ndibe: alert users when they are about to exceed their harbor quota - https://phabricator.wikimedia.org/T353535 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/build...
[04:18:13] <wikibugs>	 Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, User-Raymond_Ndibe: alert users when they are about to exceed their harbor quota - https://phabricator.wikimedia.org/T353535 (CodeReviewBot) raymond-ndibe updated https://gitlab.wikimedia.org/repos/cloud/toolforge/buil...
[06:01:31] <wmcs-alerts>	 (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-21.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[06:18:56] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[06:23:56] <jinxer-wm>	 (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[06:36:22] <wikibugs>	 Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, User-Raymond_Ndibe: alert users when they are about to exceed their harbor quota - https://phabricator.wikimedia.org/T353535 (Raymond_Ndibe)
[06:37:12] <wikibugs>	 Toolforge Build Service, Documentation: [tbs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092 (Raymond_Ndibe)
[06:37:35] <wikibugs>	 Toolforge (Toolforge iteration 05), Toolforge Build Service, User-Raymond_Ndibe: [tbs] Give a meaningful error message when a user exceeds their Harbor quota - https://phabricator.wikimedia.org/T351178 (Raymond_Ndibe) Open→In progress
[08:56:42] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-42
[08:57:20] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-42
[08:57:31] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-43
[08:58:08] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-43
[08:58:14] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[08:59:01] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster
[09:00:19] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[09:00:48] <wikibugs>	 Toolforge (Toolforge iteration 05): Support probes in kubernetes webservices - https://phabricator.wikimedia.org/T341919 (dcaro) p:Triage→Medium
[09:00:51] <wikibugs>	 Grid-Engine-to-K8s-Migration: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905 (dcaro)
[09:01:13] <wikibugs>	 Toolforge (Toolforge iteration 05): [webservice] Add health probes for port 8080 - https://phabricator.wikimedia.org/T356907 (dcaro) duplicate→Resolved
[09:01:31] <wmcs-alerts>	 (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-21.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[09:04:28] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-k8s-worker-43 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:07:41] <wikibugs>	 cloud-services-team, wikitech.wikimedia.org: Upgrade cloudweb hosts to Bullseye - https://phabricator.wikimedia.org/T356966 (taavi)
[09:09:28] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-k8s-worker-43 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:09:59] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-10.tools.eqiad1.wikimedia.cloud to the cluster
[09:09:59] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[09:10:16] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-44
[09:10:54] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-44
[09:11:21] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-45
[09:11:57] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-45
[09:13:05] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[09:16:58] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-k8s-worker-43 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:21:03] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-11.tools.eqiad1.wikimedia.cloud to the cluster
[09:21:03] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[09:21:27] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-46
[09:21:58] <wmcs-alerts>	 (InstanceDown) resolved: (2) Project tools instance tools-k8s-worker-43 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:22:03] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-46
[09:22:12] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-47
[09:22:49] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-47
[09:23:38] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[09:29:28] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-k8s-worker-47 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:32:52] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-12.tools.eqiad1.wikimedia.cloud to the cluster
[09:32:52] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[09:33:05] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-48
[09:33:40] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-48
[09:33:45] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-49
[09:34:20] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-49
[09:34:28] <wmcs-alerts>	 (InstanceDown) resolved: (2) Project tools instance tools-k8s-worker-44 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:35:51] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[09:37:41] <jinxer-wm>	 (CloudVPSDesignateLeaks) firing: Detected 10 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:42:41] <jinxer-wm>	 (CloudVPSDesignateLeaks) firing: (2) Detected 10 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:46:10] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-13.tools.eqiad1.wikimedia.cloud to the cluster
[09:46:10] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[10:02:00] <wikibugs>	 Toolforge: toolforge k8s control plane freezing and other stability issues - https://phabricator.wikimedia.org/T333922 (aborrero)
[10:02:36] <wikibugs>	 Toolforge, cloud-services-team, User-aborrero: toolforge kubernetes: create roll-reboot cookbook - https://phabricator.wikimedia.org/T333379 (aborrero) Stalled→Resolved a:aborrero We have a working cookbook now `wmcs.toolforge.k8s.reboot`.
[10:05:48] <wikibugs>	 Toolforge, cloud-services-team, User-aborrero: toolforge k8s automation: introduce option to force reboot each node directly - https://phabricator.wikimedia.org/T356969 (aborrero)
[10:06:38] <wikibugs>	 (CR) David Caro: [C: -1] "I think there was some confusion on what the errors we were seeing were." [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 (owner: Andrew Bogott)
[10:07:15] <wikibugs>	 Toolforge, cloud-services-team, User-aborrero: toolforge k8s automation: introduce option to force reboot each node directly - https://phabricator.wikimedia.org/T356969 (aborrero)
[10:08:25] <wikibugs>	 Toolforge, cloud-services-team, User-aborrero: toolforge k8s automation: introduce option to reboot a node if the uptime is higher than XYZ - https://phabricator.wikimedia.org/T356970 (aborrero)
[10:11:06] <wikibugs>	 Toolforge (Toolforge iteration 05): [bump-version.sh,builds-api,envvars-api] bump the version in the openapi definition when bumping the package version - https://phabricator.wikimedia.org/T356972 (dcaro)
[10:11:15] <wikibugs>	 Toolforge (Toolforge iteration 05): [bump-version.sh,builds-api,envvars-api] bump the version in the openapi definition when bumping the package version - https://phabricator.wikimedia.org/T356972 (dcaro) p:Triage→Low
[10:12:16] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "What about this idea instead: https://phabricator.wikimedia.org/T356970" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 (owner: Andrew Bogott)
[10:13:01] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "this should allow us to operate with this cookbook in a more convenient way." [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 (owner: Andrew Bogott)
[10:17:02] <wikibugs>	 Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, User-Raymond_Ndibe: [tbs] Give a meaningful error message when a user exceeds their Harbor quota - https://phabricator.wikimedia.org/T351178 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud...
[10:18:05] <wikibugs>	 Toolforge Jobs framework, User-aborrero: Allow exporting jobs list in YAML format - https://phabricator.wikimedia.org/T320575 (aborrero)
[10:19:20] <wikibugs>	 Toolforge Jobs framework, User-aborrero: Allow exporting jobs list in YAML format - https://phabricator.wikimedia.org/T320575 (aborrero) p:Triage→Medium a:aborrero I always had this in the radar. I will be happy to introduce support for this.
[10:27:24] <wikibugs>	 Toolforge: [OpenAPI] FIgure out and document how to do non-backwards compatible changes - https://phabricator.wikimedia.org/T356974 (dcaro)
[10:30:06] <wikibugs>	 Toolforge: [OpenAPI] FIgure out and document how to do non-backwards compatible changes - https://phabricator.wikimedia.org/T356974 (dcaro) p:Triage→High
[10:35:05] <wikibugs>	 (CR) David Caro: "I think that's a good idea, though maybe instead of nodes with uptime > X, do something like "nodes that did not reboot since 'YYYY-mm-dd " [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 (owner: Andrew Bogott)
[10:39:37] <wikibugs>	 Cloud Services Proposals, cloud-services-team (FY2023/2024-Q3-Q4): Decision Request - Incident Response Process - https://phabricator.wikimedia.org/T348887 (fnegri) p:Triage→Medium
[10:49:23] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_node for host toolsbeta-test-k8s-worker-10
[10:49:54] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host toolsbeta-test-k8s-worker-10
[10:53:54] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the toolsbeta cluster
[11:01:01] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta Added a new k8s worker-nfs toolsbeta-test-k8s-worker-nfs-2.toolsbeta.eqiad1.wikimedia.cloud to the cluster
[11:01:01] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the toolsbeta cluster
[11:03:25] <wikibugs>	 Toolforge (Toolforge iteration 05): [toolforge-cd] discuss the possibility of removing tests from merge request ci/cd pipelines - https://phabricator.wikimedia.org/T353740 (dcaro) > I think I got what you mean @dcaro. > however I searched for the our last decision request on gitlab workflow and this was what...
[11:05:55] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_node for host toolsbeat-test-k8s-worker-6
[11:05:55] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=99) for host toolsbeat-test-k8s-worker-6
[11:06:01] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_node for host toolsbeta-test-k8s-worker-6
[11:06:34] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host toolsbeta-test-k8s-worker-6
[11:09:38] <wmcs-alerts>	 (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:22:36] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the toolsbeta cluster
[11:25:31] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM, though  https://phabricator.wikimedia.org/T356970 might make it not necessary, as in we might want to have some graceful and non-gra" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998556 (owner: Andrew Bogott)
[11:30:12] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta Added a new k8s worker-nfs toolsbeta-test-k8s-worker-nfs-3.toolsbeta.eqiad1.wikimedia.cloud to the cluster
[11:30:12] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the toolsbeta cluster
[11:36:22] <wikibugs>	 Cloud-VPS, Toolforge: define a prebaked way to temporarily disable access for a user or Tool - https://phabricator.wikimedia.org/T147242 (dcaro) >>! In T147242#9522537, @Andrew wrote: > The topic of this doesn't quite fit with the initial description.  I /think/ that T170355 is the same ask (and it's don...
[11:37:29] <wikibugs>	 Cloud-VPS, cloud-services-team: Improve cloudgw filter between VM instances and cloud-private - https://phabricator.wikimedia.org/T356986 (cmooney) p:Triage→Low
[11:42:28] <wmcs-alerts>	 (PuppetAgentNoResources) firing: No Puppet resources found on instance toolsbeta-test-k8s-worker-nfs-2 on project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[11:43:35] <wikibugs>	 Grid-Engine-to-K8s-Migration: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905 (dcaro)
[11:43:38] <wikibugs>	 Toolforge (Toolforge iteration 05): Support probes in kubernetes webservices - https://phabricator.wikimedia.org/T341919 (dcaro)
[11:56:28] <wikibugs>	 Cloud-VPS, cloud-services-team: Improve cloudgw filter between VM instances and cloud-private - https://phabricator.wikimedia.org/T356986 (cmooney) One thing I hadn't considered in the above was traffic to the 10.x WMF ranges from VMs.  That all get's NATed as is.  My instinct is it's possibly simplest t...
[12:01:31] <wmcs-alerts>	 (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-21.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[12:07:06] <wikibugs>	 Grid-Engine-to-K8s-Migration, User-dcaro: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905 (dcaro)
[12:12:28] <wmcs-alerts>	 (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance toolsbeta-test-k8s-worker-nfs-2 on project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:14:33] <wikibugs>	 Toolforge: Listeria bot sometimes gets stuck with 104 errors from Wikimedia APIs - https://phabricator.wikimedia.org/T356160 (dcaro) I'm still seeing >1k connections open from rustbot: ` root@tools-k8s-worker-101:~# lsof -p 10767 | grep TCP | wc    1009   10090   84756 `
[12:22:53] <wikibugs>	 Toolforge: Listeria bot sometimes gets stuck with 104 errors from Wikimedia APIs - https://phabricator.wikimedia.org/T356160 (dcaro) Maybe the connections are not being closed properly? or there's some leak in the pool counter or something?
[12:25:02] <wikibugs>	 Cloud-VPS, Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] [cinder] [ceph] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904 (dcaro) I think it's likely related though, it's quite possible that doing the snaptrim forces ceph to increase the IO on sectors of...
[12:58:52] <wikibugs>	 cloud-services-team, wikitech.wikimedia.org: Upgrade cloudweb hosts to Bullseye - https://phabricator.wikimedia.org/T356966 (taavi) a:taavi
[13:03:01] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.grid.cleanup_queue_errors
[13:03:04] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.grid.cleanup_queue_errors (exit_code=0)
[13:06:31] <wmcs-alerts>	 (ToolsGridQueueProblem) resolved: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-21.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[13:13:18] <wikibugs>	 cloud-services-team, wikitech.wikimedia.org: Upgrade cloudweb hosts to Bullseye - https://phabricator.wikimedia.org/T356966 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1002 for host cloudweb2002-dev.wikimedia.org with OS bullseye
[13:21:29] <wikibugs>	 Grid-Engine-to-K8s-Migration: Migrate mbh from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319883 (MBH) @komla please, give me one more month to transfer, as you say it's possible on email on Feb 6.
[13:47:58] <wikibugs>	 Cloud-VPS, cloud-services-team, Patch-For-Review, User-aborrero: Do not NAT traffic to cloud-private - https://phabricator.wikimedia.org/T356850 (taavi) Open→Resolved
[13:49:17] <wikibugs>	 cloud-services-team, wikitech.wikimedia.org: Upgrade cloudweb hosts to Bullseye - https://phabricator.wikimedia.org/T356966 (taavi)
[13:51:28] <wikibugs>	 Toolforge, Kubernetes: [Toolforge] Generic webservice not working on Kubernetes - https://phabricator.wikimedia.org/T277749 (fnegri) Open→Declined I'm boldly closing this as "Declined", the need for a "generic webservice image" is covered by the newer task {T355231}, which has a patch to create a...
[13:55:19] <wikibugs>	 Cloud-VPS, Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] [cinder] [ceph] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904 (fnegri) 24 hours later, the snapshot has not been deleted yet. So either the snaptrim is taking a really long time, or it didn't ev...
[14:09:38] <wmcs-alerts>	 (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:19:49] <wikibugs>	 cloud-services-team, wikitech.wikimedia.org, Patch-For-Review: Upgrade cloudweb hosts to Bullseye - https://phabricator.wikimedia.org/T356966 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1002 for host cloudweb2002-dev.wikimedia.org with OS bullseye completed: - clou...
[14:24:41] <wikibugs>	 Tool-refill: the PMC prefix for a value of pmc attribute should not be added - https://phabricator.wikimedia.org/T357000 (Maxim_Masiutin)
[14:39:01] <wikibugs>	 Cloud-VPS, Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] [cinder] [ceph] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904 (fnegri) I tried reproducing this on a small empty volume, and I could create and delete a snapshot without issues. So I think the i...
[15:12:28] <wmcs-alerts>	 (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance toolsbeta-test-k8s-worker-nfs-2 on project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:15:33] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.vps.refresh_puppet_certs on toolsbeta-test-k8s-worker-nfs-2.toolsbeta.eqiad1.wikimedia.cloud
[15:16:49] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=0) on toolsbeta-test-k8s-worker-nfs-2.toolsbeta.eqiad1.wikimedia.cloud
[15:18:20] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.vps.refresh_puppet_certs on toolsbeta-test-k8s-worker-nfs-3.toolsbeta.eqiad1.wikimedia.cloud
[15:19:34] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=0) on toolsbeta-test-k8s-worker-nfs-3.toolsbeta.eqiad1.wikimedia.cloud
[15:24:39] <wmcs-alerts>	 (ProbeDown) resolved: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:31:53] <wikibugs>	 (PS2) Majavah: Provide a standalone bookworm-web container [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/991595 (https://phabricator.wikimedia.org/T355231)
[15:31:59] <wikibugs>	 (CR) Majavah: [C: +2] Provide a standalone bookworm-web container [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/991595 (https://phabricator.wikimedia.org/T355231) (owner: Majavah)
[15:32:27] <wikibugs>	 (Merged) jenkins-bot: Provide a standalone bookworm-web container [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/991595 (https://phabricator.wikimedia.org/T355231) (owner: Majavah)
[15:32:44] <wikibugs>	 Toolforge, Composer, Patch-For-Review: Switch Toolforge installation of "composer" to use the Debian package - https://phabricator.wikimedia.org/T287900 (taavi) In progress→Resolved
[15:34:32] <wikibugs>	 Toolforge, cloud-services-team (Kanban), Patch-For-Review: Toolforge grid deployment/management automation - https://phabricator.wikimedia.org/T298948 (taavi)
[15:35:22] <wikibugs>	 Toolforge, cloud-services-team, Infrastructure-Foundations, SRE-tools: spicerack: introduce GridEngine controller - https://phabricator.wikimedia.org/T300032 (taavi) Stalled→Declined
[15:37:28] <wmcs-alerts>	 (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance toolsbeta-test-k8s-worker-nfs-2 on project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:38:11] <wikibugs>	 Toolforge (Toolforge iteration 05), cloud-services-team, Kubernetes, Patch-For-Review: Create Bookworm-based standalone webservice image - https://phabricator.wikimedia.org/T355231 (CodeReviewBot) taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/-/merge_requests/12  Ad...
[15:40:48] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:40:58] <wikibugs>	 cloud-services-team: PuppetFailure  Puppet failure on cloudweb2002-dev:9100 - https://phabricator.wikimedia.org/T357017 (phaultfinder)
[15:42:28] <wmcs-alerts>	 (PuppetAgentNoResources) resolved: (2) No Puppet resources found on instance toolsbeta-test-k8s-worker-nfs-2 on project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[16:01:18] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:01:58] <wikibugs>	 cloud-services-team: PuppetFailure  Puppet failure on cloudweb2002-dev:9100 - https://phabricator.wikimedia.org/T357017 (taavi) Open→Resolved a:taavi
[16:22:01] <wikibugs>	 (CR) Andrew Bogott: [C: +2] k8s.kubernetes.reboot: Wait at most one minute before doing a hard reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998556 (owner: Andrew Bogott)
[16:24:31] <wikibugs>	 (CR) Andrew Bogott: "I like Arturo's suggestion, but regardless of the re-run application I would really like this (or, really any cookbook that operates on a " [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 (owner: Andrew Bogott)
[16:27:06] <wikibugs>	 cloud-services-team, wikitech.wikimedia.org, Patch-For-Review: Upgrade cloudweb hosts to Bullseye - https://phabricator.wikimedia.org/T356966 (taavi)
[16:28:57] <wikibugs>	 (Merged) jenkins-bot: k8s.kubernetes.reboot: Wait at most one minute before doing a hard reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998556 (owner: Andrew Bogott)
[16:40:16] <wikibugs>	 Cloud-VPS, cloud-services-team, SRE: Restrict traffic from instances to private IPs on cloudgw level - https://phabricator.wikimedia.org/T350132 (cmooney) Open→Declined Closing this one, let's discuss on duplicate T356986 (sorry bout that!)
[16:44:10] <wikibugs>	 Cloud-VPS, cloud-services-team: Improve cloudgw filter between VM instances and cloud-private - https://phabricator.wikimedia.org/T356986 (cmooney)
[16:49:04] <wikibugs>	 Cloud-VPS, cloud-services-team, User-aborrero: Improve cloudgw filter between VM instances and cloud-private - https://phabricator.wikimedia.org/T356986 (aborrero)
[16:54:58] <wikibugs>	 cloud-services-team, wikitech.wikimedia.org, LDAP, SecTeam-Processed, Security: Problem with logging into my developer account. - https://phabricator.wikimedia.org/T356958 (sbassett) p:Triage→Low
[17:05:11] <wikibugs>	 PAWS: Upgrade Jupyterlab - https://phabricator.wikimedia.org/T357027 (rook)
[17:06:32] <wikibugs>	 PAWS: Upgrade Jupyterlab - https://phabricator.wikimedia.org/T357027 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/372
[17:06:44] <notefromgithub>	 vivian-rook opened https://github.com/toolforge/paws/pull/372
[17:15:44] <wikibugs>	 Cloud-VPS (Quota-requests): Floating IP request for project Openvas - https://phabricator.wikimedia.org/T356830 (KHurd-WMF) {F41816993}  {F41817016}  These are the current settings I have, I'll take any help.
[17:23:45] <wikibugs>	 Cloud-VPS, cloud-services-team, User-aborrero: Improve cloudgw filter between VM instances and cloud-private - https://phabricator.wikimedia.org/T356986 (cmooney)
[17:30:21] <wikibugs>	 Cloud-VPS, cloud-services-team, User-aborrero: Improve cloudgw filter between VM instances and cloud-private - https://phabricator.wikimedia.org/T356986 (cmooney)
[17:31:23] <wikibugs>	 (PS1) Btullis: Move #data-platform-sre announcements to a dedicated channel [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/999007 (https://phabricator.wikimedia.org/T352783)
[17:45:01] <wikibugs>	 Cloud-VPS (Quota-requests): Floating IP request for project Openvas - https://phabricator.wikimedia.org/T356830 (bd808) Per the diagram on https://hackertarget.com/openvas-tutorial-tips/ it looks like the web gui for OpenVAS runs on port 9392 by default. That port is active on openvasv1.openvas.eqiad1.wikime...
[17:45:34] <wikibugs>	 (CR) Bking: Move #data-platform-sre announcements to a dedicated channel (1 comment) [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/999007 (https://phabricator.wikimedia.org/T352783) (owner: Btullis)
[17:55:47] <wikibugs>	 (PS2) Btullis: Move #data-platform-sre announcements to a dedicated channel [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/999007 (https://phabricator.wikimedia.org/T352783)
[17:56:28] <wikibugs>	 (CR) Btullis: Move #data-platform-sre announcements to a dedicated channel (1 comment) [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/999007 (https://phabricator.wikimedia.org/T352783) (owner: Btullis)
[18:14:56] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[18:15:31] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (ERROR) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=97)
[18:15:59] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[18:17:14] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[18:23:30] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[18:26:18] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[18:32:10] <jinxer-wm>	 (CephSlowOps) firing: Ceph cluster in eqiad has 6 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[18:37:09] <jinxer-wm>	 (CephSlowOps) resolved: Ceph cluster in eqiad has 6 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[19:15:59] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[19:17:42] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[19:29:20] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[19:31:02] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[19:44:16] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[19:46:19] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[22:38:11] <wikibugs>	 Toolforge, User-bd808: Is wikibugs alive? - https://phabricator.wikimedia.org/T357076 (bd808) blah blah blah
[22:40:43] <wikibugs>	 Toolforge, User-bd808: Is wikibugs alive? - https://phabricator.wikimedia.org/T357076 (bd808) Open→Resolved a:bd808 yup
[22:51:07] <wikibugs>	 (PS1) Majavah: Bump stylelint to 15.10.1 [labs/libraryupgrader/config] - https://gerrit.wikimedia.org/r/999115