[00:07:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:07:32] Toolforge (Toolforge iteration 02), Patch-For-Review, User-dcaro: `toolforge build logs`: add follow options - https://phabricator.wikimedia.org/T339922 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/46 build logs: add follow option
[00:08:08] Toolforge (Toolforge iteration 02), Patch-For-Review, User-dcaro: `toolforge build logs`: add follow options - https://phabricator.wikimedia.org/T339922 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/12 [build.logs]: add --follow o...
[00:08:15] Toolforge (Toolforge iteration 02), Patch-For-Review: [tbs.build.logs] Show a more user-friendly error message when logs are not ready - https://phabricator.wikimedia.org/T341059 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/12 [build...
[00:08:18] Toolforge (Toolforge iteration 02), Patch-For-Review, User-Raymond_Ndibe: toolforge build start: default to tailing the build as it progresses with the option of -d/--detached - https://phabricator.wikimedia.org/T340079 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/too...
[00:09:34] (SystemdUnitDown) resolved: The systemd unit purge_vm_backup.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:12:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:23:44] (InstanceDown) resolved: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:24:44] Toolforge (Toolforge iteration 02), Patch-For-Review, User-dcaro: `toolforge build logs`: add follow options - https://phabricator.wikimedia.org/T339922 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/142 builds-api: bump to 0...
[00:36:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:54:49] Toolforge: Enhance toolforge templatetransclusioncheck to optionally exclude noinclude'd links - https://phabricator.wikimedia.org/T352730 (Mathglot)
[01:01:00] Toolforge: Enhance toolforge templatetransclusioncheck to optionally exclude noinclude'd links - https://phabricator.wikimedia.org/T352730 (Mathglot)
[01:02:11] Toolforge: Enhance toolforge templatetransclusioncheck to optionally exclude noinclude'd links - https://phabricator.wikimedia.org/T352730 (Mathglot)
[01:05:16] Toolforge: Enhance toolforge templatetransclusioncheck to optionally exclude noinclude'd links - https://phabricator.wikimedia.org/T352730 (Mathglot)
[01:10:35] Toolforge: Enhance toolforge templatetransclusioncheck to optionally exclude noinclude'd links - https://phabricator.wikimedia.org/T352730 (Mathglot)
[01:12:36] Toolforge (Toolforge iteration 02), Patch-For-Review, User-dcaro: `toolforge build logs`: add follow options - https://phabricator.wikimedia.org/T339922 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/142 builds-api: bump to 0...
[01:13:30] Toolforge Build Service (Beta release), User-Raymond_Ndibe, User-dcaro: Add a way to wait for a Toolforge build to finish - https://phabricator.wikimedia.org/T337043 (Raymond_Ndibe)
[01:13:32] Toolforge (Toolforge iteration 02), Patch-For-Review, User-dcaro: `toolforge build logs`: add follow options - https://phabricator.wikimedia.org/T339922 (Raymond_Ndibe) In progress→Resolved
[01:14:12] Toolforge (Toolforge iteration 02), User-Raymond_Ndibe: toolforge build start: default to tailing the build as it progresses with the option of -d/--detached - https://phabricator.wikimedia.org/T340079 (Raymond_Ndibe) Stalled→Resolved
[01:14:32] Toolforge (Toolforge iteration 02): [envvars-cli] use toolforge-weld for error handling - https://phabricator.wikimedia.org/T351459 (Raymond_Ndibe) In progress→Resolved
[02:11:53] Tools: Enhance toolforge templatetransclusioncheck to optionally exclude noinclude'd links - https://phabricator.wikimedia.org/T352730 (taavi)
[02:48:25] Grid-Engine-to-K8s-Migration, User-Huji: Migrate huji from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319800 (Huji) @nskaggs as of a few minutes ago, I have emptied my crontab which means my grid-based jobs will not be running anymore. All the jobs have been migrate...
[02:55:19] Grid-Engine-to-K8s-Migration: Migrate convert from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319646 (bd808) I have pinged @Rillke on commons to see if they would like my help in moving to k8s: https://commons.wikimedia.org/wiki/User_talk:Rillke/Discuss#Moving_https://c...
[03:04:37] (CephSlowOps) firing: Ceph cluster in eqiad has 9 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[03:04:53] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T352570 (phaultfinder)
[03:08:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[03:09:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 6 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[03:10:07] (CephSlowOps) firing: Ceph cluster in eqiad has 16 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[03:10:12] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T352570 (phaultfinder)
[03:13:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[03:14:52] (CephSlowOps) resolved: Ceph cluster in eqiad has 16 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[03:36:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:02:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[04:22:03] (PuppetAgentFailure) firing: (2) Puppet agent failure detected on instance tools-sgeweblight-10-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[04:32:03] (PuppetAgentFailure) firing: (3) Puppet agent failure detected on instance tools-sgeweblight-10-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[04:39:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[04:51:29] Cloud-VPS (Quota-requests): Please delete meet and chat VPS projects - https://phabricator.wikimedia.org/T352727 (Legoktm) Is there any data in chat that needs to be archived/saved somewhere?
[04:57:03] (PuppetAgentFailure) resolved: (3) Puppet agent failure detected on instance tools-sgeweblight-10-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[04:58:03] (InstanceDown) firing: (3) Project tools instance tools-sgeweblight-10-17 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:02:36] Grid-Engine-to-K8s-Migration: Migrate potd from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319974 (Legoktm) a:zhuyifei1999→Legoktm
[05:03:03] (InstanceDown) resolved: (3) Project tools instance tools-sgeweblight-10-17 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:09:17] Grid-Engine-to-K8s-Migration: Migrate legobot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319854 (Legoktm) I will probably need past the December 14th initial deadline to migrate this tool since most of the code is pretty legacy.
[05:16:33] Grid-Engine-to-K8s-Migration: Migrate extreg-wos from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319737 (Legoktm) Open→Resolved I've stopped the updating job and marked the tool as no longer updating.
[05:16:41] Grid-Engine-to-K8s-Migration: Migrate extreg-wos from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319737 (Legoktm) a:Reedy→Legoktm
[05:24:31] Grid-Engine-to-K8s-Migration: Migrate potd from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319974 (Legoktm) Mostly to remind myself, the source code is at https://github.com/toollabs/daily-image-l but there are uncommitted changes on the tool itself. It's still in Pytho...
[06:27:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[06:32:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[06:36:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:04:27] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[08:04:42] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[08:04:56] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[08:07:06] superset.wmcloud.org, Wikimedia-production-error: superset.wmcloud.org returns 500 error - https://phabricator.wikimedia.org/T352738 (MarioGom)
[08:20:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-25 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[08:40:36] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), Goal, Patch-For-Review: Support 'unmanaged' projects in cloud-vps - https://phabricator.wikimedia.org/T326818 (dcaro) Can we come up with a "support level" for these VMs and make sure it's clearly stated somewhere? Given that they are completel...
[09:05:03] (PuppetAgentFailure) resolved: Puppet agent failure detected on instance tools-sgeweblight-10-25 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[09:07:23] !log toolsbeta dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console
[09:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[09:10:35] !log toolsbeta dcaro@urcuchillay END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[09:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[09:11:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:12:20] Toolforge (Toolforge iteration 02): [tbs.build.logs] Show a more user-friendly error message when logs are not ready - https://phabricator.wikimedia.org/T341059 (dcaro) Stalled→In progress
[09:19:08] Toolforge (Toolforge iteration 02): Add command/arguments to allow a script to wait on build completion/failure - https://phabricator.wikimedia.org/T352561 (dcaro) Maybe we can use part of {T339922}? Just note that the main case will be replaced by {T341065}.
[09:19:23] Toolforge (Toolforge iteration 02), Patch-For-Review: Add `toolforge build quota` command - https://phabricator.wikimedia.org/T341068 (dcaro)
[09:19:36] Toolforge (Toolforge iteration 02): Give builds-api access to system admin credentials - https://phabricator.wikimedia.org/T352007 (dcaro) Invalid→Resolved
[09:40:25] Grid-Engine-to-K8s-Migration: Migrate commons-android-app from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319638 (Kaartic) a:Madhurgupta10→whym
[09:48:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[09:53:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[10:00:17] Toolforge (Toolforge iteration 02): [tbs][builder] Explore adding support for third-party buildpacks - https://phabricator.wikimedia.org/T352389 (dcaro) I think that is only to create the builder separatedly, to do that on the fly the spec uses the `project.toml` file: https://github.com/buildpacks/spec/blob...
[10:12:22] Toolforge (Toolforge iteration 02): Add command/arguments to allow a script to wait on build completion/failure - https://phabricator.wikimedia.org/T352561 (Slst2020)
[10:12:54] Toolforge Build Service (Beta release), User-Raymond_Ndibe, User-dcaro: Add a way to wait for a Toolforge build to finish - https://phabricator.wikimedia.org/T337043 (Slst2020)
[10:13:29] Toolforge Build Service (Beta release): [buildservice] Cache .m2 folder (local maven repository) between builds - https://phabricator.wikimedia.org/T350307 (dcaro) There is "some" support for caching in tekton too (https://github.com/tektoncd/catalog/blob/main/task/buildpacks-phases/0.2/buildpacks-phases.yam...
[10:15:58] Toolforge (Toolforge iteration 02), Toolforge Build Service (Beta release), User-Raymond_Ndibe, User-dcaro: Add a way to wait for a Toolforge build to finish - https://phabricator.wikimedia.org/T337043 (Slst2020)
[10:17:29] Toolforge (Toolforge iteration 02): [tbs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092 (dcaro) > Harbor does not come with any default quotas, but we have the default project quota set to 1GB. This was likely done manually, as maintain-harbor does not set any storage lim...
[10:19:41] Toolforge (Toolforge iteration 02): [tbs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092 (dcaro) For the docs this might be {T329176}
[10:20:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-25 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[10:23:03] Toolforge (Toolforge iteration 02): [tbs] Give a meaningful error message when a user exceeds their Harbor quota - https://phabricator.wikimedia.org/T351178 (dcaro) This should include some instructions on how to proceed, if {T341067} is done first, then running that, and if not enough/does not work, create...
[10:25:00] Toolforge (Toolforge iteration 02), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, User-dcaro: [tbs.maintain-harbor] Document current setup and admin procedures - https://phabricator.wikimedia.org/T329176 (dcaro)
[10:29:08] Toolforge (Toolforge iteration 02), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, User-dcaro: [tbs.maintain-harbor] Document current setup and admin procedures - https://phabricator.wikimedia.org/T329176 (dcaro)
[10:29:24] Toolforge (Toolforge iteration 02), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, User-dcaro: [tbs.maintain-harbor] Document current setup and admin procedures - https://phabricator.wikimedia.org/T329176 (dcaro)
[10:30:05] Toolforge (Toolforge iteration 02): [tbs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092 (dcaro) > For the docs on the current quota setup this might be T329176: [tbs.maintain-harbor] Document current setup and admin procedures Nope, that's just maintain-harbor, the harbo...
[10:59:32] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [tools-sgeweblight-10-25] puppet throws segmentation fault - https://phabricator.wikimedia.org/T352753 (dcaro) p:Triage→High
[11:00:27] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [tools-sgeweblight-10-25] puppet throws segmentation fault - https://phabricator.wikimedia.org/T352753 (dcaro)
[11:01:02] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [tools-sgeweblight-10-25] puppet throws segmentation fault - https://phabricator.wikimedia.org/T352753 (dcaro) A second run fails with failed to allocate memory (common issue): ` r...
[11:05:03] (PuppetAgentFailure) resolved: Puppet agent failure detected on instance tools-sgeweblight-10-25 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[11:07:03] (InstanceDown) firing: Project tools instance tools-sgeweblight-10-25 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:15:16] !log tools dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console
[11:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:15:23] !log tools dcaro@urcuchillay END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255)
[11:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:20:01] !log tools dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console
[11:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:20:10] !log tools dcaro@urcuchillay END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255)
[11:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:20:16] !log tools dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console
[11:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:20:21] !log tools dcaro@urcuchillay END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255)
[11:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:20:29] !log tools dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console
[11:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:20:49] !log tools dcaro@urcuchillay END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[11:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:21:31] !log tools dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console
[11:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:25:00] !log tools dcaro@urcuchillay END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[11:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:27:03] (InstanceDown) resolved: Project tools instance tools-sgeweblight-10-25 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:32:18] Cloud-VPS (Quota-requests): Please delete meet and chat VPS projects - https://phabricator.wikimedia.org/T352727 (Ladsgroup) >>! In T352727#9381649, @Legoktm wrote: > Is there any data in chat that needs to be archived/saved somewhere? I don't think so, all private data must be deleted after three months.
[12:30:38] superset.wmcloud.org, Wikimedia-production-error: superset.wmcloud.org returns 500 error - https://phabricator.wikimedia.org/T352738 (rook) The trove database has disappeared. "Disappeared" doesn't sound quite right, but it was there, and now it is not.
[13:14:56] Toolforge (Toolforge iteration 02): Add command/arguments to allow a script to wait on build completion/failure - https://phabricator.wikimedia.org/T352561 (dcaro) duplicate→Resolved
[13:20:00] Toolforge (Toolforge iteration 02): [tbs][builds-api] Refactor `internal/builds.go` - https://phabricator.wikimedia.org/T352762 (Slst2020)
[13:25:07] Toolforge (Toolforge iteration 02): [tbs] cleanup robot account related code - https://phabricator.wikimedia.org/T352763 (Raymond_Ndibe)
[13:33:16] Toolforge (Toolforge iteration 02): [tbs] Add dashboards with the new statistics - https://phabricator.wikimedia.org/T352764 (dcaro)
[13:37:58] vivian-rook opened https://github.com/toolforge/superset-deploy/pull/11
[13:38:25] superset.wmcloud.org: sql backup to rotate after successful backup - https://phabricator.wikimedia.org/T352766 (rook)
[13:40:26] vivian-rook closed https://github.com/toolforge/superset-deploy/pull/11
[13:42:35] superset.wmcloud.org: superset.wmcloud.org returns 500 error - https://phabricator.wikimedia.org/T352738 (rook) Open→Resolved a:rook
[13:43:15] Grid-Engine-to-K8s-Migration: Migrate persondata from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319962 (Wurgl) Q: Currently I am using jlocal for a watchdog process which is checking one relatively critical job and the website for the tool. Is there is replacement?
[13:45:36] Grid-Engine-to-K8s-Migration: Migrate wikihistory from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320157 (Wurgl) Wikihistory uses php *and* mono in the same script. Is there a cookbook (for dummies) explaining step by step what to do?
[13:52:10] Grid-Engine-to-K8s-Migration: Migrate wikihistory from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320157 (Cyberpower678) I unfortunately, forgot about this tool entirely. Would somebody be willing to take this tool over?
[13:52:44] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [tools-sgeweblight-10-25] puppet throws segmentation fault - https://phabricator.wikimedia.org/T352753 (dcaro) Open→Resolved This is gone after rebooting, so I'll close as...
[14:06:08] Grid-Engine-to-K8s-Migration: Migrate wikihistory from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320157 (Wurgl) @cyberpower678 no problem. No additional maintainer is needed, at least for now.
[14:22:14] Toolforge (Toolforge iteration 02): [tbs, builds-api] change local environment to use admin account - https://phabricator.wikimedia.org/T352770 (dcaro)
[14:50:20] Toolforge (Toolforge iteration 02): [builds-builder] Investigate how to enable mono/dotnet/c# and implement the best one to unblock us to migrate tools - https://phabricator.wikimedia.org/T352774 (dcaro)
[14:52:42] Cloud-VPS: cannot create/update a variety of DNS records - https://phabricator.wikimedia.org/T352713 (jsn.sherman)
[14:52:57] Toolforge (Toolforge iteration 02): [builds-builder] Investigate how to enable mono/dotnet/c# and implement the best one to unblock us to migrate tools - https://phabricator.wikimedia.org/T352774 (dcaro)
[14:58:14] Grid-Engine-to-K8s-Migration: Migrate rotbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320015 (Steinsplitter) Open→In progress a:zhuyifei1999→Steinsplitter
[14:59:19] Grid-Engine-to-K8s-Migration: Migrate steinsplitter from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320059 (Steinsplitter) Open→In progress a:zhuyifei1999→Steinsplitter
[15:02:04] Grid-Engine-to-K8s-Migration: Migrate steinsplitter2 from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320060 (Steinsplitter) Open→Resolved
[15:03:42] Cloud-VPS: Horizon: cannot create/update a variety of DNS records - https://phabricator.wikimedia.org/T352713 (Andrew)
[15:03:57] Grid-Engine-to-K8s-Migration: Migrate bothasava from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319601 (Kotz) Stalled→Resolved a:Uziel302→Kotz
[15:04:13] (PS1) Samtar: channels: Add commtech-kanban to commtech channel [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/980412
[15:06:02] Tools: Grid engine job 3652373 stucking in dr state - https://phabricator.wikimedia.org/T352777 (Steinsplitter)
[15:08:03] Grid-Engine-to-K8s-Migration: Migrate steinsplitter from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320059 (JJMC89)
[15:08:05] Tools: Grid engine job 3652373 stucking in dr state - https://phabricator.wikimedia.org/T352777 (JJMC89)
[15:11:16] (CR) Samtar: [C: +2] "self+2, should tidy up that regex one day." [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/980412 (owner: Samtar)
[15:11:51] (Merged) jenkins-bot: channels: Add commtech-kanban to commtech channel [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/980412 (owner: Samtar)
[15:16:07] Cloud-VPS: Horizon: cannot create/update a variety of DNS records - https://phabricator.wikimedia.org/T352713 (Andrew) a:Andrew
[15:17:50] Grid-Engine-to-K8s-Migration: Migrate rotbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320015 (Steinsplitter) sh: 1: /usr/bin/convert: not found ^^ seems missing on the kubernetes image. This is blocking the migration.
[15:47:27] Grid-Engine-to-K8s-Migration: Migrate steinsplitter from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320059 (TheresNoTime)
[15:47:30] Tools: Grid engine job 3652373 stucking in dr state - https://phabricator.wikimedia.org/T352777 (TheresNoTime) Open→Resolved a:TheresNoTime //it's an ex-job, it has ceased to be.//
[15:51:45] Cloud-VPS: Horizon: cannot create/update a variety of DNS records - https://phabricator.wikimedia.org/T352713 (Andrew) This is either a quoting/parsing issue or a permissions issue. If I insert the record with quotes like "v=spf1 a:185.15.56.1 ~all" it seems to work for me. I don't yet know if that's an adeq...
[16:19:49] Cloud-VPS: Horizon: cannot create/update a variety of DNS records - https://phabricator.wikimedia.org/T352713 (jsn.sherman) >>! In T352713#9383604, @Andrew wrote: > This is either a quoting/parsing issue or a permissions issue. If I insert the record with quotes like "v=spf1 a:185.15.56.1 ~all" it seems to w...
[16:20:32] Cloud-VPS: Horizon: cannot create/update an SPF DNS record - https://phabricator.wikimedia.org/T352713 (jsn.sherman) Open→Resolved
[16:22:09] Grid-Engine-to-K8s-Migration: Migrate rotbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320015 (Steinsplitter) In progress→Resolved This has now been fixed. Migration done.
[16:43:17] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), Goal, Patch-For-Review: Support 'unmanaged' projects in cloud-vps - https://phabricator.wikimedia.org/T326818 (Andrew) >>! In T326818#9381905, @dcaro wrote: > Can we come up with a "support level" for these VMs and make sure it's clearly stated...
[16:52:50] Tools: Re-create kubernetes configuration files for tools.steinsplitter - https://phabricator.wikimedia.org/T352792 (Steinsplitter)
[16:53:13] Tools: Re-create kubernetes configuration files for tools.steinsplitter - https://phabricator.wikimedia.org/T352792 (Steinsplitter)
[16:53:15] Grid-Engine-to-K8s-Migration: Migrate steinsplitter from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320059 (Steinsplitter)
[16:55:36] Toolforge: Re-create kubernetes configuration files for tools.steinsplitter - https://phabricator.wikimedia.org/T352792 (taavi) a:taavi
[16:57:49] Grid-Engine-to-K8s-Migration: Migrate steinsplitter from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320059 (taavi)
[16:58:28] Toolforge: Re-create kubernetes configuration files for tools.steinsplitter - https://phabricator.wikimedia.org/T352792 (taavi) Open→Resolved `lang=shell-session taavi@tools-sgebastion-11:~ $ kubectl sudo delete cm -n tool-steinsplitter maintain-kubeusers configmap "maintain-kubeusers" deleted ` ` st...
[16:59:43] Grid-Engine-to-K8s-Migration: Migrate crocodylia from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319652 (Steinsplitter) Open→Resolved
[17:00:47] Grid-Engine-to-K8s-Migration: Migrate db from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319663 (Steinsplitter) Open→Resolved
[17:10:19] Toolforge: Re-create kubernetes configuration files for tools.steinsplitter - https://phabricator.wikimedia.org/T352792 (Steinsplitter) >>! In T352792#9383947, @taavi wrote: > `lang=shell-session > taavi@tools-sgebastion-11:~ $ kubectl sudo delete cm -n tool-steinsplitter maintain-kubeusers > configmap "main...
[17:10:29] 10PAWS: New upstream release 8.6 for Pywikibot - https://phabricator.wikimedia.org/T352794 (10Xqt) [17:11:07] 10Toolforge (Toolforge iteration 02), 10Patch-For-Review, 10User-Raymond_Ndibe: [apis] nginx fails to reload on config change - https://phabricator.wikimedia.org/T350928 (10CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/56 [builds-api] forc... [17:12:55] 10Toolforge (Toolforge iteration 02), 10Patch-For-Review, 10User-Raymond_Ndibe: [apis] nginx fails to reload on config change - https://phabricator.wikimedia.org/T350928 (10Raymond_Ndibe) 05In progress→03Resolved [17:13:43] 10Toolforge (Toolforge iteration 02), 10Patch-For-Review, 10User-Raymond_Ndibe: [apis] nginx fails to reload on config change - https://phabricator.wikimedia.org/T350928 (10CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/19 [envvars-api] fo... [17:34:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [17:37:35] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.grid.cleanup_queue_errors [17:37:37] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.grid.cleanup_queue_errors (exit_code=0) [17:38:01] 10PAWS: New upstream release 8.6 for Pywikibot - https://phabricator.wikimedia.org/T352794 (10github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/356 [17:38:06] vivian-rook opened https://github.com/toolforge/paws/pull/356 [17:59:31] 10Tools, 10WMDE-TechWish-Maintenance, 10WMDE-TechWish-Sprint-2023-11-22: Check technischewuensche tool code and publish in a public repo - https://phabricator.wikimedia.org/T350352 (10Aklapper) Once you manage 
to tick off "Back up the current state" you'll unblock {T330797} :D [18:24:52] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T352595 (taavi) Open→Resolved a:taavi This hasn't happened again, so closing. [18:25:55] Toolforge: Re-create kubernetes configuration files for tools.steinsplitter - https://phabricator.wikimedia.org/T352792 (RhinosF1) Resolved→Open a:taavi→None [18:26:00] Grid-Engine-to-K8s-Migration: Migrate steinsplitter from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320059 (RhinosF1) [18:26:46] Toolforge (Software install/update): Please install hugin-tools and pillow again - https://phabricator.wikimedia.org/T347446 (bd808) >>! In T347446#9381358, @tstarling wrote: > Apparently the real solution is to use a buildpack. But I'm not sure if it's worth doing since we're considering productionization o... [18:27:48] Toolforge (Software install/update): Please install hugin-tools and pillow again - https://phabricator.wikimedia.org/T347446 (bd808) [18:27:50] Grid-Engine-to-K8s-Migration: Migrate panoviewer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319953 (bd808) [18:38:13] Grid-Engine-to-K8s-Migration: Migrate steinsplitter from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320059 (taavi) [18:38:51] Toolforge: Re-create kubernetes configuration files for tools.steinsplitter - https://phabricator.wikimedia.org/T352792 (taavi) Open→Resolved a:taavi Looks like there was some bad state left from the tool running on the grid that I've now removed.
[18:54:00] Grid-Engine-to-K8s-Migration: Migrate steinsplitter from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320059 (Steinsplitter) In progress→Resolved [19:20:53] Toolforge, cloud-services-team: tools-sgeweblight / drives very full - https://phabricator.wikimedia.org/T352802 (Andrew) [19:21:45] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [19:24:56] (ToolsGridQueueProblem) resolved: Grid queue webgrid-lighttpd@tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [19:26:45] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [19:28:37] (CephSlowOps) firing: Ceph cluster in eqiad has 20 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [19:28:43] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T352570 (phaultfinder) [19:31:50] (PawsJupyterHubDown) firing: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin -
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [19:32:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [19:35:45] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [19:38:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 7 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [19:38:46] PAWS: PAWS down - https://phabricator.wikimedia.org/T352806 (rook) [19:40:45] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [19:42:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [19:54:28] Grid-Engine-to-K8s-Migration: Migrate germancontributioncounts from
Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319770 (Aka) Open→Resolved [19:54:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [20:16:50] (PawsJupyterHubDown) resolved: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [20:20:21] PAWS: PAWS down - https://phabricator.wikimedia.org/T352806 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/357 [20:20:27] vivian-rook opened https://github.com/toolforge/paws/pull/357 [20:21:45] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [20:23:07] PAWS: PAWS down - https://phabricator.wikimedia.org/T352806 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/357 [20:23:15] vivian-rook closed https://github.com/toolforge/paws/pull/357 [20:26:45] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [20:27:11] PAWS: PAWS down - https://phabricator.wikimedia.org/T352806 (rook) Appears to have been the result of a failed auth key.
[20:27:18] PAWS: PAWS down - https://phabricator.wikimedia.org/T352806 (rook) Open→Resolved [21:07:41] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.grid.cleanup_queue_errors [21:07:43] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.grid.cleanup_queue_errors (exit_code=0) [21:07:59] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.grid.cleanup_queue_errors [21:08:01] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.grid.cleanup_queue_errors (exit_code=0) [21:09:54] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.grid.cleanup_queue_errors [21:09:56] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.grid.cleanup_queue_errors (exit_code=0) [21:13:41] Grid-Engine-to-K8s-Migration: Migrate noclaims from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319927 (Multichill) Don't sabotage my grid jobs. That wouldn't be appreciated. [21:13:48] Grid-Engine-to-K8s-Migration: Migrate multichill from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319912 (Multichill) Don't sabotage my grid jobs. That wouldn't be appreciated. [21:13:55] Grid-Engine-to-K8s-Migration: Migrate geograph from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319765 (Multichill) Don't sabotage my grid jobs. That wouldn't be appreciated. [21:14:10] Grid-Engine-to-K8s-Migration: Migrate family from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319739 (Multichill) Don't sabotage the grid jobs. That wouldn't be appreciated. [21:14:39] Grid-Engine-to-K8s-Migration: Migrate heritage from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319787 (Multichill) Don't sabotage the grid jobs. That wouldn't be appreciated.
[21:14:56] (ToolsGridQueueProblem) resolved: Grid queue webgrid-lighttpd@tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [22:27:26] Toolforge, cloud-services-team: tools-sgeweblight / drives very full - https://phabricator.wikimedia.org/T352802 (taavi) I'm fairly sure this is just because all of the software installed on the grid nodes. [22:38:20] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [22:43:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [23:39:03] Toolforge (Toolforge iteration 02), Patch-For-Review: [tools,harbor] Cleanup old production images - https://phabricator.wikimedia.org/T348538 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/18 [maintain-harbor]: cleanup old produc...