[01:01:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[01:02:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[03:03:07] Toolforge (Toolforge iteration 05): [toolforge-cd] discuss the possibility of removing tests from merge request ci/cd pipelines - https://phabricator.wikimedia.org/T353740 (Raymond_Ndibe) >>! In T353740#9524318, @dcaro wrote: >> I think I got what you mean @dcaro. >> however I searched for the our last decis...
[03:04:13] Grid-Engine-to-K8s-Migration: [apt-buildpack] Incorrect parsing of alternative dependencies - https://phabricator.wikimedia.org/T357085 (tstarling)
[03:04:53] Toolforge Build Service: [apt-buildpack] Incorrect parsing of alternative dependencies - https://phabricator.wikimedia.org/T357085 (Pppery)
[04:02:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[04:06:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[04:13:47] Toolforge Jobs framework: Toolforge Cronjobs stopped running - https://phabricator.wikimedia.org/T357088 (0xDeadbeef)
[04:50:28] Tool-Pageviews: Massviews is creating URLs which cannot be used - https://phabricator.wikimedia.org/T357087 (Bugreporter)
[06:11:35] Toolforge Build Service: [apt-buildpack] Can't specify a package by URL or add a repo - https://phabricator.wikimedia.org/T357091 (tstarling)
[06:12:51] Tool-Pageviews: Massviews is creating URLs which cannot be used - https://phabricator.wikimedia.org/T357087 (John_Cummings) OK, I was kindly helped by @PrimeHunter The workaround is to use a slightly different query which gives the same results and doesn't break the URL insource:fao insource:/(fao.org|pu...
[07:06:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[07:07:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[08:11:20] Tool-extjsonuploader: Handle network failures better in extjsonuploader popularity script - https://phabricator.wikimedia.org/T357094 (Tgr)
[08:14:38] Tool-extjsonuploader: extjsonuploader complains about "Duplicate extension name 'SomeExtension' detected in these files" - https://phabricator.wikimedia.org/T357095 (Tgr)
[09:02:28] (InstanceDown) firing: Project tools instance tools-k8s-worker-50 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:07:28] (InstanceDown) resolved: Project tools instance tools-k8s-worker-50 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:21:10] cloud-services-team: [tools.meta] can't delete file inside cache/wikimedia-wikis.dat - https://phabricator.wikimedia.org/T357098 (dcaro)
[09:21:23] cloud-services-team: [tools.meta] can't delete file inside cache/wikimedia-wikis.dat - https://phabricator.wikimedia.org/T357098 (dcaro) Open→In progress a:dcaro
[09:21:31] cloud-services-team: [tools.meta] can't delete file inside cache/wikimedia-wikis.dat - https://phabricator.wikimedia.org/T357098 (dcaro) p:Triage→Medium
[09:21:47] cloud-services-team: [tools.meta] can't delete file inside cache/wikimedia-wikis.dat - https://phabricator.wikimedia.org/T357098 (dcaro) It seems similar to {T355022}
[09:23:26] Toolforge Build Service: [apt-buildpack] Can't specify a package by URL or add a repo - https://phabricator.wikimedia.org/T357091 (dcaro) Yep, the readme is from the upstream buildpack, that we had to heavily modify lately, so it does not match it anymore. I have to update it, we decided for startes to not a...
[09:26:06] Toolforge Build Service, Patch-For-Review: [apt-buildpack] Incorrect parsing of alternative dependencies - https://phabricator.wikimedia.org/T357085 (CodeReviewBot) tacsipacsi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/buildpacks/apt-buildpack/-/merge_requests/2 Use first dependency if th...
[09:27:22] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[09:32:22] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[09:35:21] Toolforge (Toolforge iteration 05): [builds-api,envvars-api] bump the version in the openapi definition when bumping the package version - https://phabricator.wikimedia.org/T356972 (dcaro)
[09:36:33] Toolforge, cloud-services-team: There are some files that I cannot view, delete, or do anything to - https://phabricator.wikimedia.org/T355022 (dcaro) @taavi did you do anything else than just `rm` of the file? I got a similar issue but I can't rm from the nfs server either.
[09:36:37] cloud-services-team: [tools.meta] can't delete file inside cache/wikimedia-wikis.dat - https://phabricator.wikimedia.org/T357098 (dcaro)
[09:39:14] cloud-services-team: [tools.meta] can't delete file inside cache/wikimedia-wikis.dat - https://phabricator.wikimedia.org/T357098 (dcaro) The rm -rf process on the NFS server seems to get stuck in the unilnk syscall: ` root@tools-nfs-2:/srv/tools/todelete# strace -p 3675555 strace: Process 3675555 attached un...
[09:46:03] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-50
[09:46:41] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-50
[09:46:52] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-51
[09:47:28] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-51
[09:47:42] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[09:51:49] Toolforge Jobs framework: Toolforge Cronjobs stopped running - https://phabricator.wikimedia.org/T357088 (taavi) ` tools.dbreps@tools-sgebastion-11:~$ toolforge jobs list Job name: Job type: Status: ----------- ------------------- ---------------------------------------- rusty schedule...
[09:56:24] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-14.tools.eqiad1.wikimedia.cloud to the cluster
[09:56:24] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[10:04:31] Toolforge Build Service, Patch-For-Review: [apt-buildpack] Incorrect parsing of alternative dependencies - https://phabricator.wikimedia.org/T357085 (CodeReviewBot) dcaro merged https://gitlab.wikimedia.org/repos/cloud/toolforge/buildpacks/apt-buildpack/-/merge_requests/2 Use first dependency if there a...
[10:06:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[10:07:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[10:28:51] Grid-Engine-to-K8s-Migration: Migrate zoomviewer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320210 (dcaro)
[10:29:13] Toolforge Build Service, Patch-For-Review: [apt-buildpack] Incorrect parsing of alternative dependencies - https://phabricator.wikimedia.org/T357085 (dcaro) Open→Resolved a:dcaro This should be fixed, now only the first of the dependency alternatives is downloaded, this allows `iipimage-serve...
[10:30:09] Toolforge (Toolforge iteration 05), cloud-services-team, Kubernetes, Patch-For-Review: Create Bookworm-based standalone webservice image - https://phabricator.wikimedia.org/T355231 (CodeReviewBot) project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/to...
[10:31:02] Toolforge Build Service: [apt-buildpack] Can't specify a package by URL or add a repo - https://phabricator.wikimedia.org/T357091 (dcaro) I have updated the docs, and the buildpack to explicitly disallow external packages and repos. If there's any specific need for any external package, please open a task r...
[10:31:11] Grid-Engine-to-K8s-Migration: Migrate zoomviewer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320210 (dcaro)
[10:31:19] Toolforge Build Service: [apt-buildpack] Can't specify a package by URL or add a repo - https://phabricator.wikimedia.org/T357091 (dcaro) Open→Invalid
[10:31:29] Toolforge (Toolforge iteration 05), cloud-services-team, Kubernetes: Create Bookworm-based standalone webservice image - https://phabricator.wikimedia.org/T355231 (CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/192 image-config: bump...
[10:31:58] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component image-config
[10:32:10] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component image-config
[10:32:16] Toolforge Jobs framework: jobs-api: Periodically refresh image-config data - https://phabricator.wikimedia.org/T357112 (taavi)
[10:34:43] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component image-config
[10:34:56] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component image-config
[11:01:38] (ProbeDown) firing: Service toolsbeta-test-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolsbeta-test-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:06:38] (ProbeDown) resolved: Service toolsbeta-test-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolsbeta-test-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:09:34] (DiskSpace) firing: Disk space cloudbackup1004:9100:/ 5.824% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:18:58] Toolforge (Toolforge iteration 05): [toolforge-cd] discuss the possibility of removing tests from merge request ci/cd pipelines - https://phabricator.wikimedia.org/T353740 (dcaro) > Ok, I'm going to ask a question: > * in your opinion do you think this is something worth pursuing? do you think the time we'd...
[11:22:41] Toolforge Build Service: [apt-buildpack] Can't specify a package by URL or add a repo - https://phabricator.wikimedia.org/T357091 (tstarling) I used it to work around T357085 since it allows you to install packages without installing their dependencies. If these features worked, I would have also used them...
[11:29:34] (DiskSpace) resolved: Disk space cloudbackup1004:9100:/ 5.993% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:37:47] Toolforge (Toolforge iteration 05), cloud-services-team, Kubernetes: Create Bookworm-based standalone webservice image - https://phabricator.wikimedia.org/T355231 (taavi) Open→Resolved
[11:39:04] Toolforge Build Service: [apt-buildpack] Can't specify a package by URL or add a repo - https://phabricator.wikimedia.org/T357091 (dcaro) > If these features worked, I would have also used them in panoviewer. If you recall, I ended up building a static binary locally and running it over NFS because Ubuntu dr...
[11:41:17] (CR) David Caro: [C: +1] "As extra logging is ok for me, note that you might want to reword the commit message too" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 (owner: Andrew Bogott)
[11:42:21] Toolforge: [webservice] Add health probes for port 8080 - https://phabricator.wikimedia.org/T356907 (taavi)
[11:42:33] Grid-Engine-to-K8s-Migration, User-dcaro: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905 (taavi)
[11:42:36] Toolforge: [webservice] Add health probes for port 8080 - https://phabricator.wikimedia.org/T356907 (taavi)
[11:42:47] Grid-Engine-to-K8s-Migration: Migrate glamify from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319772 (dcaro) Ack, thanks @Asaf
[11:43:22] Grid-Engine-to-K8s-Migration: Migrate zoomviewer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320210 (tstarling) > It'll be the same as what I did for panoviewer (except hopefully less complicated). I don't think it's going to be less complicated anymore, but at lea...
[11:48:35] Grid-Engine-to-K8s-Migration, User-dcaro: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905 (dcaro)
[11:48:53] Toolforge (Toolforge iteration 05): Support probes in kubernetes webservices - https://phabricator.wikimedia.org/T341919 (dcaro) Open→In progress
[11:57:56] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:02:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:07:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:12:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:32:57] (SystemdUnitDown) resolved: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:36:21] (CR) Stevemunene: [C: +1] Move #data-platform-sre announcements to a dedicated channel [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/999007 (https://phabricator.wikimedia.org/T352783) (owner: Btullis)
[12:59:56] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:04:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:05:42] Cloud-Services, Toolforge: webservicewatcher does not log exceptions to user log - https://phabricator.wikimedia.org/T115226 (taavi) Open→Declined Declining a grid-related task as the grid is going away.
[13:06:32] Toolforge: Rounding and missing units in VMEM values on http://tools.wmflabs.org/?status create misleading values - https://phabricator.wikimedia.org/T119680 (taavi) Open→Declined Declining a grid-related task as the grid is going away.
[13:06:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[13:07:06] Toolforge, Documentation: 'new webproxy' checklist - https://phabricator.wikimedia.org/T104768 (taavi) Open→Declined
[13:07:08] Toolforge, Documentation, Tracking-Neverending: Toolforge admin guides (tracking) - https://phabricator.wikimedia.org/T104734 (taavi)
[13:07:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[13:08:40] Cloud-Services, Toolforge: Monitor that proxylistener is accepting new connections - https://phabricator.wikimedia.org/T91958 (taavi) Open→Declined Declining a grid-related task as the grid is going away.
[13:09:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:13:01] Toolforge, cloud-services-team: Audit SGE shadow config and puppet manifests - https://phabricator.wikimedia.org/T315416 (taavi) Open→Declined Declining a grid-related task as the grid is going away.
[13:14:31] Toolforge, cloud-services-team: Toolforge: Automate kubernetes control node upgrade - https://phabricator.wikimedia.org/T301000 (taavi) Open→Resolved a:taavi This is more or less done.
[13:14:56] (SystemdUnitDown) resolved: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:16:21] Toolforge, cloud-services-team: [toolforge] Automate maintenance operations - https://phabricator.wikimedia.org/T288583 (taavi)
[13:16:56] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:17:11] Toolforge, cloud-services-team: [toolforge] Automate addition/removal of proxy node - https://phabricator.wikimedia.org/T274500 (taavi) Open→Declined Declining a grid-related task as the grid is going away.
[13:17:28] (InstanceDown) firing: Project tools instance tools-redis-5 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:21:56] (SystemdUnitDown) resolved: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:22:28] (InstanceDown) resolved: Project tools instance tools-redis-5 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:27:56] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:32:56] (SystemdUnitDown) resolved: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:51:10] Toolforge (Toolforge iteration 05), Toolforge Jobs framework, Patch-For-Review, User-aborrero: toolforge: introduce OpenAPI to jobs framework - https://phabricator.wikimedia.org/T356523 (CodeReviewBot) aborrero opened https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/59...
[13:59:56] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:01:48] PAWS: Backup prometheus - https://phabricator.wikimedia.org/T356769 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/373
[14:01:56] vivian-rook opened https://github.com/toolforge/paws/pull/373
[14:04:56] (SystemdUnitDown) resolved: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:10:56] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:11:27] PAWS: Backup prometheus - https://phabricator.wikimedia.org/T356769 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/373
[14:11:37] vivian-rook closed https://github.com/toolforge/paws/pull/373
[14:11:49] PAWS: Backup prometheus - https://phabricator.wikimedia.org/T356769 (rook) Open→Resolved
[14:15:56] (SystemdUnitDown) resolved: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:19:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:24:56] (SystemdUnitDown) resolved: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:28:56] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:32:55] cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review, User-aborrero: cloudgw: add cloud-private subnet support - https://phabricator.wikimedia.org/T338334 (taavi) a:taavi→None
[14:33:56] (SystemdUnitDown) resolved: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:34:02] Toolforge, cloud-services-team: Remove Python/webservice-runner from toolforge web containers - https://phabricator.wikimedia.org/T293552 (taavi)
[14:36:36] Toolforge, cloud-services-team: Support hosting Rust tools on Toolforge - https://phabricator.wikimedia.org/T194953 (taavi) We have a rust buildpack, and since {T355231} a rather standalone image for hosting webservices on NFS. Anything left to do here?
[14:54:00] Tool-Pageviews: Massviews is creating URLs which cannot be used - https://phabricator.wikimedia.org/T357087 (PrimeHunter) Steps to reproduce for a simpler example without so many search results. 1) Go to https://pageviews.wmcloud.org/massviews 2) Select "Search" under "Source" 3) Enter foobar=baz in the nex...
[15:15:36] Toolforge Jobs framework, User-aborrero: jobs-api: Periodically refresh image-config data - https://phabricator.wikimedia.org/T357112 (aborrero)
[15:40:19] Cloud-VPS, cloud-services-team, Patch-For-Review, User-aborrero: Some VPS instances still using ns-recursor0 - https://phabricator.wikimedia.org/T346426 (aborrero) Can we assume the affected VMs that you discovered are either unmaintained, or have some special configuration? Maybe we just got the...
[16:06:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[16:07:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[16:13:42] Cloud-VPS, cloud-services-team, User-aborrero: Improve cloudgw filter between VM instances and cloud-private - https://phabricator.wikimedia.org/T356986 (taavi) Thanks. This generally looks good I think. > ii - Traffic from VMs to specific cloud-private destinations, using as many rules as needed...
[16:18:45] Cloud-VPS, cloud-services-team, User-aborrero: Improve cloudgw filter between VM instances and cloud-private - https://phabricator.wikimedia.org/T356986 (aborrero) The proposal here looks good to me. Let me know if you want me do to the changes, or if you prefer to bootstrap the patch yourself.
[17:02:02] Toolforge (Toolforge iteration 05), User-aborrero: [toolforge API] Investigate ways to present our multiple Openapi definitions to a future consolidated CLI client - https://phabricator.wikimedia.org/T354745 (aborrero) Creating a single openAPI spec from multiple files: https://davidgarcia.dev/posts/how-...
[17:13:10] Data-Services, cloud-services-team, Data Products, Data-Platform, Patch-For-Review: Add global_edit_count to wikireplicas - https://phabricator.wikimedia.org/T344108 (lbowmaker)
[17:28:51] Cloud-VPS, cloud-services-team, Patch-For-Review, User-aborrero: Some VPS instances still using ns-recursor0 - https://phabricator.wikimedia.org/T346426 (Andrew) They were likely broken enough for cumin to not reach them. I'll nonetheless work on that list a bit.
[17:42:57] Toolforge (Toolforge iteration 05), User-aborrero: [toolforge API] Investigate ways to present our multiple Openapi definitions to a future consolidated CLI client - https://phabricator.wikimedia.org/T354745 (dcaro) >>! In T354745#9529817, @aborrero wrote: > Creating a single openAPI spec from multiple f...
[17:43:38] Cloud-VPS, cloud-services-team, Patch-For-Review, User-aborrero: Some VPS instances still using ns-recursor0 - https://phabricator.wikimedia.org/T346426 (Andrew) According to cumin: - Two of those hosts have ns0 in their resolv.conf, - One of them is unreachable (quarry-nfs-dev-02.quarry.eqiad1....
[17:47:27] PAWS: Upgrade Jupyterlab - https://phabricator.wikimedia.org/T357027 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/372
[17:47:34] PAWS: Upgrade Jupyterlab - https://phabricator.wikimedia.org/T357027 (rook) Open→Resolved
[17:47:39] vivian-rook closed https://github.com/toolforge/paws/pull/372
[17:51:16] Toolforge: Listeria bot sometimes gets stuck with 104 errors from Wikimedia APIs - https://phabricator.wikimedia.org/T356160 (dcaro) The pattern I see for the bot is: * At some point it opens many (~1k) connections: ` root@tools-k8s-worker-101:~# lsof -p 7332 | grep TCP | wc...
[18:14:56] Toolforge (Toolforge iteration 05), Patch-For-Review: Support probes in kubernetes webservices - https://phabricator.wikimedia.org/T341919 (CodeReviewBot) dcaro opened https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/23 add probes
[18:38:31] (ToolsToolsDBReplicationLagIsTooHigh) firing: ToolsDB replication on tools-db-3 is lagging behind the primary, the current lag is 278416 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[19:06:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[19:07:26] Grid-Engine-to-K8s-Migration: Tool user not allowed to read jobs/status in Kubernetes - https://phabricator.wikimedia.org/T357172 (LucasWerkmeister)
[19:07:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[19:08:19] Grid-Engine-to-K8s-Migration, Toolforge: Tool user not allowed to read jobs/status in Kubernetes - https://phabricator.wikimedia.org/T357172 (taavi)
[19:21:46] Grid-Engine-to-K8s-Migration, Toolforge: Tool user not allowed to read jobs/status in Kubernetes - https://phabricator.wikimedia.org/T357172 (bd808) Related: {T321919}
[19:25:16] (CR) Krinkle: [C: +2] List the source pages on the gallery pages [labs/tools/fileprotectionsync] - https://gerrit.wikimedia.org/r/987213 (owner: Legoktm)
[19:25:42] (Merged) jenkins-bot: List the source pages on the gallery pages [labs/tools/fileprotectionsync] - https://gerrit.wikimedia.org/r/987213 (owner: Legoktm)
[19:26:44] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4), Goal: [toolsdb] test creating a new replica host - https://phabricator.wikimedia.org/T344717 (fnegri) The procedure worked (with some adjustments I already added to the wiki), and `tools-db-3` started replicating from `tools-db-1`! To be overly...
[19:28:26] (PS6) Andrew Bogott: k8s.reboot: periodically report remaining VMs to reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492
[19:29:53] (PS7) Andrew Bogott: k8s.reboot: periodically report remaining VMs to reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492
[19:31:31] (PS8) Andrew Bogott: k8s.reboot: periodically report remaining VMs to reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492
[19:32:22] (CR) Andrew Bogott: k8s.reboot: periodically report remaining VMs to reboot (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 (owner: Andrew Bogott)
[19:36:39] Tool-extjsonuploader: extjsonuploader complains about "Duplicate extension name 'SomeExtension' detected in these files" - https://phabricator.wikimedia.org/T357095 (Bawolff) I guess there is a question of which one we should use - the wmf one or the non wmf one.
[19:38:56] (CR) Krinkle: [C: +2] "Deployed:" [labs/tools/fileprotectionsync] - https://gerrit.wikimedia.org/r/987213 (owner: Legoktm)
[20:52:36] cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety: Account recovery help needed for Developer account [YOUR USERNAME] - https://phabricator.wikimedia.org/T357177 (Sethabathaba)
[21:08:10] cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety: Account recovery help needed for Developer account [YOUR USERNAME] - https://phabricator.wikimedia.org/T357177 (taavi) Open→Invalid Hi [YOUR USERNAME]. I am unfortunately closing this task because of [FILL REASON HERE]. Please fee...
[21:34:21] Grid-Engine-to-K8s-Migration: Migrate wd-shex-infer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320140 (LucasWerkmeister) Alright, some more debugging and hacking later, **I have a working version of the webservice**. [Job #10](https://lucaswerkmeister-test.toolforg...
[22:07:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[22:11:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[23:01:26] (CloudVPSDesignateLeaks) resolved: (2) Detected 52 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[23:59:49] (CR) Amire80: [C: +1] Merge m2c branch to main [labs/tools/Isa] - https://gerrit.wikimedia.org/r/998266 (https://phabricator.wikimedia.org/T356772) (owner: Eugene233)