[00:08:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:13:28] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:31:29] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate coibot.linkwatcher.eqiad.wmflabs is about to expire in 25d 23h 48m 37s - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetCertificateAboutToExpire - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:46:55] FIRING: MaxConntrack: Max conntrack at 80.27% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:51:55] RESOLVED: MaxConntrack: Max conntrack at 80.27% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[01:16:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-20 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[01:16:24] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[01:16:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-20 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[01:21:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-20 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[01:21:48] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-20 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[01:22:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-20 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[01:24:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-20 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[01:29:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-20 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[01:38:01] (approved) tstarling: Add hourly update-focus-areas command [toolforge-repos/wishlist] - https://gitlab.wikimedia.org/toolforge-repos/wishlist/-/merge_requests/1 (https://phabricator.wikimedia.org/T364648) (owner: samwilson)
[01:48:25] (update) samwilson: Add hourly update-focus-areas command [toolforge-repos/wishlist] - https://gitlab.wikimedia.org/toolforge-repos/wishlist/-/merge_requests/1 (https://phabricator.wikimedia.org/T364648)
[01:50:48] (update) samwilson: Add hourly update-focus-areas command [toolforge-repos/wishlist] - https://gitlab.wikimedia.org/toolforge-repos/wishlist/-/merge_requests/1 (https://phabricator.wikimedia.org/T364648)
[01:53:36] (merge) samwilson: Add hourly update-focus-areas command [toolforge-repos/wishlist] - https://gitlab.wikimedia.org/toolforge-repos/wishlist/-/merge_requests/1 (https://phabricator.wikimedia.org/T364648)
[02:10:27] Cloud-VPS (Project-requests), Beta-Cluster-Infrastructure: Request creation of deployment_prep_s3 VPS project - https://phabricator.wikimedia.org/T372353#10060292 (Andrew) +1 workaround ridiculous bug
[02:13:14] Tools, Infrastructure-Foundations: Requested offboarding-to-volunteer of HTriedman // Transfer ownership of SpinachBot from HTriedman (WMF) to HTriedman - https://phabricator.wikimedia.org/T371644#10060295 (Htriedman) @KFrancis email sent! and @SLyngshede-WMF this hasn't happened yet, but I'm wondering...
[03:53:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-6 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[04:48:18] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-6 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[04:48:48] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-6 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[05:16:25] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[05:41:28] Toolforge (Toolforge iteration 14): [harbor] 2024-07-24 Tools harbor db out of space - https://phabricator.wikimedia.org/T370843#10060402 (Raymond_Ndibe) >>! In T370843#10057898, @dcaro wrote: > So there's three related tables in the postrges database, `execution`, `task` and `schedule`, where the `vendor_ty...
[05:47:39] Toolforge (Toolforge iteration 14): [harbor] 2024-07-24 Tools harbor db out of space - https://phabricator.wikimedia.org/T370843#10060404 (Raymond_Ndibe) Do we have to manually create an `execution` and corresponding `task` for the above failing `schedules`? can that solve our problem?
[06:01:14] Toolforge (Toolforge iteration 14): [harbor] 2024-07-24 Tools harbor db out of space - https://phabricator.wikimedia.org/T370843#10060409 (Raymond_Ndibe) * Also something to think about: majority of our schedules are `RETENTION` (I dare say more than 80%). Can the fact that we have all of those schedules sch...
[07:23:06] Toolforge (Toolforge iteration 14): [harbor] 2024-07-24 Tools harbor db out of space - https://phabricator.wikimedia.org/T370843#10060461 (dcaro) >>! In T370843#10060404, @Raymond_Ndibe wrote: > Do we have to manually create an `execution` and corresponding `task` for the above failing `schedules`? can that...
[07:26:58] Cloud-VPS (Project-requests), Beta-Cluster-Infrastructure: Request creation of deployment_prep_s3 VPS project - https://phabricator.wikimedia.org/T372353#10060463 (dcaro) Open→In progress a:dcaro
[07:28:14] !log dcaro@urcuchillay deployment_prep_s3 START - Cookbook wmcs.vps.create_project for project deployment_prep_s3 in eqiad1 (T372353)
[07:28:15] wmbot~dcaro@urcuchillay: Unknown project "deployment_prep_s3"
[07:28:15] T372353: Request creation of deployment_prep_s3 VPS project - https://phabricator.wikimedia.org/T372353
[07:28:26] !log dcaro@urcuchillay deployment_prep_s3 END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project deployment_prep_s3 in eqiad1 (T372353)
[07:28:26] wmbot~dcaro@urcuchillay: Unknown project "deployment_prep_s3"
[07:32:32] Cloud-VPS (Project-requests), Beta-Cluster-Infrastructure: Request creation of deployment_prep_s3 VPS project - https://phabricator.wikimedia.org/T372353#10060499 (dcaro) Unfortunately, underscores are not valid domain name characters, so the name would have to be something like `deploymentpreps3`, is th...
[07:33:29] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-6
[07:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[07:39:15] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-6
[07:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[07:40:37] Toolforge (Toolforge iteration 14), Patch-For-Review: `webservice` requires effective user to be the tool user and listed in NSS passwd data - https://phabricator.wikimedia.org/T369569#10060514 (dcaro) In progress→Resolved
[07:42:29] Toolforge (Toolforge iteration 14), Patch-For-Review: [jobs-api] Remove authentication and use the api-gateway provided headers - https://phabricator.wikimedia.org/T367180#10060518 (dcaro) In progress→Resolved
[07:47:09] (update) dcaro: auth: use the header passed by the api gateway [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/106 (https://phabricator.wikimedia.org/T367180)
[08:09:55] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#10060574 (ayounsi) Resolved→Open https://netbox.wikimedia.org/extras/scripts/results/78992/ `cloudcephosd1039 (WMF11571) /dcim/devices/5296/ Pr...
[08:24:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-6 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[09:22:07] (open) dcaro: toolforge_deploy_mr: use the correct name when registering an mr [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/181
[09:22:31] (update) dcaro: toolforge_deploy_mr: use the correct name when registering an mr [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/181
[09:26:10] (close) dcaro: toolforge_deploy_mr: use the correct name when registering an mr [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/181
[09:31:42] (approved) dcaro: auth: use the header passed by the api gateway [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/106 (https://phabricator.wikimedia.org/T367180)
[09:31:48] (merge) dcaro: auth: use the header passed by the api gateway [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/106 (https://phabricator.wikimedia.org/T367180)
[09:34:42] (open) project_1317_bot_df3177307bed93c3f34e421e26c86e38: jobs-api: bump to 0.0.329-20240813093158-b193b876 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/479 (https://phabricator.wikimedia.org/T367180)
[09:38:51] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[09:39:20] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#10060929 (dcaro) I got this when trying to set the fqdn (checked others that have the fqdn set on the ipv6, and they don't have the role set, maybe a new...
[09:40:49] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-api
[09:40:57] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[09:42:20] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-api
[09:43:05] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[09:44:22] FIRING: HAProxyBackendUnavailable: HAProxy service wikireplica-db-web-s5 backend clouddb1016.eqiad.wmnet is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[09:48:10] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[09:48:47] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Goal: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424#10060978 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1016.eqiad.wmnet with OS bookworm
[09:49:13] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[09:49:22] FIRING: [2x] HAProxyBackendUnavailable: HAProxy service wikireplica-db-web-s5 backend clouddb1016.eqiad.wmnet is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[09:54:42] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[10:03:38] (update) dcaro: worker: add simple task and worker process [toolforge-repos/sample-complex-app-backend] - https://gitlab.wikimedia.org/toolforge-repos/sample-complex-app-backend/-/merge_requests/1 (https://phabricator.wikimedia.org/T370321)
[10:07:09] (approved) dcaro: jobs-api: bump to 0.0.329-20240813093158-b193b876 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/479 (https://phabricator.wikimedia.org/T367180) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[10:07:11] (update) dcaro: jobs-api: bump to 0.0.329-20240813093158-b193b876 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/479 (https://phabricator.wikimedia.org/T367180) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[10:07:13] (merge) dcaro: jobs-api: bump to 0.0.329-20240813093158-b193b876 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/479 (https://phabricator.wikimedia.org/T367180) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[10:09:11] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 14): [components-api] Get a skeleton of API webservice and implement `/tool//deploy` with build-only features - https://phabricator.wikimedia.org/T362069#10061039 (dcaro) a:dcaro→Slst2020
[10:09:44] Toolforge (Toolforge iteration 14): [harbor] Investigate how to deactivate wal from trove for postrges databases - https://phabricator.wikimedia.org/T370845#10061031 (dcaro) Open→Declined This is invalid now, if we fix the cleanup processes we don't care about the archival (it would be good actua...
[10:11:33] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 14): [sct.frontend] Show the backend status - https://phabricator.wikimedia.org/T370324#10061041 (dcaro) In progress→Resolved
[10:13:17] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 14), Patch-For-Review: [sct.backend] Create worker and connect to redis - https://phabricator.wikimedia.org/T370321#10061036 (dcaro) Open→In progress
[10:19:22] RESOLVED: [2x] HAProxyBackendUnavailable: HAProxy service wikireplica-db-web-s5 backend clouddb1016.eqiad.wmnet is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[10:27:00] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Goal: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424#10061059 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1016.eqiad.wmnet with OS bookworm completed: - cl...
[10:28:02] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Goal: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424#10061075 (fnegri)
[10:53:17] (update) dcaro: worker: add simple task and worker process [toolforge-repos/sample-complex-app-backend] - https://gitlab.wikimedia.org/toolforge-repos/sample-complex-app-backend/-/merge_requests/1 (https://phabricator.wikimedia.org/T370321)
[10:54:21] (open) dcaro: show task status [toolforge-repos/sample-complex-app-frontend] - https://gitlab.wikimedia.org/toolforge-repos/sample-complex-app-frontend/-/merge_requests/2
[10:59:29] (update) dcaro: show task status [toolforge-repos/sample-complex-app-frontend] - https://gitlab.wikimedia.org/toolforge-repos/sample-complex-app-frontend/-/merge_requests/2 (https://phabricator.wikimedia.org/T370321)
[11:00:59] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 14), Patch-For-Review: [sct.backend] Create worker and connect to redis - https://phabricator.wikimedia.org/T370321#10061157 (dcaro)
[11:51:35] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#10061264 (dcaro) Open→Resolved Done :)
[12:11:48] Quarry: Improve idempotency detection with helm diff - https://phabricator.wikimedia.org/T372394 (rook) NEW
[12:11:50] superset.wmcloud.org: Improve idempotency detection with helm diff - https://phabricator.wikimedia.org/T372395 (rook) NEW
[12:15:23] (update) dcaro: worker: add simple task and worker process [toolforge-repos/sample-complex-app-backend] - https://gitlab.wikimedia.org/toolforge-repos/sample-complex-app-backend/-/merge_requests/1 (https://phabricator.wikimedia.org/T370321)
[12:20:26] (update) dcaro: worker: add simple task and worker process [toolforge-repos/sample-complex-app-backend] - https://gitlab.wikimedia.org/toolforge-repos/sample-complex-app-backend/-/merge_requests/1 (https://phabricator.wikimedia.org/T370321)
[12:24:38] Quarry: Improve idempotency detection with helm diff - https://phabricator.wikimedia.org/T372394#10061438 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/quarry/pull/62
[12:24:45] vivian-rook opened https://github.com/toolforge/quarry/pull/62
[12:30:04] Quarry: remove k8s_123_2 cluster from tofu - https://phabricator.wikimedia.org/T372397 (rook) NEW
[12:31:05] vivian-rook opened https://github.com/toolforge/quarry/pull/63
[12:34:46] vivian-rook closed https://github.com/toolforge/quarry/pull/63
[12:40:18] Quarry: Improve idempotency detection with helm diff - https://phabricator.wikimedia.org/T372394#10061464 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/quarry/pull/62
[12:40:27] vivian-rook closed https://github.com/toolforge/quarry/pull/62
[12:46:08] Quarry: Improve idempotency detection with helm diff - https://phabricator.wikimedia.org/T372394#10061471 (rook) Open→Resolved a:rook
[12:46:40] Quarry: remove k8s_123_2 cluster from tofu - https://phabricator.wikimedia.org/T372397#10061479 (rook) https://github.com/toolforge/quarry/pull/63
[12:46:53] Quarry: remove k8s_123_2 cluster from tofu - https://phabricator.wikimedia.org/T372397#10061480 (rook) Open→Resolved
[12:48:57] vivian-rook opened https://github.com/toolforge/superset-deploy/pull/29
[12:53:38] cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544#10061502 (cmooney)
[12:53:43] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878#10061503 (cmooney)
[12:59:21] superset.wmcloud.org: Improve idempotency detection with helm diff - https://phabricator.wikimedia.org/T372395#10061546 (rook) Looks like we're getting ` 'global.postgresql.auth.postgresPassword' must not be empty, please add '--set global.postgresql.auth.postgresPassword=$POSTGRES_PASSWORD' to the comma...
[13:11:27] cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety: Account recovery help needed for Developer account [DaxServer] - https://phabricator.wikimedia.org/T372401 (DaxServer) NEW
[13:14:37] cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety: Account recovery help needed for Developer account [DaxServer] - https://phabricator.wikimedia.org/T372401#10061578 (DaxServer) Toolserver verification: ` tools-bastion-13.tools.eqiad1.wikimedia.cloud:/home/daxserver/password-reset-reque...
[13:22:09] superset.wmcloud.org: Improve idempotency detection with helm diff - https://phabricator.wikimedia.org/T372395#10061610 (rook) Adding the password gives new errors: ` WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /home/rook/superset-deploy/tofu/kube.config coalesce.go:...
[13:27:50] superset.wmcloud.org: Improve idempotency detection with helm diff - https://phabricator.wikimedia.org/T372395#10061643 (rook) May be worth deploying to a parallel cluster to see if error persists in a new cluster.
[13:33:08] superset.wmcloud.org: Improve idempotency detection with helm diff - https://phabricator.wikimedia.org/T372395#10061662 (fnegri)
[13:33:10] cloud-services-team (FY2023/2024-Q3-Q4), superset.wmcloud.org: Allow Superset to query ToolsDB public databases - https://phabricator.wikimedia.org/T367393#10061663 (fnegri)
[13:40:53] (PS1) Lokal Profil: Updating toolforge login host [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1062402
[13:52:15] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE: Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10061766 (Jhancock.wm) a:Jhancock.wm
[13:59:52] Toolforge (Toolforge iteration 14): [harbor] 2024-07-24 Tools harbor db out of space - https://phabricator.wikimedia.org/T370843#10061785 (Raymond_Ndibe) >>! In T370843#10060461, @dcaro wrote: >>>! In T370843#10060404, @Raymond_Ndibe wrote: >> Do we have to manually create an `execution` and corresponding `t...
[14:04:41] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878#10061798 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=50666174-cba4-46b9-8fa9-cdf8d3361058) set by cmooney@cumin1002 for 0:40:00 on 7 host(s) and their servi...
[14:05:35] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878#10061799 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3db725ef-06d9-4ef6-8e5f-eecd4b7c5f0f) set by cmooney@cumin1002 for 0:30:00 on 30 host(s) and their serv...
[14:12:03] PROBLEM - toolschecker: All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 508 bytes in 3.011 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[14:14:35] Tools, Infrastructure-Foundations: Requested offboarding-to-volunteer of HTriedman // Transfer ownership of SpinachBot from HTriedman (WMF) to HTriedman - https://phabricator.wikimedia.org/T371644#10061826 (DSeyfert_WMF) Hi @Htriedman - we've kept your Wiki and 1Password accounts active given your pendin...
[14:25:46] (PS1) Krinkle: Reduce memory usage [labs/codesearch] - https://gerrit.wikimedia.org/r/1062418
[14:27:39] (PS2) Krinkle: Reduce memory usage [labs/codesearch] - https://gerrit.wikimedia.org/r/1062418
[14:32:49] Toolforge (Toolforge iteration 14): [harbor] 2024-07-24 Tools harbor db out of space - https://phabricator.wikimedia.org/T370843#10061889 (dcaro) > No @dcaro, execution records for the failing schedules do not exist. We only have one one schedule with the `id` of `3218`, `vendor_id` of `-1` and `vendor_type...
[15:04:28] Toolforge (Toolforge iteration 14): [harbor] 2024-07-24 Tools harbor db out of space - https://phabricator.wikimedia.org/T370843#10062061 (dcaro) by looking at https://github.com/goharbor/harbor/blob/ccceacfa73db3cb26e2dd3ef8ffa8f706eef3030/src/jobservice/sync/schedule.go#L249, I suspect that the policy as l...
[15:27:55] RESOLVED: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[15:28:49] Cloud-VPS (Project-requests), Beta-Cluster-Infrastructure: Request creation of deployment_prep_s3 VPS project - https://phabricator.wikimedia.org/T372353#10062128 (bd808) >>! In T372353#10060499, @dcaro wrote: > Unfortunately, underscores are not valid domain name characters, so the name would have to be...
[15:40:51] Toolforge (Toolforge iteration 14): [harbor] 2024-07-24 Tools harbor db out of space - https://phabricator.wikimedia.org/T370843#10062154 (dcaro) >>! In T370843#10062104, @Raymond_Ndibe wrote: >>>! In T370843#10062061, @dcaro wrote: >> by looking at https://github.com/goharbor/harbor/blob/ccceacfa73db3cb26e2...
[15:42:04] PROBLEM - toolschecker: All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 508 bytes in 3.015 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[15:54:29] cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety: Account recovery help needed for Developer account [DaxServer] - https://phabricator.wikimedia.org/T372401#10062184 (DaxServer) Please add the email: daxserver@icloud.com
[16:15:30] (update) dcaro: [toolforge-weld] move _display_message into toolforge weld [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/46 (owner: raymond-ndibe)
[16:23:20] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878#10062290 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=15f30d47-cb35-4a71-a13e-bd0b11e61af8) set by cmooney@cumin1002 for 6:00:00 on 7 host(s) and their servi...
[16:54:29] cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety: Account recovery help needed for Developer account [DaxServer] - https://phabricator.wikimedia.org/T372401#10062374 (bd808) Open→In progress a:bd808
[16:56:39] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 2 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[17:01:20] RESOLVED: [4x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1031 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[17:08:14] cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety: Account recovery help needed for Developer account [DaxServer] - https://phabricator.wikimedia.org/T372401#10062397 (bd808) In progress→Resolved @DaxServer I can see your newly set email address in the read-only LDAP replica n...
[17:32:00] RECOVERY - toolschecker: All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.372 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[17:52:10] cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety: Account recovery help needed for Developer account [DaxServer] - https://phabricator.wikimedia.org/T372401#10062495 (DaxServer) Thanks @bd808 I changed the email address and have a new password. When I login using the "DaxServer" acco...
[18:01:46] cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety: Account recovery help needed for Developer account [DaxServer] - https://phabricator.wikimedia.org/T372401#10062541 (bd808) >>! In T372401#10062495, @DaxServer wrote: > However, when I move on to idm.wikimedia.org, the account with "d...
[19:46:00] FIRING: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[19:48:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-20 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[19:51:00] FIRING: [2x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[20:10:59] cloud-services-team, wikitech.wikimedia.org, Infrastructure-Foundations, serviceops: wikitech self-auth: Allow wikitech to use its own internal authentication - https://phabricator.wikimedia.org/T371588#10062856 (bd808)
[20:34:47] Cloud-VPS (Debian Buster Deprecation), Humaniki: Cloud VPS "wikidumpparse" project Buster deprecation - https://phabricator.wikimedia.org/T367561#10062930 (Maximilianklein) update for 2024-08-13 [x] create cinder volume. [x] move project code [x] move mysql-db files [x] create a new debian bookworm inst...
[21:02:01] Cloud-VPS, Beta-Cluster-Infrastructure: OpenTofu fails to provision a Magnum managed k8s cluster in deployment-prep - https://phabricator.wikimedia.org/T372365#10063009 (bd808) Manual cleanup of tofu failure: ` $ sudo wmcs-openstack coe cluster list +--------------------------------------+---------------...
[21:19:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[21:23:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-20 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[21:37:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures
[21:41:42] (PS3) GergesShamon: use date() instead of strftime() [labs/tools/intuition] - https://gerrit.wikimedia.org/r/1055408 (https://phabricator.wikimedia.org/T331468)
[22:14:08] Cloud-VPS, Beta-Cluster-Infrastructure: OpenTofu fails to provision a Magnum managed k8s cluster in deployment-prep - https://phabricator.wikimedia.org/T372365#10063118 (bd808) {T332194} looks to have been the same general problem ("Failed to create trustee or trust for Cluster"). Per T332194#8710538 I t...
[22:54:15] Tool-Pageviews: pageviews tool doesn't work in several newer wikis - https://phabricator.wikimedia.org/T371997#10063169 (MusikAnimal) I've added and deployed a few dozen projects that hopefully is now the complete list. I'm keeping this task open to track the effort to automate this process.
[22:57:29] Tool-Pageviews: Automatically detect available projects in Pageviews - https://phabricator.wikimedia.org/T371997#10063170 (MusikAnimal) Open→In progress p:Triage→High
[23:02:38] Tool-Pageviews: Validate projects on entry in Pageviews instead of bundling the allowlist - https://phabricator.wikimedia.org/T371997#10063191 (MusikAnimal)
[23:03:57] Tool-Pageviews: Add support for Wikifunctions.org - https://phabricator.wikimedia.org/T354285#10063193 (MusikAnimal) Open→Resolved a:MusikAnimal Apologies for the long delay. This is now done: https://pageviews.wmcloud.org/topviews/?project=wikifunctions.org I'm working on finally, //finally...
[23:12:36] Cloud-VPS, Beta-Cluster-Infrastructure: OpenTofu fails to provision a Magnum managed k8s cluster in deployment-prep - https://phabricator.wikimedia.org/T372365#10063201 (bd808) Open→Resolved ` Apply complete! Resources: 3 added, 0 changed, 0 destroyed. ` The need for "Unrestricted (dangerous...
[23:51:00] FIRING: [2x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse