[00:31:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[00:51:45] Tool-gawa: [Code] Design of the CONTRIBUTOR FORM page - https://phabricator.wikimedia.org/T404527#11179398 (poro26)
[00:52:15] Tool-gawa: [Code] Design of the ORGANIZER FORM page - https://phabricator.wikimedia.org/T404528#11179400 (poro26)
[00:53:01] Tool-gawa: [Code] Design of the ORGANIZER FORM page - https://phabricator.wikimedia.org/T404528#11179402 (poro26) Mockup updated.
[00:53:06] Tool-gawa: [Code] Design of the CONTRIBUTOR FORM page - https://phabricator.wikimedia.org/T404527#11179403 (poro26) Mockup updated.
[00:56:55] FIRING: [2x] ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 close to running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[01:01:55] FIRING: [2x] ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 close to running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[02:22:43] (open) raymond-ndibe: [build, api] support build queueing beyond max_parallel build config [repos/cloud/toolforge/builds-api] (run_pipeline_cleanup_per_repo) - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T402568)
[02:22:50] (update) raymond-ndibe: [build, api] support build queueing beyond max_parallel build config [repos/cloud/toolforge/builds-api] (run_pipeline_cleanup_per_repo) - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T402568)
[02:24:31] (update) raymond-ndibe: [build, api] support build queueing beyond max_parallel build config [repos/cloud/toolforge/builds-api] (run_pipeline_cleanup_per_repo) - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T402568)
[04:35:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-66 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[05:01:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[05:31:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[05:56:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[07:23:43] (CR) Volans: [C:+1] "LGTM" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1186443 (owner: David Caro)
[08:12:38] FIRING: [2x] ProbeDown: Service toolsbeta-test-k8s-haproxy-5:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:13:43] cloud-services-team (FY2025/26-Q1), Wikidocumentaries: wikidocumentaries on WMCS is sending millions of invalid requests to Europeana.eu servers - https://phabricator.wikimedia.org/T404347#11179840 (PatEhlert) I can confirm that we are now receiving proper search requests, with API key and changed user a...
[08:14:17] cloud-services-team (FY2025/26-Q1), Wikidocumentaries: wikidocumentaries on WMCS is sending millions of invalid requests to Europeana.eu servers - https://phabricator.wikimedia.org/T404347#11179842 (PatEhlert) In progress→Resolved
[08:16:18] Toolforge (Toolforge iteration 24): [logging,lima-kilo] loki setup fails to start on linux - https://phabricator.wikimedia.org/T404226#11179845 (dcaro)
[08:17:37] RESOLVED: [2x] ProbeDown: Service toolsbeta-test-k8s-haproxy-5:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:03:17] (update) taavi: list: Fix type of one-off jobs [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/127 (https://phabricator.wikimedia.org/T404490)
[09:15:19] (PS1) Filippo Giunchedi: passwords: root authorized-keys has moved to puppet.git [labs/private] - https://gerrit.wikimedia.org/r/1188287 (https://phabricator.wikimedia.org/T317362)
[09:16:25] cloud-services-team, Cloud-VPS, Patch-For-Review, Security: Move cloud-wide root keys to the main puppet repo - https://phabricator.wikimedia.org/T317362#11180154 (fgiunchedi)
[09:16:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[09:31:29] cloud-services-team, Toolforge (Toolforge iteration 24), Patch-For-Review: jobs-cli is reporting one-off jobs as continuous - https://phabricator.wikimedia.org/T404490#11180237 (dcaro) p:Triage→High
[09:31:38] cloud-services-team, Toolforge (Toolforge iteration 24), Patch-For-Review: jobs-cli is reporting one-off jobs as continuous - https://phabricator.wikimedia.org/T404490#11180240 (dcaro)
[09:31:42] cloud-services-team, Toolforge (Toolforge iteration 24), Patch-For-Review: jobs-cli is reporting one-off jobs as continuous - https://phabricator.wikimedia.org/T404490#11180242 (dcaro) Open→In progress
[09:31:43] (approved) dcaro: list: Fix type of one-off jobs [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/127 (https://phabricator.wikimedia.org/T404490) (owner: taavi)
[09:31:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[09:57:45] FIRING: Toolforge Kyverno no policy resources: Toolforge Kyverno has no policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_no_policy_resources - https://grafana.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+no+policy+resources
[09:57:45] FIRING: Toolforge Kyverno unknown state: Toolforge Kyverno has unknown state. Kyverno might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_unknown_state - https://grafana.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+unknown+state
[10:01:19] FIRING: TektonUpMetricUnknown: Tekton might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonUpMetricUnknown
[10:01:21] FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[10:01:24] FIRING: JobsApiUpMetricUnknown: JobsApi might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsApiUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsApiUpMetricUnknown
[10:01:33] FIRING: EnvvarsAdmissionDown: EnvvarsAdmission is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/EnvvarsAdmissionDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DEnvvarsAdmissionDown
[10:01:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[10:01:58] FIRING: JobsEmailerUpMetricUnknown: JobsEmailer might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsEmailerUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsEmailerUpMetricUnknown
[10:02:18] FIRING: EnvvarsApiUpMetricUnknown: EnvvarsApi might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/EnvvarsApiUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DEnvvarsApiUpMetricUnknown
[10:02:23] FIRING: ToolforgeKubernetesNodeNotReady: Multiple Kubernetes nodes are not ready #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[10:02:26] FIRING: BuildsApiUpMetricUnknown: BuildsApi might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/BuildsApiUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DBuildsApiUpMetricUnknown
[10:02:33] FIRING: ComponentsApiUpMetricUnknown: ComponentsApi might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ComponentsApiUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DComponentsApiUpMetricUnknown
[10:14:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-nfs-3 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[10:21:19] RESOLVED: TektonUpMetricUnknown: Tekton might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonUpMetricUnknown
[10:21:24] RESOLVED: JobsApiUpMetricUnknown: JobsApi might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsApiUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsApiUpMetricUnknown
[10:21:58] RESOLVED: JobsEmailerUpMetricUnknown: JobsEmailer might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsEmailerUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsEmailerUpMetricUnknown
[10:22:18] RESOLVED: EnvvarsApiUpMetricUnknown: EnvvarsApi might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/EnvvarsApiUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DEnvvarsApiUpMetricUnknown
[10:22:33] RESOLVED: ComponentsApiUpMetricUnknown: ComponentsApi might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ComponentsApiUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DComponentsApiUpMetricUnknown
[10:22:45] RESOLVED: Toolforge Kyverno unknown state: Toolforge Kyverno has unknown state. Kyverno might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_unknown_state - https://grafana.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+unknown+state
[10:22:45] RESOLVED: Toolforge Kyverno no policy resources: Toolforge Kyverno has no policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_no_policy_resources - https://grafana.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+no+policy+resources
[10:29:28] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[10:34:28] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[10:37:38] FIRING: ProbeDown: Service toolsbeta-test-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:42:38] RESOLVED: ProbeDown: Service toolsbeta-test-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:45:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[10:47:26] (update) raymond-ndibe: [build, api] support build queueing beyond max_parallel build config [repos/cloud/toolforge/builds-api] (run_pipeline_cleanup_per_repo) - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T402568)
[10:48:18] (CR) Muehlenhoff: [C:+1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/1188287 (https://phabricator.wikimedia.org/T317362) (owner: Filippo Giunchedi)
[10:49:28] RESOLVED: [2x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[10:51:32] (open) dcaro: loki: fix local networkpolicy [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/963
[10:51:38] (update) dcaro: loki: fix local networkpolicy [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/963
[10:55:21] (merge) raymond-ndibe: [cli] add tool config to deployment object [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/58 (https://phabricator.wikimedia.org/T400064)
[11:01:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[11:09:37] (open) raymond-ndibe: d/changelog: bump to 0.0.15 [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/61 (https://phabricator.wikimedia.org/T400064)
[11:11:56] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-cli
[11:17:12] (update) raymond-ndibe: [build, api] support build queueing beyond max_parallel build config [repos/cloud/toolforge/builds-api] (run_pipeline_cleanup_per_repo) - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T402568)
[11:21:26] !log raymond-ndibe@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component components-cli
[11:21:36] (update) raymond-ndibe: [build, api] support build queueing beyond max_parallel build config [repos/cloud/toolforge/builds-api] (run_pipeline_cleanup_per_repo) - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T402568)
[11:23:18] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-cli
[11:23:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[11:24:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[11:25:39] (update) raymond-ndibe: logs: use logs-api for logs [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/121 (owner: dcaro)
[11:28:55] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[11:29:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[11:31:05] (CR) Majavah: [C:-1] passwords: root authorized-keys has moved to puppet.git (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/1188287 (https://phabricator.wikimedia.org/T317362) (owner: Filippo Giunchedi)
[11:32:03] (merge) taavi: list: Fix type of one-off jobs [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/127 (https://phabricator.wikimedia.org/T404490)
[11:32:16] !log raymond-ndibe@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component components-cli
[11:32:56] (PS2) Filippo Giunchedi: passwords: root authorized-keys has moved to puppet.git [labs/private] - https://gerrit.wikimedia.org/r/1188287 (https://phabricator.wikimedia.org/T317362)
[11:33:05] (CR) Filippo Giunchedi: passwords: root authorized-keys has moved to puppet.git (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/1188287 (https://phabricator.wikimedia.org/T317362) (owner: Filippo Giunchedi)
[11:33:31] (CR) Majavah: [C:+1] passwords: root authorized-keys has moved to puppet.git [labs/private] - https://gerrit.wikimedia.org/r/1188287 (https://phabricator.wikimedia.org/T317362) (owner: Filippo Giunchedi)
[11:38:55] (open) taavi: d/changelog: bump to 16.1.20 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/128 (https://phabricator.wikimedia.org/T404490)
[11:38:56] (update) taavi: d/changelog: bump to 16.1.20 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/128 (https://phabricator.wikimedia.org/T404490)
[11:46:05] (update) raymond-ndibe: [build, api] support build queueing beyond max_parallel build config [repos/cloud/toolforge/builds-api] (run_pipeline_cleanup_per_repo) - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T402568)
[11:48:03] (update) raymond-ndibe: [build, api] support build queueing beyond max_parallel build config [repos/cloud/toolforge/builds-api] (run_pipeline_cleanup_per_repo) - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T402568)
[11:53:33] (update) raymond-ndibe: d/changelog: bump to 0.0.15 [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/61 (https://phabricator.wikimedia.org/T395077 https://phabricator.wikimedia.org/T398424 https://phabricator.wikimedia.org/T400064)
[11:53:54] (update) raymond-ndibe: d/changelog: bump to 0.0.15 [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/61 (https://phabricator.wikimedia.org/T395077 https://phabricator.wikimedia.org/T398424 https://phabricator.wikimedia.org/T400064)
[12:02:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[12:03:02] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-cli
[12:03:20] !log raymond-ndibe@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component components-cli
[12:03:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[12:06:58] (open) dcaro: run_functional_tests: use the python3.13 image for venv creation [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/964 (https://phabricator.wikimedia.org/T402377)
[12:07:13] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-cli
[12:07:29] !log raymond-ndibe@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component components-cli
[12:07:56] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[12:08:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[12:12:06] (PS1) David Caro: toolforge.inventory: sort bastions from newer to older [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1188325
[12:15:10] (update) raymond-ndibe: [build, api] support build queueing beyond max_parallel build config [repos/cloud/toolforge/builds-api] (run_pipeline_cleanup_per_repo) - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/143 (https://phabricator.wikimedia.org/T402568)
[12:17:01] (approved) filippo: run_functional_tests: use the python3.13 image for venv creation [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/964 (https://phabricator.wikimedia.org/T402377) (owner: dcaro)
[12:17:52] (CR) Filippo Giunchedi: [C:+1] toolforge.inventory: sort bastions from newer to older [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1188325 (owner: David Caro)
[12:21:25] (merge) dcaro: run_functional_tests: use the python3.13 image for venv creation [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/964 (https://phabricator.wikimedia.org/T402377)
[12:21:33] (CR) David Caro: [C:+2] toolforge.inventory: sort bastions from newer to older [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1188325 (owner: David Caro)
[12:25:06] (Merged) jenkins-bot: toolforge.inventory: sort bastions from newer to older [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1188325 (owner: David Caro)
[12:31:28] (update) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/44
[12:33:49] cloud-services-team: Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584 (fgiunchedi) NEW
[12:33:55] Toolforge (Toolforge iteration 24): [kube-state-metrics,wmcs-k8s-metrics] the images from quay don't work anymore - https://phabricator.wikimedia.org/T404585 (dcaro) NEW
[12:35:39] !log filippo@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-12, tools-k8s-worker-nfs-66
[12:38:13] Toolforge (Toolforge iteration 24): [kube-state-metrics,wmcs-k8s-metrics] the images from quay don't work anymore - https://phabricator.wikimedia.org/T404585#11180661 (dcaro) According to this https://github.com/kubernetes/kube-state-metrics?tab=readme-ov-file#compatibility-matrix we are using a relatively n...
[12:39:25] cloud-services-team, Toolforge: Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11180662 (taavi)
[12:42:55] Toolforge (Toolforge iteration 24): [kube-state-metrics,wmcs-k8s-metrics] the images from quay don't work anymore - https://phabricator.wikimedia.org/T404585#11180665 (dcaro) We use chart `6.1.4`, latest is `6.3.0`, will update. Found this: ` ## v2.0.0-rc.1 / 2021-03-26 * [CHANGE] Rename --labels-metric-...
[12:44:17] (open) dcaro: wmcs-k8s-metrics: update kube-state-metrics to latest [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/965
[12:44:34] Toolforge (Toolforge iteration 24): [kube-state-metrics,wmcs-k8s-metrics] the images from quay don't work anymore - https://phabricator.wikimedia.org/T404585#11180670 (dcaro) Open→In progress p:Triage→High a:dcaro
[12:47:22] !log filippo@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-12, tools-k8s-worker-nfs-66
[12:47:56] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[12:48:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[12:52:34] FIRING: DiskSpace: Disk space cloudbackup1004:9100:/srv 6.996% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:53:06] !log dcaro@acme paws START - Cookbook wmcs.vps.instance.force_reboot vm paws-127b-rpchztfjt2jb-node-1 (cluster eqiad1, project paws)
[12:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[12:53:33] !log dcaro@acme paws END (PASS) - Cookbook wmcs.vps.instance.force_reboot (exit_code=0) vm paws-127b-rpchztfjt2jb-node-1 (cluster eqiad1, project paws)
[12:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[12:54:38] !log dcaro@acme paws START - Cookbook wmcs.vps.instance.force_reboot vm paws-127b-rpchztfjt2jb-node-2 (cluster eqiad1, project paws)
[12:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[12:56:03] !log dcaro@acme paws END (PASS) - Cookbook wmcs.vps.instance.force_reboot (exit_code=0) vm paws-127b-rpchztfjt2jb-node-2 (cluster eqiad1, project paws)
[12:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[12:56:15] !log dcaro@acme paws START - Cookbook wmcs.vps.instance.force_reboot vm paws-127b-rpchztfjt2jb-node-4 (cluster eqiad1, project paws)
[12:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[12:56:21] !log dcaro@acme paws END (PASS) - Cookbook wmcs.vps.instance.force_reboot (exit_code=0) vm paws-127b-rpchztfjt2jb-node-4 (cluster eqiad1, project paws)
[12:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[12:59:13] cloud-services-team, Toolforge: Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11180712 (fgiunchedi)
[13:02:56] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[13:03:27] cloud-services-team, Toolforge (Toolforge iteration 24): Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11180740 (dcaro) Open→In progress a:fgiunchedi
[13:03:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[13:07:34] cloud-services-team, Toolforge (Toolforge iteration 24): Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11180754 (dcaro) p:Triage→High
[13:07:35] Toolforge (Toolforge iteration 24): [logging,lima-kilo] loki setup fails to start on linux - https://phabricator.wikimedia.org/T404226#11180756 (dcaro) Open→In progress
[13:07:36] cloud-services-team, Toolforge (Toolforge iteration 24): Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11180757 (fgiunchedi) p:High→Triage
[13:07:41] Toolforge (Toolforge iteration 24): [logging,lima-kilo] loki setup fails to start on linux - https://phabricator.wikimedia.org/T404226#11180762 (dcaro) a:dcaro
[13:08:56] Toolforge (Toolforge iteration 24): [jobs-api] loki logs take really long to appear - https://phabricator.wikimedia.org/T404176#11180764 (dcaro) p:Triage→Medium a:dcaro
[13:09:04] Toolforge (Toolforge iteration 24): [jobs-api] loki logs take really long to appear - https://phabricator.wikimedia.org/T404176#11180768 (dcaro) Open→In progress
[13:13:24] Toolforge (Toolforge iteration 24): [kube-state-metrics,wmcs-k8s-metrics] the images from quay don't work anymore - https://phabricator.wikimedia.org/T404585#11180782 (dcaro) It turns out that we had a very old deployment in the `kube-system` namespace of `kube-state-metrics`, that is not handled by helm. T...
[13:13:41] (03close) 10dcaro: wmcs-k8s-metrics: update kube-state-metrics to latest [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/965 [13:14:21] 10Toolforge (Toolforge iteration 24): [kube-state-metrics,wmcs-k8s-metrics] the images from quay don't work anymore - https://phabricator.wikimedia.org/T404585#11180785 (10dcaro) 05In progress→03Resolved [13:14:44] (03approved) 10dcaro: d/changelog: bump to 16.1.20 [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/128 (https://phabricator.wikimedia.org/T404490) (owner: 10taavi) [13:16:28] (03update) 10dcaro: functional-tests: fix log checking tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961 [13:16:47] (03update) 10dcaro: loki: fix local networkpolicy [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/963 [13:29:55] (03PS6) 10David Caro: vps.instance.force_reboot: add cookbook [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 [13:30:03] (03CR) 10David Caro: vps.instance.force_reboot: add cookbook (031 comment) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 (owner: 10David Caro) [13:34:06] (03CR) 10CI reject: [V:04-1] vps.instance.force_reboot: add cookbook [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 (owner: 10David Caro) [13:38:00] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe) [13:38:35] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous 
[repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe) [13:40:27] (03update) 10dcaro: loki: fix local networkpolicy [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/963 [13:40:44] (03update) 10dcaro: functional-tests: fix log checking tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961 [13:40:46] (03update) 10dcaro: functional-tests: fix log checking tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961 [13:41:07] (03update) 10dcaro: loki.alloy: decrease frequency for fetching logs [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/962 [13:43:13] (03update) 10dcaro: scheduledjobs: increase the history to allow log retrieval [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/214 [13:57:06] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [14:00:29] (03update) 10dcaro: logs_api: add the option to enable logs-api [repos/cloud/toolforge/api-gateway] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/75 [14:04:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [14:06:19] !log taavi@cloudcumin1001 toolsbeta END (PASS) - 
Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli [14:08:34] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [14:09:48] !log taavi@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli [14:10:04] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [14:10:22] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli [14:10:59] (03merge) 10taavi: d/changelog: bump to 16.1.20 [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/128 (https://phabricator.wikimedia.org/T404490) [14:11:53] 06cloud-services-team, 10Toolforge (Toolforge iteration 24), 13Patch-For-Review: jobs-cli is reporting one-off jobs as continuous - https://phabricator.wikimedia.org/T404490#11181004 (10taavi) 05In progress→03Resolved [14:27:34] RESOLVED: DiskSpace: Disk space cloudbackup1004:9100:/srv 6.977% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [14:34:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [14:36:47] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] passwords: root authorized-keys has moved to puppet.git [labs/private] - 10https://gerrit.wikimedia.org/r/1188287 (https://phabricator.wikimedia.org/T317362) (owner: 10Filippo Giunchedi) [14:57:44] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review, 
07Security: Move cloud-wide root keys to the main puppet repo - https://phabricator.wikimedia.org/T317362#11181191 (10fgiunchedi) >>! In T317362#11175946, @fnegri wrote: > Thanks @fgiunchedi for the patch! > > A few docs will need to be updated after... [14:58:09] 06cloud-services-team, 10Toolforge: [jobs-api] use `launcher` also for health-check script commands - https://phabricator.wikimedia.org/T403735#11181196 (10dcaro) [15:45:25] 06cloud-services-team, 10Toolforge: (sd-pam) killed by Wheel of Misfortune on Toolforge bastion - https://phabricator.wikimedia.org/T404601 (10bd808) 03NEW [15:51:55] 10cloud-services-team (FY2025/26-Q1), 10Wikidocumentaries: wikidocumentaries on WMCS is sending millions of invalid requests to Europeana.eu servers - https://phabricator.wikimedia.org/T404347#11181496 (10bd808) a:05fnegri→03TuukkaH [15:52:59] 10Cloud-VPS (Project-requests): Request creation of gitlab-runners-staging VPS project - https://phabricator.wikimedia.org/T404386#11181498 (10dduvall) 05Resolved→03Open >>! In T404386#11177989, @bd808 wrote: >>>! In T404386#11177958, @dduvall wrote: >> @Andrew I don't see any zones listed in the project. Is... 
[16:00:17] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: (sd-pam) killed by Wheel of Misfortune on Toolforge bastion - https://phabricator.wikimedia.org/T404601#11181550 (10taavi) a:03taavi [16:01:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-50 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [16:11:26] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: (sd-pam) killed by Wheel of Misfortune on Toolforge bastion - https://phabricator.wikimedia.org/T404601#11181605 (10taavi) 05Open→03Resolved [16:28:59] (03merge) 10taavi: Remove non-useful defaults [repos/cloud/cloud-vps/nova_fullstack_test] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/nova_fullstack_test/-/merge_requests/10 [16:29:28] (03open) 10taavi: Update Puppet last run file location [repos/cloud/cloud-vps/nova_fullstack_test] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/nova_fullstack_test/-/merge_requests/11 [16:29:33] (03update) 10taavi: Update Puppet last run file location [repos/cloud/cloud-vps/nova_fullstack_test] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/nova_fullstack_test/-/merge_requests/11 [16:30:18] (03merge) 10taavi: Update Puppet last run file location [repos/cloud/cloud-vps/nova_fullstack_test] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/nova_fullstack_test/-/merge_requests/11 [16:44:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [16:45:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [16:46:38] (03merge) 10dcaro: values: Mount /etc/openstack/clouds.yaml [repos/cloud/toolforge/volume-admission] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/36 (https://phabricator.wikimedia.org/T404438) (owner: 10taavi) [16:50:22] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: volume-admission: bump to 0.0.72-20250915164649-3238fa82 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/966 (https://phabricator.wikimedia.org/T404438) [16:52:55] (03merge) 10taavi: tools: Drop floating IPs for Bookworm bastions [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/79 (https://phabricator.wikimedia.org/T392510) [17:21:11] (03update) 10raymond-ndibe: scheduledjobs: increase the history to allow log retrieval [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/214 (owner: 10dcaro) [17:21:14] (03merge) 10raymond-ndibe: scheduledjobs: increase the history to allow log retrieval [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/214 (owner: 10dcaro) [17:24:16] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: jobs-api: bump to 0.0.414-20250915172125-3b82d2c2 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/967 (https://phabricator.wikimedia.org/T404176) [17:24:56] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [17:25:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [17:33:38] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe) [18:03:02] (03update) 10raymond-ndibe: loki.alloy: decrease frequency for fetching logs [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/962 (owner: 10dcaro) [18:03:14] (03update) 10raymond-ndibe: loki.alloy: decrease frequency for fetching logs [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/962 (owner: 10dcaro) [18:03:24] (03update) 10raymond-ndibe: loki.alloy: decrease frequency for fetching logs [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/962 (owner: 10dcaro) [18:06:39] (03update) 10raymond-ndibe: logs: use logs-api for logs [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/121 (owner: 10dcaro) [18:23:09] (03update) 10raymond-ndibe: package: upgrade all deps [repos/cloud/toolforge/envvars-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/64 (owner: 10dcaro) [18:23:10] (03approved) 10raymond-ndibe: package: upgrade all deps [repos/cloud/toolforge/envvars-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/64 (owner: 10dcaro) [18:24:15] (03approved) 10raymond-ndibe: pre-commit: add check for openapi spec version bump [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/116 (owner: 10dcaro) [18:34:04] (03update) 10raymond-ndibe: 
toolforge_deploy_mr: also wait when pipeline is creating [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/272 (owner: 10dcaro) [18:34:07] (03approved) 10raymond-ndibe: toolforge_deploy_mr: also wait when pipeline is creating [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/272 (owner: 10dcaro) [18:34:49] (03update) 10raymond-ndibe: pacakage: bump dependencies [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/141 (owner: 10dcaro) [18:34:50] (03approved) 10raymond-ndibe: pacakage: bump dependencies [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/141 (owner: 10dcaro) [18:36:35] (03update) 10raymond-ndibe: package: upgrade dependencies [repos/cloud/toolforge/registry-admission] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/registry-admission/-/merge_requests/29 (owner: 10dcaro) [18:36:36] (03approved) 10raymond-ndibe: package: upgrade dependencies [repos/cloud/toolforge/registry-admission] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/registry-admission/-/merge_requests/29 (owner: 10dcaro) [18:39:31] (03update) 10raymond-ndibe: package: upgrade deps [repos/cloud/toolforge/volume-admission] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/35 (owner: 10dcaro) [18:39:42] (03update) 10raymond-ndibe: package: upgrade deps [repos/cloud/toolforge/volume-admission] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/35 (owner: 10dcaro) [18:39:44] (03approved) 10raymond-ndibe: package: upgrade deps [repos/cloud/toolforge/volume-admission] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/35 (owner: 10dcaro) [19:01:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node 
tools-k8s-worker-nfs-50 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [19:11:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-43 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [19:19:55] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#11182318 (10Anomie) At this point I have things mostly working using `webservice perl5.40 shell` to run things, after replacing some code that... 
[19:25:16] (03PS2) 10Majavah: inventory: Remove Bookworm based bastions [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1187762 (https://phabricator.wikimedia.org/T392510) [19:49:04] (03update) 10raymond-ndibe: [status] make job status an enum, with clearly defined states [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/208 (https://phabricator.wikimedia.org/T401172) [20:08:28] 10Toolforge (Toolforge iteration 24), 13Patch-For-Review: [builds-api, maintain-harbor] fix build/image cleanup - https://phabricator.wikimedia.org/T404157#11182525 (10Raymond_Ndibe) [20:09:23] 10Toolforge (Toolforge iteration 24), 13Patch-For-Review: [builds-api, maintain-harbor] fix build/image cleanup - https://phabricator.wikimedia.org/T404157#11182529 (10Raymond_Ndibe) [20:09:24] 06cloud-services-team, 10Toolforge: [components-api] rebuilds un-changed images - https://phabricator.wikimedia.org/T403167#11182530 (10Raymond_Ndibe) [20:58:29] 10Cloud-VPS (Project-requests): Request creation of gitlab-runners-staging VPS project - https://phabricator.wikimedia.org/T404386#11182769 (10Andrew) The zones were captured by an earlier failed project-creation attempt, so were owned by a nonexistent project which prevented them from being created in the real... [21:51:35] 06cloud-services-team, 10Toolforge: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957#11183018 (10Sakretsu) 05Resolved→03Open @fnegri @dcaro I have bad news, the job is stuck again. I'll leave it there for you. You can inspect it when you want. [22:48:37] 06cloud-services-team, 10Toolforge: [components-api] rebuilds un-changed images - https://phabricator.wikimedia.org/T403167#11183169 (10Raymond_Ndibe) Hello @DamianZaremba can you help with reproducing the error in the last message you sent? From my experience the only way this can happen is if you tried `tool... 
[23:01:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-43 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [23:11:25] 06cloud-services-team, 10Toolforge: [components-api] rebuilds un-changed images - https://phabricator.wikimedia.org/T403167#11183201 (10Raymond_Ndibe) >>! In T403167#11183169, @Raymond_Ndibe wrote: > Hello @DamianZaremba can you help with reproducing the error in the last message you sent? From my experience t... [23:36:06] 10Cloud-VPS (Quota-requests): Increase gitlab-runners-staging volumes to 12 - https://phabricator.wikimedia.org/T404668 (10dduvall) 03NEW [23:46:41] 06cloud-services-team, 10Toolforge: [components-api] rebuilds un-changed images - https://phabricator.wikimedia.org/T403167#11183296 (10Raymond_Ndibe) >>! In T403167#11183169, @Raymond_Ndibe wrote: > Hello @DamianZaremba can you help with reproducing the error in the last message you sent? From my experience t... [23:47:19] 06cloud-services-team, 10Toolforge: [components-api] rebuilds un-changed images - https://phabricator.wikimedia.org/T403167#11183297 (10Raymond_Ndibe) This is exactly the error message you got @DamianZaremba. @dcaro you should also see this