[00:10:51] Tool-Global-user-contributions: Add Flow contributions to GUC - https://phabricator.wikimedia.org/T114777#10640158 (Pppery) Open→Declined Not worth doing given the current status of Flow deployment
[00:53:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-19 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[01:34:36] VPS-project-Codesearch: Codesearches are timing out (2025-03-17) - https://phabricator.wikimedia.org/T389027 (Dylsss) NEW
[01:41:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-19 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[02:26:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-19 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[09:17:02] Toolforge (Toolforge iteration 18): [builds-api] Store the commit hash that was used for the build - https://phabricator.wikimedia.org/T389043 (dcaro) NEW
[09:17:43] Toolforge (Toolforge iteration 18): [components-api,buildsa-api] When building and deploying, if none of the settings changed, the jobs are not restarted - https://phabricator.wikimedia.org/T389044 (dcaro) NEW
[09:18:27] Toolforge (Toolforge iteration 18): [components-api,buildsa-api] When building and deploying, if none of the settings changed, the jobs are not restarted - https://phabricator.wikimedia.org/T389044#10640698 (dcaro)
[09:52:20] (approved) dcaro: Fix string lookup in get_versions [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/721 (owner: fnegri)
[09:52:23] (merge) dcaro: Fix string lookup in get_versions [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/721 (owner: fnegri)
[10:07:50] cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, Epic: Streamline WMCS Alerting and Paging - https://phabricator.wikimedia.org/T313444#10640899 (dcaro) I think we can close this and re-open whenever we do another focused push to it. There's been many improveme...
[10:07:57] cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, Epic: Streamline WMCS Alerting and Paging - https://phabricator.wikimedia.org/T313444#10640900 (dcaro) Open→Resolved
[10:19:36] !log aborrero@cloudcumin1001 cloudinfra START - Cookbook wmcs.vps.remove_user_from_project for user 'rook'
[10:19:44] !log aborrero@cloudcumin1001 cloudinfra END (PASS) - Cookbook wmcs.vps.remove_user_from_project (exit_code=0) for user 'rook'
[10:43:46] VPS-project-Codesearch: Codesearches are timing out (2025-03-17) - https://phabricator.wikimedia.org/T389027#10640983 (Ladsgroup) Are you sure? It works for me. I can ssh into the host and it's healthy: {F58850434} {F58850436}
[11:38:30] cloud-services-team, Data-Services: Drop views of module_deps tables - https://phabricator.wikimedia.org/T388982#10641242 (ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-views run by ladsgroup: Started updating wiki replica views
[11:43:18] cloud-services-team, Data-Services: Drop views of module_deps tables - https://phabricator.wikimedia.org/T388982#10641247 (ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-views started by ladsgroup completed: - an-redacteddb1001.eqiad.wmnet (**PASS**) - Ran Puppet agent - Ran 'maintain...
[11:56:37] (update) aborrero: Draft: test new project module [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/93 (https://phabricator.wikimedia.org/T375283) (owner: fnegri)
[12:23:28] (update) aborrero: Draft: test new project module [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/93 (https://phabricator.wikimedia.org/T375283) (owner: fnegri)
[12:28:14] (update) aborrero: Draft: test new project module [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/93 (https://phabricator.wikimedia.org/T375283) (owner: fnegri)
[12:30:04] Striker, Infrastructure-Foundations, LDAP, Patch-For-Review: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048#10641371 (MoritzMuehlenhoff) Wikimedia IDM/Bitu now stores the SUL name in LDAP under the wikimediaGlobalAcountName attribute. I'm...
[12:32:48] (update) aborrero: Draft: test new project module [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/93 (https://phabricator.wikimedia.org/T375283) (owner: fnegri)
[12:36:28] Striker, Infrastructure-Foundations, LDAP, Patch-For-Review: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048#10641385 (SLyngshede-WMF) Please note that we also have wikimediaGlobalAccountId which stores the account ID and usernames may change.
[12:36:57] (update) aborrero: Draft: test new project module [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/93 (https://phabricator.wikimedia.org/T375283) (owner: fnegri)
[12:40:25] (update) aborrero: Draft: test new project module [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/93 (https://phabricator.wikimedia.org/T375283) (owner: fnegri)
[12:46:41] (update) aborrero: Draft: test new project module [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/93 (https://phabricator.wikimedia.org/T375283) (owner: fnegri)
[12:48:01] (update) aborrero: Draft: test new project module [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/93 (https://phabricator.wikimedia.org/T375283) (owner: fnegri)
[13:25:59] (update) aborrero: Draft: test new project module [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/93 (https://phabricator.wikimedia.org/T375283) (owner: fnegri)
[13:30:07] cloud-services-team: Chuck Onwumelu internship: experiments with Toolsbeta and lima-kilo - https://phabricator.wikimedia.org/T386806#10641666 (Chuckonwumelu) Open→Resolved
[13:30:15] cloud-services-team: Chuck Onwumelu internship: experiments with Toolsbeta and lima-kilo - https://phabricator.wikimedia.org/T386806#10641669 (Chuckonwumelu)
[13:46:27] (update) dcaro: [maintain-harbor] increase milhistbot harbor quota to 2Gi [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/720 (https://phabricator.wikimedia.org/T388274) (owner: raymond-ndibe)
[13:46:39] (update) dcaro: [maintain-harbor] increase milhistbot harbor quota to 2Gi [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/720 (https://phabricator.wikimedia.org/T388274) (owner: raymond-ndibe)
[13:59:49] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 324 bytes in 60.010 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[14:01:47] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 54.071 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[14:31:51] cloud-services-team: Analyze Toolforge and Toolsbeta for Virtual Resource Usage - https://phabricator.wikimedia.org/T389081 (Chuckonwumelu) NEW
[14:51:00] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-75 (T388965)
[14:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:51:06] T388965: 2025-03-15 Tools NFS hiccup - https://phabricator.wikimedia.org/T388965
[14:52:39] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-75 (T388965)
[14:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:16:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-75 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:21:15] (update) raymond-ndibe: [maintain-harbor] increase milhistbot harbor quota to 2Gi [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/720 (https://phabricator.wikimedia.org/T388274)
[15:21:17] (approved) raymond-ndibe: [maintain-harbor] increase milhistbot harbor quota to 2Gi [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/720 (https://phabricator.wikimedia.org/T388274)
[15:21:55] (merge) raymond-ndibe: [maintain-harbor] increase milhistbot harbor quota to 2Gi [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/720 (https://phabricator.wikimedia.org/T388274)
[15:27:00] Striker, Bitu, CAS-SSO, Infrastructure-Foundations, Phabricator: Inconsistent mapping of Developer accounts and SUL accounts across Phabricator, Bitu, and Striker - https://phabricator.wikimedia.org/T388498#10642294 (SLyngshede-WMF) p:Triage→Low
[15:27:59] Toolforge (Quota-requests): Request increased quota for milhistbot toolforge tool - https://phabricator.wikimedia.org/T387950#10642302 (Raymond_Ndibe) @Hawkeye7 this has been resolved. `milhistbot` tool build quota has been increased to `2Gi`. Please let us know if there are any issues
[15:28:08] Toolforge (Quota-requests): Request increased quota for milhistbot toolforge tool - https://phabricator.wikimedia.org/T387950#10642304 (Raymond_Ndibe) Open→Resolved
[15:41:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[15:43:40] Striker, Infrastructure-Foundations, LDAP, Patch-For-Review: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048#10642400 (bd808) >>! In T148048#10641371, @MoritzMuehlenhoff wrote: > Wikimedia IDM/Bitu now stores the SUL name in LDAP under the...
[15:51:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-75 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:52:46] cloud-services-team: Analyze Toolforge and Toolsbeta for Virtual Resource Usage - https://phabricator.wikimedia.org/T389081#10642420 (aborrero) Some resources are automated using cookbooks and puppet, repos are here: * https://gerrit.wikimedia.org/g/cloud/wmcs-cookbooks * https://gerrit.wikimedia.org/r/plugi...
[16:05:29] VPS-project-Codesearch: Codesearches are timing out (2025-03-17) - https://phabricator.wikimedia.org/T389027#10642472 (Dzahn) Works normal for me.
[16:16:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-38 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[16:36:03] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-38 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce
[16:57:59] (update) dcaro: deployment: time out stuck deployment after 1h [repos/cloud/toolforge/components-api] (shorten_deploy_id) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/59
[17:11:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-57 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[17:18:47] (update) aborrero: Draft: test new project module [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/93 (https://phabricator.wikimedia.org/T375283) (owner: fnegri)
[17:40:39] (update) dcaro: deployment: time out stuck deployment after 1h [repos/cloud/toolforge/components-api] (shorten_deploy_id) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/59
[18:32:32] !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-10 (T383238)
[18:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:32:37] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[18:32:42] !log dcaro@acme tools END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for tools-k8s-worker-nfs-10 (T383238)
[18:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:36:06] !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-10 (T383238)
[18:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:37:51] !log dcaro@acme tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-nfs-10 (T383238)
[18:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:37:55] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[18:41:00] !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-10 (T383238)
[18:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:42:07] !log dcaro@acme tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-10 (T383238)
[18:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:00:02] !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-57 (T383238)
[19:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:00:06] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[19:01:36] !log dcaro@acme tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-57 (T383238)
[19:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:36:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[20:12:53] Toolforge (Toolforge iteration 18): [jobs-api] refactor models - https://phabricator.wikimedia.org/T389118 (Raymond_Ndibe) NEW
[20:14:52] (open) raymond-ndibe: [jobs-api] refactor imagename field in models [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/149 (https://phabricator.wikimedia.org/T389118)
[20:17:43] (update) raymond-ndibe: [jobs-api] refactor imagename field in models [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/149 (https://phabricator.wikimedia.org/T389118)
[22:20:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-54 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[22:26:20] (open) raymond-ndibe: [jobs-cli] use imagename in definedjob [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/90 (https://phabricator.wikimedia.org/T389118)
[23:05:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-54 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses