[01:30:25] VPS-project-Codesearch, collaboration-services: codesearch-write-config cronjob failing since 15 Dec: "RuntimeError: Unsure how to handle URL: https://codeberg.org/chdorner/CheckRegistrationEmailDomains" - https://phabricator.wikimedia.org/T383192 (Krinkle) NEW
[01:37:32] VPS-project-Codesearch, collaboration-services: codesearch-write-config cronjob failing since 15 Dec: "RuntimeError: Unsure how to handle URL: https://codeberg.org/chdorner/CheckRegistrationEmailDomains" - https://phabricator.wikimedia.org/T383192#10439507 (Krinkle) The stack trace points to `File "/srv/...
[01:46:52] (PS1) Krinkle: write_config: Change "unhandled URL" from hard fail to printed message [labs/codesearch] - https://gerrit.wikimedia.org/r/1108867 (https://phabricator.wikimedia.org/T383192)
[01:47:00] (CR) Krinkle: [C:+2] write_config: Change "unhandled URL" from hard fail to printed message [labs/codesearch] - https://gerrit.wikimedia.org/r/1108867 (https://phabricator.wikimedia.org/T383192) (owner: Krinkle)
[01:47:43] (PS2) Krinkle: write_config: Change "unhandled URL" from hard fail to printed message [labs/codesearch] - https://gerrit.wikimedia.org/r/1108867 (https://phabricator.wikimedia.org/T383192)
[01:47:47] (CR) Krinkle: [C:+2] write_config: Change "unhandled URL" from hard fail to printed message [labs/codesearch] - https://gerrit.wikimedia.org/r/1108867 (https://phabricator.wikimedia.org/T383192) (owner: Krinkle)
[01:48:40] (Merged) jenkins-bot: write_config: Change "unhandled URL" from hard fail to printed message [labs/codesearch] - https://gerrit.wikimedia.org/r/1108867 (https://phabricator.wikimedia.org/T383192) (owner: Krinkle)
[03:33:48] Cloud-VPS (Project-requests): Request creation of my first testing on linux VPS project - https://phabricator.wikimedia.org/T383197 (Gowthamkodali27real) NEW
[03:46:29] Cloud-VPS (Project-requests): Request creation of my first testing on linux VPS project - 
https://phabricator.wikimedia.org/T383197#10439640 (Pppery) Open→Declined a:Gowthamkodali27real→None This is not at all what Wikimedia cloud VPS is for.
[03:48:00] (update) raymond-ndibe: [maintain-harbor] persist log [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/42 (https://phabricator.wikimedia.org/T383081)
[07:12:22] FIRING: HAProxyBackendUnavailable: HAProxy service nova-metadata-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[07:36:07] Toolforge (Toolforge iteration 16), Patch-For-Review, Upstream: [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417#10439779 (Slst2020) Harbor v2.12 has been [[ https://github.com/goharbor/harbor/releases | released ]]. We need to test if the new en...
[07:38:20] cloud-services-team, Toolforge: [maintain-harbor] Have maintain-harbor use a robot account - https://phabricator.wikimedia.org/T361698#10439783 (Slst2020) We can now test if the [[ https://github.com/goharbor/harbor/releases/tag/v2.12.0 | v2.12 release ]] makes this possible.
[07:41:57] Toolforge (Toolforge iteration 16), Patch-For-Review, Upstream: [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417#10439787 (Slst2020) a:Slst2020→None
[08:42:01] Cloud Services Proposals, cloud-services-team, Data-Persistence, Data-Platform-SRE (2024.11.30 - 2024.12.20): Decision request - Who runs wikireplicas cookbooks - https://phabricator.wikimedia.org/T382607#10439958 (Gehel)
[09:04:01] Cloud-VPS (Project-requests): Request creation of my first testing on linux VPS project - https://phabricator.wikimedia.org/T383197#10439985 (Aklapper) We generally do not grant Cloud VPS projects for single user development use or for non-Wikimedia related use. #Cloud-VPS virtual machines are a constrai...
[09:11:47] cloud-services-team, Toolforge, Epic: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools - https://phabricator.wikimedia.org/T127367#10439991 (dcaro)
[09:13:29] FIRING: PuppetAgentNoResources: No Puppet resources found on instance tools-prometheus-7 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[09:15:28] cloud-services-team, Toolforge, Epic: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools - https://phabricator.wikimedia.org/T127367#10439995 (dcaro) Just updated a bit the task requirements to reflect the current status of toolforge, we might want to cre...
[09:16:07] cloud-services-team, Toolforge, Epic: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools - https://phabricator.wikimedia.org/T127367#10439996 (dcaro)
[09:18:29] FIRING: PuppetAgentNoResources: No Puppet resources found on instance metricsinfra-prometheus-2 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[09:20:13] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [openstack] 2025-01-08 nova-api-metadata.service down on cloudcontrol1005 - https://phabricator.wikimedia.org/T383203 (dcaro) NEW p:Triage→Medium
[09:22:22] RESOLVED: HAProxyBackendUnavailable: HAProxy service nova-metadata-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[09:28:02] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [openstack] 2025-01-08 nova-api-metadata.service down on cloudcontrol1005 - https://phabricator.wikimedia.org/T383203#10440012 (dcaro) Last log I see in `systemctl status nova-api-metadata` is: `...
[09:28:29] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance tools-prometheus-6 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[09:30:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-prometheus-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[09:33:29] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance metricsinfra-prometheus-2 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[10:24:58] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance metricsinfra-prometheus-3 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[10:25:58] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance tools-prometheus-6 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[10:27:58] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-prometheus-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[11:52:38] cloud-services-team, Toolforge, Epic: [WIP] Toolforge UI: Investigate integration of Striker functionality - https://phabricator.wikimedia.org/T383146#10440408 (Sarai-WMF) Hey @bd808 and @taavi. Thank you both for sharing valuable information and resources regarding Striker's modernization, I'll make...
[12:48:44] Toolforge (Toolforge iteration 17), Patch-For-Review: [components-api] add basic prometheus instrumentation - https://phabricator.wikimedia.org/T381249#10440570 (dcaro)
[12:48:47] Toolforge (Toolforge iteration 17), Patch-For-Review: [components-api, components-cli] deploy-token: separate create from update - https://phabricator.wikimedia.org/T380706#10440572 (dcaro)
[12:48:54] cloud-services-team, Toolforge (Toolforge iteration 17): [jobs-api,jobs-emailer] Prometheus monitoring toolforge-jobs server side components - https://phabricator.wikimedia.org/T320284#10440580 (dcaro)
[12:48:56] Toolforge (Toolforge iteration 17), Patch-For-Review: [jobs-cli,jobs-api] quota shows different units for limit and usage - https://phabricator.wikimedia.org/T361120#10440578 (dcaro)
[12:48:57] cloud-services-team, Toolforge (Toolforge iteration 17), Kubernetes, Patch-For-Review: [jobs-api] Allow Toolforge scheduled jobs to have a maximum runtime - https://phabricator.wikimedia.org/T306391#10440574 (dcaro)
[12:48:58] Toolforge (Toolforge iteration 17), Patch-For-Review: [jobs-emailer] http requests are blocked by the loops - https://phabricator.wikimedia.org/T379924#10440576 (dcaro)
[12:49:10] Toolforge (Toolforge iteration 17): [builds-cli,builds-api] `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701#10440582 (dcaro)
[12:49:14] Toolforge (Toolforge iteration 17), Patch-For-Review, Upstream: [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417#10440586 (dcaro)
[12:49:16] Toolforge (Toolforge iteration 17), Documentation: [harbor,docs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092#10440593 (dcaro)
[12:49:17] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 17), Goal: [infra] Decommission the Grid Engine infrastructure - 
https://phabricator.wikimedia.org/T314664#10440591 (dcaro)
[12:49:19] cloud-services-team, Toolforge (Toolforge iteration 17), Patch-For-Review: Toolforge: Replace all bastion with grid-less bookworm based bastion hosts - https://phabricator.wikimedia.org/T314665#10440589 (dcaro)
[12:49:21] Toolforge (Toolforge iteration 17), Upstream: [builds-builder,jobs-api,upstream] Calling nontrivial Procfile commands with arguments results in confusing error (“no such file or directory”) - https://phabricator.wikimedia.org/T356016#10440595 (dcaro)
[12:49:23] Toolforge (Toolforge iteration 17), Upstream: [builds-builder] golang based images get infinite nested loops for procfile entries - https://phabricator.wikimedia.org/T363417#10440597 (dcaro)
[12:49:29] Toolforge (Toolforge iteration 17): [toolforge] simplify calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377#10440599 (dcaro)
[12:49:46] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 17), Epic: [Hypotesis] 6.3.5 Develop the sustainability score - https://phabricator.wikimedia.org/T376896#10440607 (dcaro)
[12:49:48] cloud-services-team, Toolforge (Toolforge iteration 17), Patch-For-Review: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.28 - https://phabricator.wikimedia.org/T362867#10440605 (dcaro)
[12:49:49] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 17), Patch-For-Review: [components-api] Add functional tests for the components api - https://phabricator.wikimedia.org/T379092#10440603 (dcaro)
[12:49:53] Toolforge (Toolforge iteration 17), Patch-For-Review: [jobs-api,jobs-cli] restarting a continuous jobs causes for some seconds two jobs are running side by side - https://phabricator.wikimedia.org/T375366#10440609 (dcaro)
[12:49:57] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 17): Intermittent redis connection timeouts in 
Toolforge - https://phabricator.wikimedia.org/T318479#10440611 (dcaro)
[12:50:01] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 17), Epic: [components-api] First iteration of the component API - https://phabricator.wikimedia.org/T362051#10440617 (dcaro)
[12:50:05] Toolforge (Toolforge iteration 17), Patch-For-Review: [jobs-api] Split the API, business, and k8s models - https://phabricator.wikimedia.org/T359808#10440613 (dcaro)
[12:50:10] Toolforge (Toolforge iteration 17), Patch-For-Review: [jobs-api] Save business models in a DB - https://phabricator.wikimedia.org/T359650#10440615 (dcaro)
[12:50:14] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 17), Epic: [Hypothesis] WE6.3.4 If we enable the automatic deployment of a minimal tool, we will be able to evaluate the end to end flow and set the groundwork for adding support f... - https://phabricator.wikimedia.org/T375199#10440619
[12:50:41] cloud-services-team, Toolforge (Toolforge iteration 17): [infra,k8s] remove deprecated kubelet flags before 1.28 upgrade (we might be able to remove all custom ones) - https://phabricator.wikimedia.org/T370245#10440623 (dcaro)
[12:50:42] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 17): [infra,k8s] Upgrade Toolforge Kubernetes to version 1.29 - https://phabricator.wikimedia.org/T362868#10440624 (dcaro)
[12:50:46] Toolforge (Toolforge iteration 17): [usage] Try to get an idea of the amount of tools that were created, but never started anything - https://phabricator.wikimedia.org/T379144#10440627 (dcaro)
[12:50:48] Toolforge (Toolforge iteration 17), Patch-For-Review: Persist maintain-harbor logs - https://phabricator.wikimedia.org/T383081#10440625 (dcaro)
[12:50:49] cloud-services-team, Toolforge (Toolforge iteration 17), Sustainability (Incident Followup): [docs,envvars-api,jobs-api,builds-api] create docs on how to operate the cluster and core 
components - https://phabricator.wikimedia.org/T380959#10440626 (dcaro)
[12:50:51] cloud-services-team, Toolforge (Toolforge iteration 17): [jobs-api,jobs-cli] Introduce health checks for Toolforge Jobs Framework cronjobs - https://phabricator.wikimedia.org/T377420#10440628 (dcaro)
[12:50:52] Toolforge (Toolforge iteration 17): [jobs-api] prepend date and pod name to filelog lines - https://phabricator.wikimedia.org/T372025#10440629 (dcaro)
[12:50:54] Toolforge (Toolforge iteration 17): Support HTTP health checks in jobs framework - https://phabricator.wikimedia.org/T362621#10440630 (dcaro)
[12:50:57] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS (Debian Buster Deprecation), Toolforge (Toolforge iteration 17), Epic, Goal: Toolforge: migrate to Debian Bullseye or later - https://phabricator.wikimedia.org/T311897#10440631 (dcaro)
[12:58:58] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T309789)
[12:59:02] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T309789)
[12:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:59:05] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[12:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:59:39] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T309789)
[12:59:43] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T309789)
[12:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:04:28] (PS1) David Caro: inventory: update ceph mon nodes [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1109061
[13:04:43] !log dcaro@urcuchillay admin START - Cookbook 
wmcs.ceph.osd.depool_and_destroy (T309789)
[13:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:04:49] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[13:06:34] (CR) David Caro: [C:+2] gitlab: add a note of maybe using the wrong branch [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1090431 (owner: David Caro)
[13:06:41] (CR) CI reject: [V:-1] gitlab: add a note of maybe using the wrong branch [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1090431 (owner: David Caro)
[13:08:54] (CR) CI reject: [V:-1] inventory: update ceph mon nodes [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1109061 (owner: David Caro)
[13:12:15] (PS2) David Caro: inventory: update ceph mon nodes [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1109061
[13:17:31] cloud-services-team, Toolforge, Epic: [WIP] Toolforge UI: Investigate integration of Striker functionality - https://phabricator.wikimedia.org/T383146#10440707 (Sarai-WMF)
[13:52:34] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [openstack] 2025-01-08 nova-api-metadata.service down on cloudcontrol1005 - https://phabricator.wikimedia.org/T383203#10440915 (taavi)
[14:00:50] FIRING: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:08:14] (CR) David Caro: [C:+2] inventory: update ceph mon nodes [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1109061 (owner: David Caro)
[14:12:34] (Merged) jenkins-bot: inventory: update ceph mon nodes [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1109061 (owner: 
David Caro)
[14:18:03] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T309789)
[14:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:18:09] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[14:20:51] RESOLVED: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:25:28] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-70
[14:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:25:38] !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-70
[14:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:27:38] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-70
[14:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:33:22] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-70
[14:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:51:14] (update) raymond-ndibe: [maintain-harbor] persist log [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/42 (https://phabricator.wikimedia.org/T383081)
[14:52:34] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10441189 (gkyziridis) I am a new user, I followed the instructions. 
|**Wikitech account/LDAP:**| Gkyziridis| |**SUL account**| Gkyziridis| |**Account linked on [[ https://idm.wikimedia.o...
[14:53:17] (update) raymond-ndibe: [maintain-harbor] persist log [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/42 (https://phabricator.wikimedia.org/T383081)
[15:00:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-45 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:00:29] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-45
[15:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:05:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-46 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:08:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-45 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:09:31] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-45
[15:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:10:03] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-19 has at least 12 procs in D state, and may be having NFS/IO issues - 
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce
[15:13:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-45 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:13:38] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238 (dcaro) NEW p:Triage→High
[15:14:32] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 17), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238#10441294 (dcaro)
[15:14:45] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 17), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238#10441297 (dcaro) Open→In progress
[15:19:18] cloud-services-team, Cloud-VPS, Puppet: Preserve formatting and comments etc. in ENC Hiera - https://phabricator.wikimedia.org/T250622#10441312 (Andrew) I would very much like this to work and I also don't immediately know how to do it :(
[15:19:53] cloud-services-team, Cloud-VPS, Puppet: Preserve formatting and comments etc. 
in ENC Hiera - https://phabricator.wikimedia.org/T250622#10441315 (joanna_borun) p:Triage→Medium
[15:21:17] cloud-services-team, DC-Ops, ops-eqiad, SRE: Repurpose 5 config B servers - https://phabricator.wikimedia.org/T380805#10441317 (joanna_borun) p:Triage→Medium
[15:21:38] cloud-services-team, DC-Ops, ops-eqiad, SRE: Repurpose 5 config B servers - https://phabricator.wikimedia.org/T380805#10441319 (Andrew) Two of these are now intended for https://phabricator.wikimedia.org/T382356
[15:22:08] cloud-services-team, Toolforge: jobs-api: Impersonate user instead of loading certs from NFS - https://phabricator.wikimedia.org/T380890#10441321 (dcaro) p:Triage→Medium
[15:23:06] cloud-services-team, Cloud-VPS, Documentation, Sustainability (Incident Followup): Add runbook to ProjectProxyMainProxyDown, and reconsider severity - https://phabricator.wikimedia.org/T381107#10441323 (joanna_borun) p:Triage→Medium
[15:23:27] cloud-services-team, Cloud-VPS, Documentation, Sustainability (Incident Followup): Add runbook to ProjectProxyMainProxyDown, and reconsider severity - https://phabricator.wikimedia.org/T381107#10441324 (joanna_borun) p:Medium→High
[15:23:37] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-22
[15:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:23:58] cloud-services-team, Cloud-VPS, Patch-For-Review: Drop support for VMs with .wmflabs FQDNs - https://phabricator.wikimedia.org/T380679#10441326 (Andrew) a:Andrew
[15:24:19] cloud-services-team, Cloud-VPS, Patch-For-Review: Drop support for VMs with .wmflabs FQDNs - https://phabricator.wikimedia.org/T380679#10441327 (joanna_borun) Stalled→In progress p:Triage→Medium
[15:25:10] cloud-services-team, Toolforge: Can't pip install mysqlclient on Toolforge - https://phabricator.wikimedia.org/T349341#10441331 (joanna_borun) 
p:Triage→Low
[15:25:56] cloud-services-team, Cloud-VPS: Do not create DNS zones for projects outside default domain - https://phabricator.wikimedia.org/T380095#10441332 (joanna_borun) p:Triage→Low
[15:26:16] cloud-services-team, Cloud-VPS: Do not create DNS zones for projects outside default domain - https://phabricator.wikimedia.org/T380095#10441337 (Andrew) a:Andrew assigning to myself to make sure this is already fixed :)
[15:26:33] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-38 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[15:26:42] Data-Services, DBA, Privacy Engineering: Create views for SecurePoll db tables in Toolforge replicas - https://phabricator.wikimedia.org/T381197#10441340 (joanna_borun)
[15:26:44] cloud-services-team, Cloud-VPS: Do not create DNS zones for projects outside default domain - https://phabricator.wikimedia.org/T380095#10441342 (taavi) And we should probably delete all the zones created by this
[15:27:06] Data-Services, Data-Engineering, DBA, Privacy Engineering: Create views for SecurePoll db tables in Toolforge replicas - https://phabricator.wikimedia.org/T381197#10441346 (joanna_borun)
[15:27:23] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T381602#10441347 (dcaro) Open→Resolved a:dcaro
[15:28:25] cloud-services-team, Toolforge: toolforge jobs load errors with 404 repetatively - https://phabricator.wikimedia.org/T381273#10441355 (joanna_borun) p:Triage→Medium
[15:29:22] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-22
[15:29:24] Logged the 
message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:29:35] cloud-services-team, Toolforge, wikitech.wikimedia.org, Diffusion, and 2 others: Document diffusion->github mirroring to https://github.com/toolforge/ on wikitech - https://phabricator.wikimedia.org/T361859#10441361 (joanna_borun) p:Triage→Medium
[15:29:39] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-42
[15:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:30:54] cloud-services-team, Toolforge: [builds-builder] Add support for Heroku's "24" builder stack based on Ubuntu 2024.04 noble - https://phabricator.wikimedia.org/T380127#10441375 (joanna_borun) p:Triage→Medium
[15:31:33] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-35 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[15:32:05] cloud-services-team: SystemdUnitDown The systemd unit purge_vm_rbd_images.service on node cloudcontrol1005 has been failing for more than two hours. - https://phabricator.wikimedia.org/T382770#10441388 (Andrew) Open→Resolved a:Andrew This is no longer happening.
[15:33:54] cloud-services-team, Cloud-VPS, Library-Card-Platform, Moderator-Tools-Team: The Wikipedia Library emails aren't being received by @wikimedia.org email inboxes - https://phabricator.wikimedia.org/T382314#10441400 (fnegri) Which SMTP server are you using right now that is failing?
[15:35:06] Data-Services, Data-Engineering-Icebox, Patch-For-Review: Log_param is redacted in wiki replica when only comment and/or user should be - https://phabricator.wikimedia.org/T301943#10441407 (Andrew) *bump* This is a data engineering task but it's pretty simple isn't it?
[15:35:11] Data-Services, Data-Engineering-Icebox, Patch-For-Review: Log_param is redacted in wiki replica when only comment and/or user should be - https://phabricator.wikimedia.org/T301943#10441408 (joanna_borun)
[15:35:22] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-42
[15:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:35:30] cloud-services-team, Cloud-VPS, Toolforge, Kubernetes: Allow Toolforge roots to use the cookbook to reboot k8s worker nodes (without wmcs-root) - https://phabricator.wikimedia.org/T382977#10441411 (joanna_borun) p:Triage→Medium
[15:36:21] cloud-services-team, Cloud-VPS, Beta-Cluster-Infrastructure: prometheus-openstack-stale-puppet-certs crashing on deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T383153#10441417 (Andrew) Open→Resolved
[15:37:03] cloud-services-team (Hardware), Cloud-VPS, DC-Ops, ops-eqiad, SRE: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10441433 (joanna_borun) p:Triage→High
[15:37:04] cloud-services-team, Horizon: Clean up horizon/deploy branches - https://phabricator.wikimedia.org/T382957#10441434 (Andrew) Open→Resolved
[15:37:38] cloud-services-team, Toolforge, Epic: [WIP] Toolforge UI: Investigate integration of Striker functionality - https://phabricator.wikimedia.org/T383146#10441441 (joanna_borun) p:Triage→High
[15:38:45] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-38
[15:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:40:30] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-38
[15:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:40:51] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-36
[15:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:41:33] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-22 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[15:43:01] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, PAWS, Patch-For-Review: Restrict outbound connectivity from PAWS hosts - https://phabricator.wikimedia.org/T381373#10441468 (rook) In progress→Resolved
[15:45:47] PAWS: Upgrade to k8s 1.28 - https://phabricator.wikimedia.org/T381503#10441483 (rook)
[15:45:49] cloud-services-team, Cloud-VPS: Upgrade cloud-vps openstack to version 'Dalmation' - https://phabricator.wikimedia.org/T381499#10441484 (rook)
[15:46:05] PAWS: Upgrade to k8s 1.28 - https://phabricator.wikimedia.org/T381503#10441485 (rook) Open→Stalled
[15:46:33] FIRING: [6x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-22 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [15:46:39] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-36 [15:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:53:25] 10PAWS, 10Pywikibot, 10Pywikibot-login.py, 07Pywikibot-Wikidata: Querying wikidata with pywikibot fails for items with images when user is not registered for commons - https://phabricator.wikimedia.org/T168222#10441524 (10rook) Pulse check, is this still happening? [15:55:00] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-58 (T383238) [15:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:55:04] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238 [15:57:43] 10PAWS, 10LibUp: LibUp bot opening multiple upgrade notices for same lib - https://phabricator.wikimedia.org/T340979#10441547 (10rook) This hasn't happened in awhile. Seems resolved. [15:57:52] 10PAWS, 10LibUp: LibUp bot opening multiple upgrade notices for same lib - https://phabricator.wikimedia.org/T340979#10441548 (10rook) 05Open→03Resolved [16:00:39] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-58 (T383238) [16:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:00:44] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238 [16:02:40] 10wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10441563 (10Reedy) ` reedy@deploy2002:~$ mwscript extensions/CentralAuth/maintenance/createLocalAccount.php --wiki=labswiki "Gkyziridis" DEPRECATION WARNING: Maintenance scripts are moving to... 
[16:20:13] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-35 (T383238)
[16:20:18] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[16:22:36] (update) raymond-ndibe: [maintain-harbor] persist log [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/42 (https://phabricator.wikimedia.org/T383081)
[16:24:08] (update) raymond-ndibe: [maintain-harbor] persist log [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/42 (https://phabricator.wikimedia.org/T383081)
[16:25:37] cloud-services-team, Toolforge, Elasticsearch, Epic: Deploy multi-tenant OpenSearch cluster as replacement for Elasticsearch - https://phabricator.wikimedia.org/T348943#10441659 (dcausse) Linking T379288 since we might explore security features too for the (upcoming) WMF internal opensearch clust...
[16:25:38] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-35 (T383238)
[16:25:42] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[16:32:48] Toolforge (Toolforge iteration 17), Patch-For-Review: Persist maintain-harbor logs - https://phabricator.wikimedia.org/T383081#10441690 (Raymond_Ndibe) Open→In progress
[16:33:23] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-72 (T383238)
[16:33:27] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[16:38:45] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-72 (T383238)
[16:38:49] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[16:45:47] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-65 (T383238)
[16:45:51] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[16:51:09] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-65 (T383238)
[16:51:13] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[16:52:11] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-57 (T383238)
[16:54:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-35 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[16:57:32] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-57 (T383238)
[16:57:36] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[17:00:24] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10441804 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcontrol1011.eqiad.wmnet with OS bookworm
[17:01:08] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-48 (T383238)
[17:06:33] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-48 (T383238)
[17:06:38] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[17:11:23] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-12 (T383238)
[17:14:16] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-12 (T383238)
[17:14:20] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[17:22:08] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-44 (T383238)
[17:22:13] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[17:22:16] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 17), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238#10441891 (dcaro)
[17:27:31] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-44 (T383238)
[17:27:34] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[17:28:39] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-76 (T383238)
[17:29:30] VPS-Projects, Content-Transform-Team-WIP, Parsoid, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): upgrade nodejs on parsing-qa-02 - https://phabricator.wikimedia.org/T349941#10441909 (ssastry)
[17:33:59] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-76 (T383238)
[17:34:03] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[17:34:10] VPS-Projects, Content-Transform-Team-WIP, Parsoid, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): Create a bookwork-imaged VM ctt-qa-03 to replace parsing-qa-03 - https://phabricator.wikimedia.org/T383249 (ssastry) NEW
[17:35:52] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [openstack] 2025-01-08 nova-api-metadata.service down on cloudcontrol1005 - https://phabricator.wikimedia.org/T383203#10441945 (dcaro) btw. restarting the service made it come back...
[17:35:54] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-26 (T383238)
[17:39:43] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10441962 (Jclark-ctr)
[17:40:15] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10441967 (Jclark-ctr) ` Failed to load ldlinux.c32 Boot failed: press a key to retry, or wait for reset... .............. ` downgraded firmware on nic and lo...
[17:41:14] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-26 (T383238)
[17:41:19] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[17:41:59] VPS-Projects, Content-Transform-Team-WIP, Parsoid, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): Request increased quota for wikitextexp Cloud VPS project - https://phabricator.wikimedia.org/T383251 (ssastry) NEW
[17:43:07] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-37 (T383238)
[17:44:11] VPS-Projects, Content-Transform-Team-WIP, Parsoid, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): Request increased quota for wikitextexp Cloud VPS project - https://phabricator.wikimedia.org/T383251#10441998 (ssastry) Can I use this occasion to bump up the VM capacit...
[17:48:29] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-37 (T383238)
[17:48:30] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-67 (T383238)
[17:48:33] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[17:50:36] VPS-Projects, Content-Transform-Team-WIP, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): Request increased quota for wikitextexp Cloud VPS project - https://phabricator.wikimedia.org/T383251#10442034 (ssastry)
[17:53:51] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-67 (T383238)
[17:53:53] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-27 (T383238)
[17:53:55] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[17:59:11] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-27 (T383238)
[17:59:12] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-8 (T383238)
[17:59:15] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[18:00:55] VPS-Projects, Content-Transform-Team-WIP, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): Request increased quota for wikitextexp Cloud VPS project - https://phabricator.wikimedia.org/T383251#10442077 (ssastry) I'm going to file a separate ticket for quota increases.
[18:01:23] VPS-Projects, Content-Transform-Team-WIP, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): Request increased quota for wikitextexp Cloud VPS project - https://phabricator.wikimedia.org/T383251#10442081 (ssastry)
[18:04:26] VPS-Projects, Content-Transform-Team-WIP, Parsoid, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): Bump up quota for wikitexexp to let us spin up a more powerful test server - https://phabricator.wikimedia.org/T383252 (ssastry) NEW
[18:04:31] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-8 (T383238)
[18:04:32] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-41 (T383238)
[18:04:36] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[18:05:24] VPS-Projects, Content-Transform-Team-WIP, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): Bump up quota for wikitexexp to let us spin up a more powerful test server - https://phabricator.wikimedia.org/T383252#10442112 (ssastry)
[18:06:09] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-41 (T383238)
[18:06:10] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-47 (T383238)
[18:07:57] VPS-Projects, Content-Transform-Team-WIP, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): Bump up quota for wikitexexp to let us spin up a more powerful test server - https://phabricator.wikimedia.org/T383252#10442130 (dcaro) +1
[18:07:59] VPS-Projects, Content-Transform-Team-WIP, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): Request increased quota for wikitextexp Cloud VPS project - https://phabricator.wikimedia.org/T383251#10442132 (Andrew) +1
[18:08:07] VPS-Projects, Content-Transform-Team-WIP, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): Bump up quota for wikitexexp to let us spin up a more powerful test server - https://phabricator.wikimedia.org/T383252#10442134 (Andrew) +1
[18:08:10] VPS-Projects, Content-Transform-Team-WIP, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): Request increased quota for wikitextexp Cloud VPS project - https://phabricator.wikimedia.org/T383251#10442136 (dcaro) ?1
[18:09:09] VPS-Projects, Content-Transform-Team-WIP, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): Request increased quota for wikitextexp Cloud VPS project - https://phabricator.wikimedia.org/T383251#10442139 (Andrew) this bump is temporary for a rebuild and can be reverted after....
[18:09:10] Cloud-VPS (Quota-requests), Content-Transform-Team-WIP, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): Bump up quota for wikitexexp to let us spin up a more powerful test server - https://phabricator.wikimedia.org/T383252#10442140 (JJMC89)
[18:12:22] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-47 (T383238)
[18:12:23] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-1 (T383238)
[18:12:27] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[18:12:32] !log andrew@cloudcumin1001 wikitextexp START - Cookbook wmcs.openstack.quota_increase
[18:12:40] !log andrew@cloudcumin1001 wikitextexp END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0)
[18:14:09] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-1 (T383238)
[18:14:11] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-17 (T383238)
[18:14:52] !log andrew@cloudcumin1001 wikitextexp START - Cookbook wmcs.openstack.quota_increase (T383251)
[18:14:54] T383251: Request increased quota for wikitextexp Cloud VPS project - https://phabricator.wikimedia.org/T383251
[18:15:00] !log andrew@cloudcumin1001 wikitextexp END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0) (T383251)
[18:16:17] !log andrew@cloudcumin1001 wikitextexp START - Cookbook wmcs.openstack.quota_increase (T383252)
[18:16:19] T383252: Bump up quota for wikitexexp to let us spin up a more powerful test server - https://phabricator.wikimedia.org/T383252
[18:16:25] !log andrew@cloudcumin1001 wikitextexp END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0) (T383252)
[18:17:14] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10442183 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcontrol1011.eqiad.wmnet with OS bookworm complete...
[18:18:00] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10442184 (Jclark-ctr)
[18:18:07] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10442195 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[18:19:44] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-17 (T383238)
[18:19:49] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[18:24:25] cloud-services-team, Toolforge, Epic: [WIP] Toolforge UI: Investigate integration of Striker functionality - https://phabricator.wikimedia.org/T383146#10442221 (dcaro)
[18:25:01] cloud-services-team, Toolforge, Epic: [WIP] Toolforge UI: Investigate integration of Striker functionality - https://phabricator.wikimedia.org/T383146#10442225 (dcaro) Reworded the choices to reflect that's two things being decided at most, not one.
[18:26:13] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-32 (T383238)
[18:26:17] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[18:26:38] (PS4) Majavah: Partially convert toolinfo to Codex [labs/striker] - https://gerrit.wikimedia.org/r/1108150 (https://phabricator.wikimedia.org/T380114)
[18:26:39] (PS4) Majavah: Use Codex-based layout instead of Bootstrap container-fluid [labs/striker] - https://gerrit.wikimedia.org/r/1108151 (https://phabricator.wikimedia.org/T380114)
[18:26:39] (PS4) Majavah: Convert notices on tool page to Codex cards [labs/striker] - https://gerrit.wikimedia.org/r/1108152 (https://phabricator.wikimedia.org/T380114)
[18:26:39] (PS4) Majavah: Convert tool metadata to Codex tables [labs/striker] - https://gerrit.wikimedia.org/r/1108153 (https://phabricator.wikimedia.org/T380114)
[18:26:39] (PS4) Majavah: Drop bootstrap-theme.min.css [labs/striker] - https://gerrit.wikimedia.org/r/1108154
[18:26:42] (PS4) Majavah: Convert SSH key view to Codex [labs/striker] - https://gerrit.wikimedia.org/r/1108155 (https://phabricator.wikimedia.org/T380114)
[18:28:49] cloud-services-team, Toolforge, Epic: [WIP] Toolforge UI: Investigate integration of Striker functionality - https://phabricator.wikimedia.org/T383146#10442257 (dcaro)
[18:29:20] (CR) Majavah: [C:+2] Partially convert toolinfo to Codex [labs/striker] - https://gerrit.wikimedia.org/r/1108150 (https://phabricator.wikimedia.org/T380114) (owner: Majavah)
[18:29:27] (CR) Majavah: [C:+2] Use Codex-based layout instead of Bootstrap container-fluid [labs/striker] - https://gerrit.wikimedia.org/r/1108151 (https://phabricator.wikimedia.org/T380114) (owner: Majavah)
[18:29:31] (CR) Majavah: [C:+2] Convert notices on tool page to Codex cards [labs/striker] - https://gerrit.wikimedia.org/r/1108152 (https://phabricator.wikimedia.org/T380114) (owner: Majavah)
[18:29:50] cloud-services-team, Toolforge, Epic: [WIP] Toolforge UI: Investigate integration of Striker functionality - https://phabricator.wikimedia.org/T383146#10442260 (dcaro)
[18:30:41] (Merged) jenkins-bot: Partially convert toolinfo to Codex [labs/striker] - https://gerrit.wikimedia.org/r/1108150 (https://phabricator.wikimedia.org/T380114) (owner: Majavah)
[18:30:45] (Merged) jenkins-bot: Use Codex-based layout instead of Bootstrap container-fluid [labs/striker] - https://gerrit.wikimedia.org/r/1108151 (https://phabricator.wikimedia.org/T380114) (owner: Majavah)
[18:30:48] (Merged) jenkins-bot: Convert notices on tool page to Codex cards [labs/striker] - https://gerrit.wikimedia.org/r/1108152 (https://phabricator.wikimedia.org/T380114) (owner: Majavah)
[18:33:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-32 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[18:34:18] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-32 (T383238)
[18:34:19] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-43 (T383238)
[18:34:22] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[18:38:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-32 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[18:39:32] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-43 (T383238)
[18:39:36] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[18:53:12] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: Remove hardcoded NFT rules related to PAWS workers - https://phabricator.wikimedia.org/T383261 (fnegri) NEW
[18:54:21] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: Remove hardcoded NFT rules related to PAWS workers - https://phabricator.wikimedia.org/T383261#10442380 (fnegri) p:Triage→Medium
[19:59:14] FIRING: KernelError: Server cloudcontrol1011 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudcontrol1011 - https://alerts.wikimedia.org/?q=alertname%3DKernelError
[19:59:14] FIRING: KernelWarning: Server cloudcontrol1011 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudcontrol1011 - https://alerts.wikimedia.org/?q=alertname%3DKernelWarning
[19:59:19] cloud-services-team: KernelError Server cloudcontrol1011 may have kernel errors - https://phabricator.wikimedia.org/T383270 (phaultfinder) NEW
[20:16:22] Cloud-VPS (Quota-requests), Content-Transform-Team-WIP, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): Bump up quota for wikitexexp to let us spin up a more powerful test server - https://phabricator.wikimedia.org/T383252#10442643 (Andrew) Open→Resolved a:A...
[20:16:24] VPS-Projects, Content-Transform-Team-WIP, Essential-Work, Parsoid-Read-Views (Phase 1 - DiscussionTools support): Request increased quota for wikitextexp Cloud VPS project - https://phabricator.wikimedia.org/T383251#10442646 (Andrew) Open→Resolved a:Andrew
[21:02:52] cloud-services-team, Toolforge, Epic: [WIP] Toolforge UI: Investigate integration of Striker functionality - https://phabricator.wikimedia.org/T383146#10442764 (bd808) Are the various "running in k8s" statements in the task description specifically about deployment to the Kubernetes cluster within To...
[22:07:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudservices1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:12:21] cloud-services-team, Cloud-VPS: openstack: consider removing labs-ip-aliaser - https://phabricator.wikimedia.org/T374129#10442924 (Andrew) Removing the aliaser in codfw1dev broke the connection between the enc-api and the puppetserver. So this needs more research, it may be that the aliaser is still need...
[22:22:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on cloudservices1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:40:11] vivian-rook closed https://github.com/toolforge/paws/pull/475
[23:59:14] FIRING: KernelError: Server cloudcontrol1011 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudcontrol1011 - https://alerts.wikimedia.org/?q=alertname%3DKernelError
[23:59:14] FIRING: KernelWarning: Server cloudcontrol1011 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudcontrol1011 - https://alerts.wikimedia.org/?q=alertname%3DKernelWarning