[00:05:55] FIRING: MaxConntrack: Max conntrack at 80.66% on cloudvirt1067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:16:56] FIRING: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:26:56] RESOLVED: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:35:56] FIRING: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:43:20] (PS1) Bovimacoco: fix: migrate hardcoded Secure Migrate URLs to .env with fallback validation [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166053 (https://phabricator.wikimedia.org/T390402)
[00:44:56] (Abandoned) Bovimacoco: T390397 Enforce Strict Typing. Bug=T390397 [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1157764 (owner: Bovimacoco)
[00:45:00] (Restored) Bovimacoco: T390397 Enforce Strict Typing.
Bug=T390397 [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1157764 (owner: Bovimacoco)
[00:46:11] (Abandoned) Bovimacoco: Secure: Migrate hardcoded URLs to .env with validation [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1157840 (https://phabricator.wikimedia.org/T390402) (owner: Bovimacoco)
[00:50:55] RESOLVED: MaxConntrack: Max conntrack at 80.36% on cloudvirt1067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:51:55] FIRING: MaxConntrack: Max conntrack at 82.03% on cloudvirt1067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:56:55] RESOLVED: MaxConntrack: Max conntrack at 81.6% on cloudvirt1067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[01:00:56] RESOLVED: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:03:56] FIRING: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:28:56] RESOLVED: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:30:58] (PS1) Bovimacoco: fix: enforce Strict Typing [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166054 (https://phabricator.wikimedia.org/T390397)
[01:31:25] (Abandoned) Bovimacoco: T390397 Enforce Strict Typing. Bug=T390397 [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1157764 (owner: Bovimacoco)
[01:42:56] FIRING: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:57:56] RESOLVED: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[05:49:50] (update) raymond-ndibe: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112)
[06:04:15] (update) raymond-ndibe: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112)
[06:23:38] (update) raymond-ndibe: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112)
[06:31:04] (close) raymond-ndibe: [api.jobs] health_check.type deprecation patch [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/173 (https://phabricator.wikimedia.org/T396210 https://phabricator.wikimedia.org/T396236)
[06:33:50] (approved) raymond-ndibe: api: return the configured schedule [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/178 (owner: dcaro)
[06:33:53] (update) raymond-ndibe: api: return the configured schedule [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/178 (owner: dcaro)
[07:01:30] FIRING: PuppetStaleCertificates: Found non-revoked Puppet certificates for 1 deleted instances on gitlab-runners-puppetserver-01 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates -
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates
[08:19:31] (merge) dcaro: api: return the configured schedule [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/178
[08:21:20] (open) taavi: logging: alloy: Deploy to more workers [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/862 (https://phabricator.wikimedia.org/T386480)
[08:21:20] (update) taavi: logging: alloy: Deploy to more workers [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/862 (https://phabricator.wikimedia.org/T386480)
[08:22:18] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: jobs-api: bump to 0.0.385-20250703081944-b099bdb6 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/863 (https://phabricator.wikimedia.org/T398281)
[08:24:44] (update) taavi: logging: alloy: Deploy to more workers [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/862 (https://phabricator.wikimedia.org/T386480)
[08:25:09] (approved) dcaro: logging: alloy: Deploy to more workers [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/862 (https://phabricator.wikimedia.org/T386480) (owner: taavi)
[08:25:13] (merge) dcaro: logging: alloy: Deploy to more workers [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/862 (https://phabricator.wikimedia.org/T386480) (owner: taavi)
[08:25:59] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component logging
[08:26:04] !log taavi@cloudcumin1001 tools END (FAIL)
- Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component logging
[08:26:28] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component logging
[08:27:32] cloud-services-team, Toolforge (Toolforge iteration 21), Patch-For-Review: toolforge jobs API, schedule vs schedule_actual, and API behaviour change between Feb and June 2025 - https://phabricator.wikimedia.org/T398281#10971396 (dcaro) a:dcaro
[08:27:46] cloud-services-team, Toolforge (Toolforge iteration 21), Patch-For-Review: toolforge jobs API, schedule vs schedule_actual, and API behaviour change between Feb and June 2025 - https://phabricator.wikimedia.org/T398281#10971398 (dcaro) p:Triage→Medium
[08:27:55] cloud-services-team, Toolforge (Toolforge iteration 21), Patch-For-Review: toolforge jobs API, schedule vs schedule_actual, and API behaviour change between Feb and June 2025 - https://phabricator.wikimedia.org/T398281#10971400 (dcaro) Open→In progress
[08:28:16] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component logging
[08:28:49] cloud-services-team, Toolforge: tools-static.wmflabs.org down (504) 2025-06-28 - https://phabricator.wikimedia.org/T398103#10971404 (taavi) Open→Resolved a:dcaro This specific instance is resolved. Long-term fixes are tracked in {T397634}.
[08:52:53] Cloud-VPS (Project-requests): Request creation of mobileapps VPS project - https://phabricator.wikimedia.org/T398405#10971482 (Jgiannelos) Since we are creating a more proper environment for our experiments, can we maybe delete the entity-detection project and expand the mobileappsperofrmance quota with...
[09:07:58] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance syslog-server-audit01 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[10:14:38] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[10:23:57] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[10:34:08] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[10:43:47] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[10:46:49] cloud-services-team, Toolforge: [components-api,beta] CI pipelines should wait until Toolforge deployment is 100% successful - https://phabricator.wikimedia.org/T398485#10971961 (dcaro) I suspect that there's too many possible states for a deployment to express with an HTTP return code. Hmm.... I think...
[10:48:12] (approved) dcaro: jobs-api: bump to 0.0.385-20250703081944-b099bdb6 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/863 (https://phabricator.wikimedia.org/T398281) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[10:48:17] (update) dcaro: jobs-api: bump to 0.0.385-20250703081944-b099bdb6 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/863 (https://phabricator.wikimedia.org/T398281) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[10:48:34] (merge) dcaro: jobs-api: bump to 0.0.385-20250703081944-b099bdb6 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/863 (https://phabricator.wikimedia.org/T398281) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[10:50:37] cloud-services-team, Toolforge (Toolforge iteration 21), Patch-For-Review: toolforge jobs API, schedule vs schedule_actual, and API behaviour change between Feb and June 2025 - https://phabricator.wikimedia.org/T398281#10971971 (dcaro) In progress→Resolved
[10:54:07] cloud-services-team, Toolforge: [components-api,beta] Generated configs should contain cpu values as numbers, not strings - https://phabricator.wikimedia.org/T398497#10971976 (dcaro) This one might be a bit tricky, as the value of `cpu` might have units in it (ex. `500m`), so we would have to check exact...
[10:55:28] FIRING: NfsAlmostFull: The NFS drive is over 85% capacity (currently 85.96%) at host paws-nfs-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull
[11:27:12] cloud-services-team, Toolforge: [components-api,beta] CI pipelines should wait until Toolforge deployment is 100% successful - https://phabricator.wikimedia.org/T398485#10972058 (Sascha) Sounds like people would prefer a custom client instead of curl. In that case, the HTTP status code doesn't really mat...
[12:22:49] (open) taavi: logging: alloy: Deploy to even more nodes [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/864 (https://phabricator.wikimedia.org/T386480)
[12:22:51] (update) taavi: logging: alloy: Deploy to even more nodes [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/864 (https://phabricator.wikimedia.org/T386480)
[12:29:10] (open) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/ranker] - https://gitlab.wikimedia.org/toolforge-repos/ranker/-/merge_requests/22
[12:29:11] (open) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/39
[12:29:12] (open) l10n-bot: Localisation updates from https://translatewiki.net.
[toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/3
[12:31:26] (open) fnegri: volumes: remove unused tools-db-6-data-temp [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/58 (https://phabricator.wikimedia.org/T398170)
[12:31:27] (open) fnegri: volumes: remove tools-db-5-data [repos/cloud/toolforge/tofu-provisioning] (main-a161) - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/59 (https://phabricator.wikimedia.org/T398170)
[12:42:20] (update) taavi: logging: alloy: Deploy to even more nodes [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/864 (https://phabricator.wikimedia.org/T386480)
[12:44:02] (update) fnegri: volumes: remove unused tools-db-6-data-temp [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/60 (https://phabricator.wikimedia.org/T398170)
[12:44:02] (update) fnegri: volumes: remove tools-db-5-data [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/61 (https://phabricator.wikimedia.org/T398170)
[12:44:03] (open) fnegri: volumes: remove unused tools-db-6-data-temp [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/60 (https://phabricator.wikimedia.org/T398170)
[12:44:05] (open) fnegri: volumes: remove tools-db-5-data [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/61 (https://phabricator.wikimedia.org/T398170)
[12:44:13] (update) fnegri: volumes: remove unused tools-db-6-data-temp [repos/cloud/toolforge/tofu-provisioning] -
https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/60 (https://phabricator.wikimedia.org/T398170)
[12:44:23] (update) fnegri: volumes: remove tools-db-5-data [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/61 (https://phabricator.wikimedia.org/T398170)
[12:44:58] (close) fnegri: volumes: remove unused tools-db-6-data-temp [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/58 (https://phabricator.wikimedia.org/T398170)
[12:45:06] (close) fnegri: volumes: remove tools-db-5-data [repos/cloud/toolforge/tofu-provisioning] (main-a161) - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/59 (https://phabricator.wikimedia.org/T398170)
[12:55:54] (approved) dcaro: volumes: remove unused tools-db-6-data-temp [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/60 (https://phabricator.wikimedia.org/T398170) (owner: fnegri)
[12:57:05] (approved) dcaro: volumes: remove tools-db-5-data [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/61 (https://phabricator.wikimedia.org/T398170) (owner: fnegri)
[12:59:34] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T398353#10972419 (Andrew) Open→Resolved a:Andrew
[13:00:17] cloud-services-team: PuppetFailure Puppet has failed on cloudcontrol1006:9100 - https://phabricator.wikimedia.org/T398351#10972423 (Andrew) Open→Resolved a:Andrew
[13:00:20] cloud-services-team: PuppetFailure Puppet has failed on cloudcontrol2010-dev:9100 - https://phabricator.wikimedia.org/T398349#10972425 (Andrew) Open→Resolved a:Andrew
[13:00:53] cloud-services-team, Toolforge:
[components-api,beta] Generated configs should contain cpu values as numbers, not strings - https://phabricator.wikimedia.org/T398497#10972430 (dcaro) p:Triage→Low
[13:00:56] cloud-services-team, Cloud-VPS, Bitu, Infrastructure-Foundations: developer service accounts and email - https://phabricator.wikimedia.org/T398074#10972431 (Andrew) p:Triage→Low
[13:02:54] (merge) fnegri: volumes: remove unused tools-db-6-data-temp [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/60 (https://phabricator.wikimedia.org/T398170)
[13:02:56] (update) fnegri: volumes: remove tools-db-5-data [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/61 (https://phabricator.wikimedia.org/T398170)
[13:03:00] Toolforge (Toolforge iteration 21): [components-api,api-gateway] allow getting a deployment status using the deployment token - https://phabricator.wikimedia.org/T398623 (dcaro) NEW
[13:03:07] cloud-services-team: NovafullstackSustainedFailures Novafullstack tests have been failing for more than 5hours in eqiad - https://phabricator.wikimedia.org/T397882#10972449 (Andrew) Open→Resolved a:Andrew
[13:03:21] cloud-services-team, Toolforge: [components-api,beta] CI pipelines should wait until Toolforge deployment is 100% successful - https://phabricator.wikimedia.org/T398485#10972453 (dcaro) We are working on getting a client that's easily "installable" (ex. single binary {T356262}), and we have [[ https://py...
[13:03:40] Toolforge (Toolforge iteration 21): [components-api,api-gateway] allow getting a deployment status using the deployment token - https://phabricator.wikimedia.org/T398623#10972466 (dcaro)
[13:03:41] cloud-services-team, Toolforge: [components-api,beta] CI pipelines should wait until Toolforge deployment is 100% successful - https://phabricator.wikimedia.org/T398485#10972467 (dcaro)
[13:03:54] Toolforge (Toolforge iteration 21): [components-api,api-gateway] allow getting a deployment status using the deployment token - https://phabricator.wikimedia.org/T398623#10972468 (dcaro) p:Triage→High
[13:04:07] cloud-services-team, Toolforge: [components-api,beta] CI pipelines should wait until Toolforge deployment is 100% successful - https://phabricator.wikimedia.org/T398485#10972469 (dcaro) p:Triage→High
[13:05:45] cloud-services-team, Cloud-VPS, IPv6, Upstream: Trove managed instances should be dual stack - https://phabricator.wikimedia.org/T398189#10972483 (Andrew) p:Triage→Medium
[13:06:28] FIRING: InstanceDown: Project gitlab-runners instance runner-1022 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:09:25] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-24
[13:10:40] cloud-services-team, Cloud-VPS, Toolforge: [wmcs-cookbooks] create_instance_with_prefix should not use vlan/legacy - https://phabricator.wikimedia.org/T398625 (fnegri) NEW
[13:11:28] RESOLVED: InstanceDown: Project gitlab-runners instance runner-1022 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:12:16] Cloud-VPS (Project-requests): Request creation of wikidata-deleted VPS project - https://phabricator.wikimedia.org/T398254#10972512 (dcaro) +1 I have some questions if you don't mind. What is the resource usage you expect for it? (cpu, ram, disk, ...)
[13:12:21] Cloud-VPS (Project-requests): Request creation of wikidata-deleted VPS project - https://phabricator.wikimedia.org/T398254#10972514 (fnegri) +1
[13:14:29] (update) dcaro: cancel: add endpoint to cancel an ongoing deployment [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/99 (https://phabricator.wikimedia.org/T395039)
[13:15:20] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-24
[13:16:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudnet2006-dev.codfw.wmnet'
[13:16:55] (update) dcaro: cancel: add endpoint to cancel an ongoing deployment [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/99 (https://phabricator.wikimedia.org/T395039)
[13:17:02] (merge) dcaro: cancel: add endpoint to cancel an ongoing deployment [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/99 (https://phabricator.wikimedia.org/T395039)
[13:19:51] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.131-20250703131710-86770608 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/865 (https://phabricator.wikimedia.org/T395039)
[13:22:36] (approved) dcaro: maintain-kubeusers: bump to 0.0.178-20250702084425-15f2dd20 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/860 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:22:47] (update) dcaro: maintain-kubeusers: bump to 0.0.178-20250702084425-15f2dd20 [repos/cloud/toolforge/toolforge-deploy] -
https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/860 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:23:00] (merge) taavi: logging: alloy: Deploy to even more nodes [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/864 (https://phabricator.wikimedia.org/T386480)
[13:23:05] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-api
[13:23:05] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component logging
[13:23:46] (update) raymond-ndibe: [jobs-api] refactor quota models [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/164 (https://phabricator.wikimedia.org/T389118)
[13:23:54] (PS3) David Caro: Move to gitlab [labs/toollabs] - https://gerrit.wikimedia.org/r/1165027 (https://phabricator.wikimedia.org/T398202)
[13:24:17] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudnet2006-dev.codfw.wmnet'
[13:24:20] (CR) CI reject: [V:-1] Move to gitlab [labs/toollabs] - https://gerrit.wikimedia.org/r/1165027 (https://phabricator.wikimedia.org/T398202) (owner: David Caro)
[13:26:18] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component logging
[13:27:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudnet2005-dev.codfw.wmnet'
[13:27:50] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api
[13:31:24] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component components-api
[13:31:48] (merge) fnegri: volumes: remove tools-db-5-data
[repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/61 (https://phabricator.wikimedia.org/T398170)
[13:35:01] FIRING: NTPNoSynced: NTP not synced on cloudnet2006-dev:9100 - https://wikitech.wikimedia.org/wiki/NTP - TODO - https://alerts.wikimedia.org/?q=alertname%3DNTPNoSynced
[13:35:15] (update) raymond-ndibe: [jobs-cli] health_check and quota refactor [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/97 (https://phabricator.wikimedia.org/T389118)
[13:35:37] (update) raymond-ndibe: [jobs-cli] quota refactor [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/97 (https://phabricator.wikimedia.org/T389118)
[13:36:05] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudnet2005-dev.codfw.wmnet'
[13:36:43] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api
[13:37:44] (update) raymond-ndibe: [jobs-cli] quota refactor [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/97 (https://phabricator.wikimedia.org/T389118)
[13:38:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:40:44] (open) fnegri: volumes: fix typo in volume description [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/62 (https://phabricator.wikimedia.org/T398170)
[13:40:46] (update) fnegri: volumes: fix typo in volume description
[repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/62 (https://phabricator.wikimedia.org/T398170)
[13:41:30] (approved) dcaro: volumes: fix typo in volume description [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/62 (https://phabricator.wikimedia.org/T398170) (owner: fnegri)
[13:41:43] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudnet2006-dev.codfw.wmnet'
[13:44:13] (merge) fnegri: volumes: fix typo in volume description [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/62 (https://phabricator.wikimedia.org/T398170)
[13:48:41] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:49:00] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudnet2006-dev.codfw.wmnet'
[13:49:06] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-24 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[13:50:01] FIRING: [2x] NTPNoSynced: NTP not synced on cloudnet2005-dev:9100 - https://wikitech.wikimedia.org/wiki/NTP - TODO - https://alerts.wikimedia.org/?q=alertname%3DNTPNoSynced
[13:50:24] (approved) dcaro:
components-api: bump to 0.0.131-20250703131710-86770608 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/865 (https://phabricator.wikimedia.org/T395039) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:50:27] (update) dcaro: components-api: bump to 0.0.131-20250703131710-86770608 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/865 (https://phabricator.wikimedia.org/T395039) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:51:28] FIRING: InstanceDown: Project gitlab-runners instance runner-1023 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:52:01] (merge) dcaro: components-api: bump to 0.0.131-20250703131710-86770608 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/865 (https://phabricator.wikimedia.org/T395039) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:52:40] (approved) dcaro: cancel: add new subcommand to cancel a deployment [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/45
[13:52:43] (merge) dcaro: cancel: add new subcommand to cancel a deployment [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/45
[13:53:15] (approved) dcaro: bash-completion: Add file system recognition to autocomplete [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/46 (https://phabricator.wikimedia.org/T395077) (owner: chuckonwumelu)
[13:53:21] (update) dcaro: bash-completion: Add file system recognition to autocomplete [repos/cloud/toolforge/components-cli] -
10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/46 (https://phabricator.wikimedia.org/T395077) (owner: 10chuckonwumelu) [13:54:35] (03merge) 10dcaro: bash-completion: Add file system recognition to autocomplete [repos/cloud/toolforge/components-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/46 (https://phabricator.wikimedia.org/T395077) (owner: 10chuckonwumelu) [13:55:53] (03open) 10dcaro: d/changelog: bump to 0.0.12 [repos/cloud/toolforge/components-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/50 (https://phabricator.wikimedia.org/T395039 https://phabricator.wikimedia.org/T395077) [13:55:57] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-cli [13:56:28] RESOLVED: InstanceDown: Project gitlab-runners instance runner-1023 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [13:58:38] 10Cloud-VPS (Project-requests): Request creation of mobileapps VPS project - https://phabricator.wikimedia.org/T398405#10972692 (10Andrew) >>! In T398405#10971482, @Jgiannelos wrote: > Since we are creating a more proper environment for our experiments, can we maybe delete the entity-detection project and ex... 
[14:00:01] FIRING: NTPNoSynced: NTP not synced on cloudnet2006-dev:9100 - https://wikitech.wikimedia.org/wiki/NTP - TODO - https://alerts.wikimedia.org/?q=alertname%3DNTPNoSynced [14:00:28] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-cli [14:02:22] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component components-cli [14:04:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol2005-dev.codfw.wmnet' [14:06:03] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-cli [14:09:00] (03approved) 10dcaro: d/changelog: bump to 0.0.12 [repos/cloud/toolforge/components-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/50 (https://phabricator.wikimedia.org/T395039 https://phabricator.wikimedia.org/T395077) [14:09:04] (03merge) 10dcaro: d/changelog: bump to 0.0.12 [repos/cloud/toolforge/components-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/50 (https://phabricator.wikimedia.org/T395039 https://phabricator.wikimedia.org/T395077) [14:14:35] (03open) 10dcaro: cancel: add the missing autocomplete [repos/cloud/toolforge/components-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/51 [14:15:40] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudcontrol2005-dev.codfw.wmnet' [14:17:23] 10Toolforge (Toolforge iteration 21), 13Patch-For-Review: [components-api,components-cli] add `deploy cancel` feature - https://phabricator.wikimedia.org/T395039#10972825 (10dcaro) a:03dcaro [14:17:26] 10Toolforge (Toolforge iteration 21), 13Patch-For-Review: [components-api,components-cli] add `deploy cancel` feature - 
https://phabricator.wikimedia.org/T395039#10972827 (10dcaro) 05Open→03In progress [14:18:56] (03update) 10raymond-ndibe: [jobs-cli] quota refactor [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/97 (https://phabricator.wikimedia.org/T389118) [14:26:17] (03update) 10raymond-ndibe: [jobs-cli] quota refactor [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/97 (https://phabricator.wikimedia.org/T389118) [14:30:01] FIRING: NTPNoSynced: NTP not synced on cloudcontrol2005-dev:9100 - https://wikitech.wikimedia.org/wiki/NTP - TODO - https://alerts.wikimedia.org/?q=alertname%3DNTPNoSynced [14:38:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [14:41:30] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for service: project,designate [14:42:00] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for service: project,designate [14:43:53] (03PS1) 10Andrew Bogott: upgrade_openstack_node: apt-get upgrade instead of dist-upgrade [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1166214 [14:44:42] (03CR) 10Majavah: [C:03+1] upgrade_openstack_node: apt-get upgrade instead of dist-upgrade [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1166214 (owner: 10Andrew Bogott) [14:46:08] 10Toolforge (Toolforge iteration 21): [components-api,components-cli] add `deploy cancel` feature - https://phabricator.wikimedia.org/T395039#10972901 (10dcaro) 05In progress→03Resolved [14:48:41] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [14:48:46] (03update) 10dcaro: global: don't return tracebacks to users [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/102 [14:51:45] (03CR) 10Andrew Bogott: [C:03+2] upgrade_openstack_node: apt-get upgrade instead of dist-upgrade [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1166214 (owner: 10Andrew Bogott) [14:55:01] FIRING: [3x] NTPNoSynced: NTP not synced on cloudcontrol2005-dev:9100 - https://wikitech.wikimedia.org/wiki/NTP - TODO - https://alerts.wikimedia.org/?q=alertname%3DNTPNoSynced [14:55:06] 10Cloud-VPS (Project-requests): Request creation of "entity-detection" VPS project - https://phabricator.wikimedia.org/T246362#10972926 (10Jgiannelos) This project can be deleted but since we are actively using the resources we would like them to be reused in mobileappsperformance project. [14:56:20] 10Cloud-VPS (Project-requests): Request creation of "entity-detection" VPS project - https://phabricator.wikimedia.org/T246362#10972932 (10Jgiannelos) 05Resolved→03Open [14:56:47] (03Merged) 10jenkins-bot: upgrade_openstack_node: apt-get upgrade instead of dist-upgrade [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1166214 (owner: 10Andrew Bogott) [14:59:20] 10Cloud-VPS (Quota-requests): Change quota for mobileappsperformance account - https://phabricator.wikimedia.org/T398638 (10Jgiannelos) 03NEW [14:59:58] 10Cloud-VPS (Project-requests): Request creation of "entity-detection" VPS project - https://phabricator.wikimedia.org/T246362#10972963 (10taavi) 05Open→03Resolved Please create a new task instead of re-using this. 
[15:00:57] (03update) 10raymond-ndibe: [jobs-cli] quota refactor [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/97 (https://phabricator.wikimedia.org/T389118) [15:02:31] (03open) 10dcaro: openapi: added several servers [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/107 [15:03:28] (03close) 10dcaro: openapi spec: Add servers [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/104 (owner: 10addshore) [15:07:07] (03merge) 10dcaro: global: don't return tracebacks to users [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/102 [15:09:54] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.132-20250703150719-80fbf729 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/866 [15:10:57] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-api [15:15:46] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api [15:39:48] FIRING: PuppetFailure: Puppet has failed on cloudcontrol2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:39:53] 06cloud-services-team: PuppetFailure Puppet has failed on cloudcontrol2005-dev:9100 - https://phabricator.wikimedia.org/T398642 (10phaultfinder) 03NEW [15:42:25] 06cloud-services-team, 10Cloud-VPS: [tofu-cloudvps] cloudvps_puppet_prefix.hiera settings show dirty diffs based on YAML canonicalization - https://phabricator.wikimedia.org/T398643 (10bd808) 03NEW [15:49:34] 
06cloud-services-team, 10Toolforge: [jobs-api] Jobs API should query logs from Loki - https://phabricator.wikimedia.org/T398645 (10taavi) 03NEW p:05Triage→03High [15:49:52] 06cloud-services-team, 10Toolforge: [jobs-api] Jobs API should query logs from Loki - https://phabricator.wikimedia.org/T398645#10973139 (10taavi) p:05High→03Medium [15:49:57] 06cloud-services-team, 10Cloud-VPS: [tofu-cloudvps] cloudvps_puppet_prefix.hiera settings show dirty diffs based on YAML canonicalization - https://phabricator.wikimedia.org/T398643#10973141 (10bd808) One workaround for this that I could imagine is adding a new `/v1/yaml` (route name is easily debatable) route... [15:50:43] 06cloud-services-team, 10Cloud-VPS: [tofu-cloudvps] cloudvps_puppet_prefix.hiera settings show dirty diffs based on YAML canonicalization - https://phabricator.wikimedia.org/T398643#10973142 (10bd808) [15:52:25] 06cloud-services-team, 10Cloud-VPS: [tofu-cloudvps] cloudvps_puppet_prefix.hiera settings show dirty diffs based on YAML canonicalization - https://phabricator.wikimedia.org/T398643#10973146 (10taavi) I'm fairly sure the provider actually transforms the YAML string into a JSON object that it sends to the API,... 
[15:52:31] 06cloud-services-team, 10Toolforge: [jobs-api] Jobs API should query logs from Loki - https://phabricator.wikimedia.org/T398645#10973147 (10taavi) a:03taavi [15:53:17] 06cloud-services-team, 10Toolforge: [jobs-api] Jobs API should query logs from Loki - https://phabricator.wikimedia.org/T398645#10973148 (10taavi) [15:53:22] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: [o11y,logging,infra] Deploy Loki to store Toolforge tool log data - https://phabricator.wikimedia.org/T386480#10973149 (10taavi) [15:57:03] 06cloud-services-team, 10Toolforge, 07Epic: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools - https://phabricator.wikimedia.org/T127367#10973153 (10taavi) a:03taavi Per the last #cloud-services-team meeting, the rough plan here is: * Finish deploying Loki to... [16:01:49] 06cloud-services-team, 10Cloud-VPS, 07Upstream: trove: Unable to create user with IPv6 address as host - https://phabricator.wikimedia.org/T393760#10973160 (10taavi) [16:11:57] 06cloud-services-team, 10Toolforge: Move Kubernetes log source multi-pod handling from jobs-api to toolforge-weld - https://phabricator.wikimedia.org/T398647 (10taavi) 03NEW [16:13:07] 06cloud-services-team, 10Toolforge: Move Kubernetes log source multi-pod handling from jobs-api to toolforge-weld - https://phabricator.wikimedia.org/T398647#10973214 (10taavi) p:05Triage→03Medium [16:30:48] 06cloud-services-team, 10Toolforge: Move Kubernetes log source multi-pod handling from jobs-api to toolforge-weld - https://phabricator.wikimedia.org/T398647#10973238 (10dcaro) I have my doubts about this, as relatively soon we will move webservice to jobs-api, and then there will be no other users of the logg... 
[16:30:58] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component components-api [16:36:37] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api [16:37:57] 06cloud-services-team, 10Toolforge: Move Kubernetes log source multi-pod handling from jobs-api to toolforge-weld - https://phabricator.wikimedia.org/T398647#10973253 (10taavi) Yeah, most of this will soon be obsolete. But right now it isn't, and I don't want to block the Loki work on the webservice->jobs migr... [16:40:13] 06cloud-services-team, 10Toolforge: Move Kubernetes log source multi-pod handling from jobs-api to toolforge-weld - https://phabricator.wikimedia.org/T398647#10973261 (10dcaro) >>! In T398647#10973253, @taavi wrote: > Yeah, most of this will soon be obsolete. But right now it isn't, and I don't want to block t... [16:44:28] (03open) 10dcaro: debian-builder-bookworm: add missing default arch [repos/cloud/cicd/gitlab-ci] - 10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/66 [16:44:50] (03approved) 10taavi: debian-builder-bookworm: add missing default arch [repos/cloud/cicd/gitlab-ci] - 10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/66 (owner: 10dcaro) [16:44:59] (03merge) 10dcaro: debian-builder-bookworm: add missing default arch [repos/cloud/cicd/gitlab-ci] - 10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/66 [16:45:10] (03approved) 10dcaro: components-api: bump to 0.0.132-20250703150719-80fbf729 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/866 (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [16:45:12] (03merge) 10dcaro: components-api: bump to 0.0.132-20250703150719-80fbf729 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/866 
(owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [16:50:02] (03open) 10dcaro: toolconfig: make config_version explicitly nullable [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/108 [16:52:11] (03update) 10dcaro: openapi: added several servers [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/107 [16:52:17] (03update) 10dcaro: toolconfig: make config_version explicitly nullable [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/108 [16:52:28] (03close) 10dcaro: openapi spec: ToolConfig-Output config_version is nullable [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/103 (owner: 10addshore) [17:01:08] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10973309 (10dcaro) >>! In T394333#10964951, @Andrew wrote: >>>! In T394333#10964303, @ayounsi wrote: >> @Andrew Would it be possible to use a single 25G up... 
[17:02:56] 10Toolforge (Toolforge iteration 21): [docs] enable docs linter in one of the repos - https://phabricator.wikimedia.org/T397949#10973313 (10dcaro) p:05Triage→03Low [17:03:04] 10Toolforge (Toolforge iteration 21): [infra] 2025-06-21 tools-prometheus-8 stopped responding for a bit - https://phabricator.wikimedia.org/T397563#10973314 (10dcaro) p:05Triage→03High [17:03:14] 10Toolforge (Toolforge iteration 21): [infra] 2025-06-21 Several correlated potentially network issues during the night - https://phabricator.wikimedia.org/T397566#10973315 (10dcaro) p:05Triage→03High [17:03:52] 06cloud-services-team, 10Data-Services: SQL function to recover the normal hostname, to install on Wiki Replica instances - https://phabricator.wikimedia.org/T344877#10973317 (10Wellverywell) [17:04:46] (03update) 10dcaro: maintain-kubeusers: bump to 0.0.178-20250702084425-15f2dd20 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/860 (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [17:05:24] (03update) 10andrew: README: use makrdown for nice presentation in gitlab [repos/cloud/cloud-vps/horizon/deploy] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/4 (owner: 10dcaro) [17:05:54] (03merge) 10andrew: README: use makrdown for nice presentation in gitlab [repos/cloud/cloud-vps/horizon/deploy] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/4 (owner: 10dcaro) [17:05:57] (03update) 10andrew: makefile: support podman [repos/cloud/cloud-vps/horizon/deploy] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/5 (owner: 10dcaro) [17:06:49] (03merge) 10andrew: README: add dev notes about authentication [repos/cloud/cloud-vps/horizon/deploy] (support_podman) - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/6 (owner: 10dcaro) [17:06:52] (03update) 10andrew: 
makefile: support podman [repos/cloud/cloud-vps/horizon/deploy] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/5 (owner: 10dcaro) [17:07:08] (03update) 10andrew: makefile: support podman [repos/cloud/cloud-vps/horizon/deploy] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/5 (owner: 10dcaro) [17:08:46] 06cloud-services-team, 10Toolforge: [toolforge-cli-gen] review the https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-gen-cli client as potential consolidation - https://phabricator.wikimedia.org/T398651 (10dcaro) 03NEW [17:10:43] 06cloud-services-team, 10Toolforge: [toolforge-cli-gen] review the https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-gen-cli client as potential consolidation - https://phabricator.wikimedia.org/T398651#10973346 (10dcaro) [17:12:19] 06cloud-services-team, 10Cloud-VPS: [tofu-cloudvps] cloudvps_puppet_prefix.hiera settings show dirty diffs based on YAML canonicalization - https://phabricator.wikimedia.org/T398643#10973350 (10bd808) >>! In T398643#10973146, @taavi wrote: > I'm fairly sure the provider actually transforms the YAML string into... [17:19:37] (03merge) 10andrew: makefile: support podman [repos/cloud/cloud-vps/horizon/deploy] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/5 (owner: 10dcaro) [17:48:58] 06cloud-services-team, 10Cloud-VPS: [tofu-cloudvps] cloudvps_puppet_prefix.hiera settings show dirty diffs based on YAML canonicalization - https://phabricator.wikimedia.org/T398643#10973432 (10bd808) >>! In T398643#10973350, @bd808 wrote: > This is not the same YAML as the `yaml.safe_dump` canonical form retu... 
[17:56:31] RESOLVED: NTPNoSynced: NTP not synced on cloudcontrol2005-dev:9100 - https://wikitech.wikimedia.org/wiki/NTP - TODO - https://alerts.wikimedia.org/?q=alertname%3DNTPNoSynced [17:59:48] RESOLVED: PuppetFailure: Puppet has failed on cloudcontrol2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:11:28] 10Cloud-VPS (Project-requests): Request creation of wikidata-deleted VPS project - https://phabricator.wikimedia.org/T398254#10973580 (10Bovlb) >>! In T398254#10972512, @dcaro wrote: > What is the resource usage you expect for it? (cpu, ram, disk, ...) CPU: Low to moderate. The Solr query load is fairly modest,... [22:49:39] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 324 bytes in 60.004 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [22:51:37] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 54.769 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [23:28:05] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-70 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses