[00:07:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:07:41] FIRING: [2x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:12:28] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:12:41] FIRING: [3x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:17:41] FIRING: [3x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:22:41] RESOLVED: [3x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:24:51] (open) matmarex: Add note about using local time zone [toolforge-repos/schedule-deployment] - https://gitlab.wikimedia.org/toolforge-repos/schedule-deployment/-/merge_requests/8
[01:08:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records -
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[01:18:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:38:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:48:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:54:38] (approved) sstefanova: utils: update deb build and bump setup [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/36 (https://phabricator.wikimedia.org/T366674)
[05:54:42] (merge) sstefanova: utils: update deb build and bump setup [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/36 (https://phabricator.wikimedia.org/T366674)
[06:20:11] Toolforge (Toolforge iteration 11), Patch-For-Review: [builds-cli, jobs-cli, envvars-cli] update package/version scripts to use bookworm - https://phabricator.wikimedia.org/T366674#9878252 (Slst2020) In progress→Resolved
[07:37:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records -
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:38:41] Cloud Services Proposals: Decision request - kubernetes upgrade workgroup - https://phabricator.wikimedia.org/T363683#9878334 (dcaro) Open→Resolved
[07:42:41] FIRING: [2x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:47:41] FIRING: [2x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:52:41] RESOLVED: [2x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:57:55] (update) sstefanova: Draft: api: prefix endpoints with /tool/{toolname}/ [repos/cloud/toolforge/jobs-api] (add_messages_to_all_responses) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/92
[08:16:23] Toolforge: [k8s,infra] Upgrade Toolforge to Uwubernetes (1.30) - https://phabricator.wikimedia.org/T362869#9878385 (dcaro)
[08:16:30] Toolforge: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.29 - https://phabricator.wikimedia.org/T362868#9878386 (dcaro)
[08:16:36] Toolforge: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.28 -
https://phabricator.wikimedia.org/T362867#9878387 (dcaro)
[08:16:43] Toolforge: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641#9878388 (dcaro)
[08:16:50] cloud-services-team, Toolforge: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.26 - https://phabricator.wikimedia.org/T327025#9878389 (dcaro)
[08:16:56] cloud-services-team, Toolforge: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.25 - https://phabricator.wikimedia.org/T316107#9878390 (dcaro)
[08:31:51] (update) sstefanova: api: prefix endpoints with /tool/{toolname}/ [repos/cloud/toolforge/jobs-api] (add_messages_to_all_responses) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/92
[08:35:02] (update) sstefanova: api: prefix endpoints with /tool/{toolname}/ [repos/cloud/toolforge/jobs-api] (add_messages_to_all_responses) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/92
[08:36:46] (update) sstefanova: api: prefix endpoints with /tool/{toolname}/ [repos/cloud/toolforge/jobs-api] (add_messages_to_all_responses) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/92
[08:39:44] cloud-services-team, Toolforge, Patch-For-Review: [toolforge,storage] Provide per-tool access to cloud-vps object storage - https://phabricator.wikimedia.org/T358496#9878428 (dcaro) >>! In T358496#9876797, @Andrew wrote: > > If a user is interacting with toolforge directly from their laptop us...
[08:43:22] (open) dcaro: defaults: increase deployments to 16 [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/40 (https://phabricator.wikimedia.org/T367002)
[08:44:40] (open) dcaro: DONOTMERGE_api: proxy requests to the backend APIs [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983)
[08:44:41] cloud-services-team, Toolforge (Quota-requests), Patch-For-Review: Request increased quota for editgroups Toolforge tool - https://phabricator.wikimedia.org/T367002#9878445 (dcaro) Open→In progress a:dcaro
[08:45:08] (update) dcaro: DONOTMERGE_api: proxy requests to the backend APIs [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983)
[08:45:20] (update) dcaro: api: proxy requests to the backend APIs [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983)
[08:45:51] (update) sstefanova: api: prefix endpoints with /tool/{toolname}/ [repos/cloud/toolforge/jobs-api] (add_messages_to_all_responses) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/92
[08:49:48] (update) dcaro: api: proxy requests to the backend APIs [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983)
[08:50:24] (update) dcaro: api: proxy requests to the backend APIs [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983)
[08:51:58] (update) dcaro: api: proxy requests to the backend APIs
[repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983)
[08:55:53] (update) dcaro: defaults: increase deployments to 16 [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/40 (https://phabricator.wikimedia.org/T367002)
[08:57:02] (update) aborrero: defaults: increase deployments to 16 [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/40 (https://phabricator.wikimedia.org/T367002) (owner: dcaro)
[08:57:05] (approved) aborrero: defaults: increase deployments to 16 [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/40 (https://phabricator.wikimedia.org/T367002) (owner: dcaro)
[09:02:16] (update) dcaro: defaults: increase deployments to 16 [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/40 (https://phabricator.wikimedia.org/T367002)
[09:02:17] (merge) dcaro: defaults: increase deployments to 16 [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/40 (https://phabricator.wikimedia.org/T367002)
[09:04:40] (open) project_1317_bot_df3177307bed93c3f34e421e26c86e38: maintain-kubeusers: bump to 0.0.147-20240611090228-a9acf2f7 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/323 (https://phabricator.wikimedia.org/T367002)
[09:22:46] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, DC-Ops, ops-eqiad, SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9878558 (dcaro) >>!
In T348643#9868739, @wiki_willy wrote: > Ok, got it. Thanks for the info @dcaro. And just to...
[09:26:10] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[09:26:23] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[09:33:26] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[09:33:26] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[09:33:42] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[09:34:52] !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.restart_openstack (exit_code=99)
[09:35:48] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[09:43:17] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[09:45:43] (PS1) Majavah: inventory: Update codfw1dev control nodes [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1041555
[09:59:17] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[09:59:34] (CR) Majavah: [C:+2] inventory: Update codfw1dev control nodes [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1041555 (owner: Majavah)
[10:00:27] !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.restart_openstack (exit_code=99)
[10:02:47] (Merged) jenkins-bot: inventory: Update codfw1dev control nodes [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1041555 (owner: Majavah)
[10:04:20] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[10:07:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks -
https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:10:33] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[10:12:48] (update) dcaro: maintain-kubeusers: bump to 0.0.147-20240611090228-a9acf2f7 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/323 (https://phabricator.wikimedia.org/T367002) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[10:16:52] (update) aborrero: maintain_kubeusers: add support for kyverno policies [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/18 (https://phabricator.wikimedia.org/T279110)
[10:17:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:28:53] (open) dcaro: maintain-kubeusers: remove overrides with defaults higher than them [repos/cloud/toolforge/toolforge-deploy] (bump_maintain-kubeusers) - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/324
[10:34:03] (open) aborrero: functional-tests: add pod-policy smoke test [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/325
[10:35:41] (update) dcaro: maintain-kubeusers: bump to 0.0.147-20240611090228-a9acf2f7 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/323 (https://phabricator.wikimedia.org/T367002) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[10:36:07] (approved)
aborrero: maintain-kubeusers: bump to 0.0.147-20240611090228-a9acf2f7 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/323 (https://phabricator.wikimedia.org/T367002) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[10:37:11] (update) dcaro: maintain-kubeusers: bump to 0.0.147-20240611090228-a9acf2f7 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/323 (https://phabricator.wikimedia.org/T367002) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[10:38:17] (update) aborrero: functional-tests: add pod-policy smoke test [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/325
[10:39:08] (update) dcaro: maintain-kubeusers: bump to 0.0.147-20240611090228-a9acf2f7 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/323 (https://phabricator.wikimedia.org/T367002) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[10:40:02] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[10:40:12] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[11:12:29] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[11:12:43] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[11:12:56] (update) aborrero: functional-tests: add pod-policy smoke test [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/325
[11:22:40] (update) dcaro:
maintain-kubeusers: bump to 0.0.147-20240611090228-a9acf2f7 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/323 (https://phabricator.wikimedia.org/T367002) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[11:22:47] (approved) dcaro: maintain-kubeusers: bump to 0.0.147-20240611090228-a9acf2f7 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/323 (https://phabricator.wikimedia.org/T367002) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[11:22:51] (merge) dcaro: maintain-kubeusers: bump to 0.0.147-20240611090228-a9acf2f7 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/323 (https://phabricator.wikimedia.org/T367002) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[11:22:52] (update) project_1317_bot_df3177307bed93c3f34e421e26c86e38: maintain-kubeusers: remove overrides with defaults higher than them [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/324 (owner: dcaro)
[11:24:05] (update) dcaro: maintain-kubeusers: remove overrides with defaults higher than them [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/324
[11:25:46] (CR) Ladsgroup: [C:+2] gitlab: iterate over all pages of results to index missing repositories [labs/codesearch] - https://gerrit.wikimedia.org/r/1039977 (https://phabricator.wikimedia.org/T366878) (owner: Brouberol)
[11:27:11] (Merged) jenkins-bot: gitlab: iterate over all pages of results to index missing repositories [labs/codesearch] - https://gerrit.wikimedia.org/r/1039977 (https://phabricator.wikimedia.org/T366878) (owner: Brouberol)
[11:29:04] cloud-services-team,
Toolforge (Quota-requests), Patch-For-Review: Request increased quota for editgroups Toolforge tool - https://phabricator.wikimedia.org/T367002#9879154 (dcaro) This has been deployed :) ` tools.editgroups@tools-bastion-13:~$ toolforge jobs quota Running jobs...
[11:29:11] cloud-services-team, Toolforge (Quota-requests), Patch-For-Review: Request increased quota for editgroups Toolforge tool - https://phabricator.wikimedia.org/T367002#9879155 (dcaro) In progress→Resolved
[11:34:08] (approved) aborrero: maintain-kubeusers: remove overrides with defaults higher than them [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/324 (owner: dcaro)
[11:35:24] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[11:35:32] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[11:35:40] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[11:35:54] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[11:38:27] (update) dcaro: maintain-kubeusers: remove overrides with defaults higher than them [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/324
[11:38:37] (update) dcaro: maintain-kubeusers: remove overrides with defaults higher than them [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/324
[11:39:05] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[11:39:15] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook
wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[11:39:23] (update) aborrero: functional-tests: add pod-policy smoke test [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/325
[11:39:34] (update) dcaro: maintain-kubeusers: remove overrides with defaults higher than them [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/324
[11:41:49] (update) aborrero: functional-tests: add pod-policy smoke test [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/325
[11:43:17] (update) aborrero: functional-tests: add pod-policy smoke test [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/325
[11:43:58] (open) sstefanova: d/changelog: bump to 0.0.7 [repos/cloud/toolforge/envvars-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/46
[11:44:18] (approved) dcaro: maintain-kubeusers: remove overrides with defaults higher than them [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/324
[11:44:28] (merge) dcaro: maintain-kubeusers: remove overrides with defaults higher than them [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/324
[12:37:16] (update) dcaro: api: proxy requests to the backend APIs [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983)
[12:37:49] (update) dcaro: api: proxy requests to the backend APIs [repos/cloud/toolforge/api-gateway] -
https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983)
[12:53:07] Toolforge (Toolforge iteration 11): [toolforge-deploy] envvars functional tests fail when out of quota - https://phabricator.wikimedia.org/T367169 (Slst2020) NEW
[13:20:43] (update) dcaro: api: proxy requests to the backend APIs [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983)
[13:31:03] (update) dcaro: api: proxy requests to the backend APIs [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983)
[13:35:31] (update) dcaro: api: proxy requests to the backend APIs [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983)
[13:37:25] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, Quarry, Patch-For-Review: Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407#9879766 (KCVelaga_WMF) @fnegri I am curious about the status of this task, and especially the su...
[13:37:41] FIRING: [2x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:38:49] (update) dcaro: api: proxy requests to the backend APIs [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983)
[13:40:50] (update) dcaro: api: proxy requests to the backend APIs [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983)
[13:42:41] FIRING: [3x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:43:21] (update) dcaro: api: proxy requests to the backend APIs [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983)
[13:43:23] (update) dcaro: api: proxy requests to the backend APIs [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983)
[13:43:40] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, Quarry, Patch-For-Review: Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407#9879789 (fnegri) @KCVelaga_WMF it's not working but I'm struggling to understand why. I had a qu...
[13:44:04] Toolforge (Toolforge iteration 11): [api-gateway] Move authentication from the APIs - https://phabricator.wikimedia.org/T367179 (dcaro) NEW
[13:45:02] Toolforge (Toolforge iteration 11): [jobs-api] Remove authentication and use the api-gateway provided headers - https://phabricator.wikimedia.org/T367180 (dcaro) NEW
[13:45:33] Toolforge (Toolforge iteration 11): [envvars-api] Remove authentication and use api-gateway provided headers - https://phabricator.wikimedia.org/T367181 (dcaro) NEW
[13:45:36] Toolforge (Toolforge iteration 11): [api-gateway] Move authentication from the APIs - https://phabricator.wikimedia.org/T367179#9879804 (dcaro) p:Triage→High
[13:45:58] Toolforge (Toolforge iteration 11): [builds-api] Remove authentication and use the api-gateway provided headers - https://phabricator.wikimedia.org/T367182 (dcaro) NEW
[13:46:48] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, Quarry, Patch-For-Review: Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407#9879841 (fnegri) We could also try with Superset in the meantime, maybe that will be easier. I w...
[13:47:41] FIRING: [3x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:48:12] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, Quarry, Patch-For-Review: Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407#9879844 (KCVelaga_WMF) @fnegri thanks for the update!
[13:49:50] (approved) sstefanova: d/changelog: bump to 0.0.7 [repos/cloud/toolforge/envvars-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/46
[13:49:54] (merge) sstefanova: d/changelog: bump to 0.0.7 [repos/cloud/toolforge/envvars-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/46
[13:50:14] (update) sstefanova: d/changelog: bump to 0.0.7 [repos/cloud/toolforge/envvars-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/46
[13:52:41] RESOLVED: [3x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:06:00] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9879969 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1035.eqiad.wmnet with OS bullseye
[14:07:50] (update) aborrero: functional-tests: add pod-policy smoke test [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/325
[14:17:18] (update) dcaro: maintain_kubeusers: add support for kyverno policies [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/18 (https://phabricator.wikimedia.org/T279110) (owner: aborrero)
[14:22:19] Quarry: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374#9880054 (SD0001) Resolved→Open Many failures from today. (Could it be a coincidence that it's occurring only on the days someone is trying to test T348407?)
[14:24:08] Tool-wd-image-positions: Logging in causes the user to go back to the index page - https://phabricator.wikimedia.org/T367188 (Abbe98) NEW
[14:29:24] Quarry: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374#9880096 (fnegri) I suspect they are related yes. Maybe Quarry is trying to connect to another database but using ToolsDB credentials, or vice versa.
[14:30:16] (update) aborrero: maintain_kubeusers: add support for kyverno policies [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/18 (https://phabricator.wikimedia.org/T279110)
[14:30:43] cloud-services-team, Toolforge (Toolforge iteration 11): [infra,k8s,monitoring] Add an alert to warn when the prometheus k8s cert is about to expire - https://phabricator.wikimedia.org/T366579#9880100 (dcaro) This seems a bit trickier than just adapting the above query, still investigating how to get tha...
[14:31:51] (update) aborrero: maintain_kubeusers: add support for kyverno policies [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/18 (https://phabricator.wikimedia.org/T279110)
[14:37:28] (update) aborrero: functional-tests: add pod-policy smoke test [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/325
[14:59:06] FIRING: CephSlowOps: Ceph cluster in eqiad has 4 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[14:59:06] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T367191 (phaultfinder) NEW
[14:59:06] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:04:48] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 324 bytes in 60.004 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[15:04:48] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS
- https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [15:04:48] FIRING: [3x] InstanceDown: Project tools instance tools-k8s-worker-nfs-22 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:06:56] FIRING: InstanceDown: Project cloudinfra instance enc-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:06:56] FIRING: InstanceDown: Project project-proxy instance project-proxy-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:06:56] FIRING: InstanceDown: Project metricsinfra instance metricsinfra-alertmanager-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:06:56] FIRING: [7x] InstanceDown: Project toolsbeta instance toolsbeta-harbor-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:06:56] FIRING: [2x] InstanceDown: Project gitlab-runners instance runner-1023 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:06:56] FIRING: WidespreadInstanceDown: Widespread instances down in project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:06:56] FIRING: WidespreadInstanceDown: Widespread instances down in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:06:56] FIRING: WidespreadInstanceDown: Widespread instances down in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:06:56] FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:06:56] FIRING: WidespreadInstanceDown: Widespread instances down in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:06:56] FIRING: SystemdUnitDown: The service unit drain_rabbitmq_notification_error.service is in failed status on host 
cloudrabbit1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudrabbit1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:07:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [15:08:10] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 2425 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [15:08:15] FIRING: CloudinfraMariaDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCloudinfraMariaDBWritableState [15:19:05] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [15:19:05] FIRING: [2x] InstanceDown: Project cvn instance cvn-apache10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:19:05] FIRING: InstanceDown: Project extdist instance extdist-06 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:19:05] FIRING: WidespreadInstanceDown: Widespread instances down in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:19:05] FIRING: WidespreadInstanceDown: Widespread instances down in 
project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:19:05] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [15:19:05] FIRING: WidespreadInstanceDown: Widespread instances down in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:19:05] FIRING: [2x] InstanceDown: Project paws instance paws-nfs-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:19:05] FIRING: InstanceDown: Project clouddb-services instance clouddb-services-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:19:05] FIRING: [6x] InstanceDown: Project metricsinfra instance metricsinfra-alertmanager-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:19:05] FIRING: [2x] ProbeDown: Service toolsbeta-test-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:19:05] FIRING: [13x] InstanceDown: Project toolsbeta instance toolsbeta-harbor-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:19:05] FIRING: [7x] InstanceDown: Project gitlab-runners instance gitlab-runners-puppetserver-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:19:05] FIRING: [2x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:19:05] FIRING: HarborComponentDown: No data about Harbor components found. 
#page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown [15:19:05] FIRING: ToolforgeKubernetesNodeNotReady: Kubernetes node toolsbeta-test-k8s-ingress-6 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady [15:19:05] FIRING: [4x] InstanceDown: Project cvn instance cvn-apache10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:19:05] FIRING: [8x] InstanceDown: Project metricsinfra instance metricsinfra-alertmanager-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:19:05] FIRING: [25x] InstanceDown: Project toolsbeta instance toolsbeta-acme-chief-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:19:05] FIRING: [8x] InstanceDown: Project gitlab-runners instance gitlab-runners-puppetserver-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:19:05] FIRING: [4x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:19:05] FIRING: OOM: OOM killer active on cloudcephmon1001:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM [15:19:05] FIRING: [2x] ToolforgeKubernetesNodeNotReady: Kubernetes node toolsbeta-test-k8s-ingress-6 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady [15:24:17] FIRING: CephSlowOps: Ceph cluster in eqiad has 5127 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [15:24:17] FIRING: [6x] InstanceDown: Project metricsinfra instance metricsinfra-alertmanager-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:24:17] FIRING: [3x] ProbeDown: Service toolsbeta-proxy-6:443 has failed probes (http_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:24:17] FIRING: [22x] InstanceDown: Project toolsbeta instance toolsbeta-acme-chief-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:24:17] FIRING: [8x] InstanceDown: Project gitlab-runners instance gitlab-runners-puppetserver-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:24:17] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 5127 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - 
https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [15:24:17] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 324 bytes in 81.080 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [15:24:17] FIRING: [4x] ProbeDown: Service toolsbeta-proxy-6:443 has failed probes (http_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:24:17] RESOLVED: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown [15:24:17] RESOLVED: OOM: OOM killer active on cloudcephmon1001:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM [15:24:17] FIRING: [4x] InstanceDown: Project cvn instance cvn-apache10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:24:17] RESOLVED: InstanceDown: Project extdist instance extdist-06 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:24:17] RESOLVED: WidespreadInstanceDown: Widespread instances down in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:24:24] FIRING: CephSlowOps: Ceph cluster in eqiad has 882 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - 
https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [15:24:28] FIRING: WidespreadInstanceDown: Widespread instances down in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:25:21] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [15:25:21] FIRING: [5x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:25:21] RESOLVED: WidespreadInstanceDown: Widespread instances down in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:25:21] FIRING: [2x] InstanceDown: Project paws instance paws-nfs-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:25:21] FIRING: [82x] InstanceDown: Project tools instance tools-acme-chief-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:25:21] RESOLVED: InstanceDown: Project clouddb-services instance clouddb-services-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:25:21] RESOLVED: CloudinfraMariaDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCloudinfraMariaDBWritableState [15:25:21] FIRING: [16x] InstanceDown: Project cloudinfra instance cloud-cumin-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:25:21] FIRING: [6x] InstanceDown: Project project-proxy instance maps-proxy-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:25:21] FIRING: ToolforgeKubernetesNodeNotReady: Multiple Kubernetes nodes are not ready #page - 
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady [15:25:25] RESOLVED: [6x] InstanceDown: Project project-proxy instance maps-proxy-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:25:25] FIRING: [82x] InstanceDown: Project tools instance tools-acme-chief-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:25:25] RESOLVED: WidespreadInstanceDown: Widespread instances down in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:25:29] RESOLVED: WidespreadInstanceDown: Widespread instances down in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:25:32] RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:25:36] FIRING: [16x] InstanceDown: Project cloudinfra instance cloud-cumin-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:25:39] FIRING: [9x] InstanceDown: Project metricsinfra instance metricsinfra-alertmanager-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:26:06] FIRING: [5x] ProbeDown: Service toolsbeta-proxy-6:443 has failed probes (http_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:26:28] FIRING: [25x] InstanceDown: Project toolsbeta instance toolsbeta-acme-chief-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:26:37] RESOLVED: [8x] InstanceDown: Project gitlab-runners instance gitlab-runners-puppetserver-01 is down - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:26:37] RESOLVED: WidespreadInstanceDown: Widespread instances down in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:26:37] RESOLVED: WidespreadInstanceDown: Widespread instances down in project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:26:37] RESOLVED: WidespreadInstanceDown: Widespread instances down in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:26:38] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 9.576 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [15:28:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [15:28:38] RESOLVED: [2x] ToolforgeKubernetesNodeNotReady: Kubernetes node toolsbeta-test-k8s-ingress-6 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady [15:28:45] RESOLVED: [4x] InstanceDown: Project cvn instance cvn-apache10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:29:00] RESOLVED: WidespreadInstanceDown: Widespread instances down in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [15:29:41] FIRING: PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. 
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted [15:29:44] RESOLVED: [74x] InstanceDown: Project tools instance tools-acme-chief-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:29:44] FIRING: [5x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:29:44] RESOLVED: [2x] InstanceDown: Project paws instance paws-nfs-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:29:47] RESOLVED: [14x] InstanceDown: Project cloudinfra instance cloud-cumin-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:29:58] RESOLVED: ToolforgeKubernetesNodeNotReady: Multiple Kubernetes nodes are not ready #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady [15:30:28] RESOLVED: [8x] InstanceDown: Project metricsinfra instance metricsinfra-alertmanager-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:30:54] RESOLVED: [5x] ProbeDown: Service toolsbeta-proxy-6:443 has failed probes (http_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:31:28] RESOLVED: [21x] InstanceDown: Project toolsbeta instance toolsbeta-acme-chief-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown 
[15:31:56] FIRING: [4x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:34:44] FIRING: [5x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:36:56] RESOLVED: [4x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:39:44] RESOLVED: [5x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:42:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance tools-redis-6 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:48:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [15:50:54] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all NFS workers [15:50:54] !log taavi@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for all NFS workers [15:51:18] !log taavi@cloudcumin1001 tools START - Cookbook 
wmcs.toolforge.k8s.reboot for all NFS workers [15:54:41] RESOLVED: PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted [15:57:28] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance tools-redis-6 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:58:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [16:05:11] FIRING: [2x] PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. 
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted [16:17:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-39 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [16:22:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-24 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [16:22:51] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [16:24:41] 10VPS-project-Codesearch, 10MediaWiki-Vagrant: Hound does not seem to index all files in mediawiki/vagrant - https://phabricator.wikimedia.org/T367196#9880768 (10Lucas_Werkmeister_WMDE) > Maybe related to the line length (its over 2500 characters)? Probably yes, [maxLineLen = 2000](https://github.com/hound-se... [16:25:11] RESOLVED: PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. 
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted [16:27:03] FIRING: [7x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-12 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [16:27:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-services-05 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [16:27:51] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [16:31:52] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console [16:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:32:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-37 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [16:37:28] FIRING: [2x] PuppetAgentFailure: Puppet agent failure detected on instance tools-services-05 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure 
[16:38:14] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [16:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:38:22] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console [16:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:41:30] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [16:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:41:43] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console [16:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:42:25] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [16:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:43:25] 10Tool-schedule-deployment: Leave a comment on the Gerrit change when it is scheduled for a backport - https://phabricator.wikimedia.org/T366763#9880907 (10bd808) 05Open→03In progress p:05Triage→03Medium a:03bd808 I am working on some options for this to show folks and ask which they like better. [16:47:28] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [16:52:29] 06cloud-services-team, 10Toolforge: Consider adding `kubectl`, `webservice`, and `toolforge` binaries to shell container images - https://phabricator.wikimedia.org/T360818#9880943 (10bd808) {T363027} might be one way to make this functionality possible. 
[16:55:45] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T367191#9880977 (10dcaro) 05Open→03Resolved a:03dcaro [17:34:45] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for all NFS workers [17:35:05] 10Tool-schedule-deployment, 10WikimediaDebug: Integrate schedule-deployment with WikimediaDebug - https://phabricator.wikimedia.org/T367213 (10LucasWerkmeister) 03NEW [17:39:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:49:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:59:48] 06cloud-services-team, 10wikitech.wikimedia.org, 13Patch-For-Review: Disable SSH key management on Wikitech - https://phabricator.wikimedia.org/T359544#9881358 (10taavi) 05Open→03Resolved [18:00:25] 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org, 13Patch-For-Review: `Nova Resource:` namespace should be declared in wmf-config, not in Extension:OpenStackManager - https://phabricator.wikimedia.org/T338477#9881362 (10taavi) 05Open→03Resolved [18:01:53] 06cloud-services-team, 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org, 13Patch-For-Review: Remove OpenStackManager from Wikitech - https://phabricator.wikimedia.org/T161553#9881383 (10taavi) [18:02:40] 10Tool-schedule-deployment, 10WikimediaDebug: Integrate schedule-deployment with WikimediaDebug - https://phabricator.wikimedia.org/T367213#9881386 
(10bd808) I believe this would need to work by adding something to the WikimediaDebug extension that changes what is rendered on the page. There are [[https://stac... [18:15:47] 06cloud-services-team, 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org, 13Patch-For-Review: Remove OpenStackManager from Wikitech - https://phabricator.wikimedia.org/T161553#9881479 (10Jdforrester-WMF) [18:15:53] 10MediaWiki-extensions-OpenStackManager, 06Diffusion-Repository-Administrators, 10Projects-Cleanup, 06translatewiki.net, 10Wikimedia-GitHub: Archive the OpenStackManager extension - https://phabricator.wikimedia.org/T367220 (10taavi) 03NEW [18:18:39] 10MediaWiki-extensions-OpenStackManager, 06Diffusion-Repository-Administrators, 10Projects-Cleanup, 06translatewiki.net, 10Wikimedia-GitHub: Archive the OpenStackManager extension - https://phabricator.wikimedia.org/T367220#9881503 (10taavi) [18:18:43] 06cloud-services-team, 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org, 13Patch-For-Review: Remove OpenStackManager from Wikitech - https://phabricator.wikimedia.org/T161553#9881495 (10taavi) [18:18:50] 06cloud-services-team, 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org, 13Patch-For-Review: Remove OpenStackManager from Wikitech - https://phabricator.wikimedia.org/T161553#9881504 (10taavi) [18:19:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-24 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [18:23:00] 10MediaWiki-extensions-OpenStackManager, 06Infrastructure-Foundations, 07LDAP: Make a group blacklist for ssh key changes - https://phabricator.wikimedia.org/T36651#9881549 (10taavi) 05Open→03Declined Declining, 
OSM is being archived. I don't think this is something we're interested in these days, so... [18:23:52] 10MediaWiki-extensions-OpenStackManager: Possible to register mail addresses with trailing newlines - https://phabricator.wikimedia.org/T73692#9881571 (10taavi) 05Open→03Invalid Declining, OSM is being archived. If this is still an issue in its replacements please file new bugs against them. [18:25:47] 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org: Wikitech not showing ssh keys for some users - https://phabricator.wikimedia.org/T221427#9881598 (10taavi) 05Open→03Declined Declining, OSM is being archived. If this is still a bug in Bitu or Striker please file new bugs against them. [18:25:54] 10MediaWiki-extensions-OpenStackManager, 10MediaWiki-Vagrant: Update Vagrant role for Extension:OpenStackManager - https://phabricator.wikimedia.org/T103874#9881588 (10taavi) 05Open→03Declined Declining, OSM is being archived. [18:25:57] 10Cloud-Services, 10wikitech.wikimedia.org, 13Patch-For-Review: Grant shell user right with project memberships and remove autocreation of shell requests - https://phabricator.wikimedia.org/T97334#9881591 (10taavi) The #Cloud-Services project tag is not intended to have any tasks. Please check the list o... [18:26:05] 10MediaWiki-extensions-OpenStackManager, 07I18n: keypairimported message should be corrected - https://phabricator.wikimedia.org/T252234#9881605 (10taavi) 05Open→03Declined Declining, OSM is being archived. If this is an issue with its replacements please file new bugs against them. [18:26:09] 10MediaWiki-extensions-OpenStackManager: OpenStackManager should adjust selenium tests, rather than just not run them - https://phabricator.wikimedia.org/T250420#9881602 (10taavi) 05Open→03Declined Declining, OSM is being archived.
[18:26:14] 06cloud-services-team, 10wikitech.wikimedia.org, 10Wikimedia-Site-requests: Rename Wikitech Nova resource: namespace to something that is more commonly used - https://phabricator.wikimedia.org/T275796#9881607 (10taavi) [18:26:20] 10MediaWiki-extensions-OpenStackManager: Update OpenStackManager to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T346554#9881612 (10taavi) 05Open→03Declined Declining, OSM is being archived. [18:26:55] 10Striker, 10Bitu, 06Infrastructure-Foundations, 07Security: Special:NovaKey should have a message not to add production keys - https://phabricator.wikimedia.org/T276761#9881609 (10taavi) [18:45:24] 10PAWS: update singleuser to 24.04 - https://phabricator.wikimedia.org/T366058#9881753 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/417 [18:45:40] vivian-rook closed https://github.com/toolforge/paws/pull/417 [18:46:14] 10PAWS: update singleuser to 24.04 - https://phabricator.wikimedia.org/T366058#9881754 (10rook) 05Open→03Resolved a:03rook [19:14:27] 10Tool-schedule-deployment: ScheduleDeploymentBot can add more than 6 patches to a backport window - https://phabricator.wikimedia.org/T367229 (10Pppery) 03NEW [19:15:14] 10Tool-schedule-deployment: ScheduleDeploymentBot can add more than 6 patches to a backport window - https://phabricator.wikimedia.org/T367229#9881878 (10Pppery) [19:25:34] 10Tool-schedule-deployment: ScheduleDeploymentBot can add more than 6 patches to a backport window - https://phabricator.wikimedia.org/T367229#9881931 (10Pppery) Sorry wrong ticket [19:37:41] FIRING: [2x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:38:20] 10Tool-wd-image-positions: Logging in causes the user to go back to the 
index page - https://phabricator.wikimedia.org/T367188#9882001 (10LucasWerkmeister) Yeah, I implemented this for Wikidata Lexeme Forms but not most (any?) of my other tools. Shouldn’t be too hard to add though ^^ [19:47:41] RESOLVED: [2x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [20:09:28] 10Tool-schedule-deployment: ScheduleDeploymentBot can add more than 6 patches to a backport window - https://phabricator.wikimedia.org/T367229#9882091 (10bd808) I thought about having the bot count patches and hide windows that already have 6+ attached, but initially decided that this wasn't needed. The automati... [20:09:58] 10Tool-schedule-deployment: ScheduleDeploymentBot should refuse to add more than 6 patches to a backport window - https://phabricator.wikimedia.org/T367229#9882094 (10bd808) p:05Triage→03Medium [20:09:59] 10Tool-schedule-deployment: ScheduleDeploymentBot should refuse to add more than 6 patches to a backport window - https://phabricator.wikimedia.org/T367229#9882095 (10bd808) [20:16:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [20:16:07] 06cloud-services-team: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235 (10phaultfinder) 03NEW [21:01:45] 10Tool-Global-user-contributions, 06Stewards-and-global-tools, 07Epic, 10Temporary accounts (Create/update essential tools/anti-abuse 
management): [Epic] Implement global user contributions feature - https://phabricator.wikimedia.org/T337089#9882245 (10MusikAnimal) [21:09:44] 10Quarry: [bug] Trouble running Quarry queries - https://phabricator.wikimedia.org/T367237 (10Liz) 03NEW [21:12:26] 10Quarry: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374#9882298 (10JJMC89) [21:12:28] 10Quarry: [bug] Trouble running Quarry queries - https://phabricator.wikimedia.org/T367237#9882296 (10JJMC89) →14Duplicate dup:03T365374 [22:07:41] FIRING: [2x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:12:41] FIRING: [3x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:17:41] FIRING: [3x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:22:41] RESOLVED: [3x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:51:15] 10Quarry: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374#9882558 (10Liz) Sorry to post 
a duplicate notice. I couldn't find this one when I wanted to post about this bug. [22:58:36] (03open) 10raymond-ndibe: [jobs-api] move jobs load to backend [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T366209) [23:45:13] 10Tool-schedule-deployment: Leave a comment on the Gerrit change when it is scheduled for a backport - https://phabricator.wikimedia.org/T366763#9882611 (10bd808) I made a helper function that can build a message describing the scheduled deployment and providing links to both the `[[Deployment]]` page and zonest... [23:57:43] 10Tools: Re-implement WikiShootMe as a customisable frontend JS app - https://phabricator.wikimedia.org/T364142#9882634 (10TuukkaH) In the hackathon, we ended up working on a wider solution (T364174) which this map could be a part of. (The idea is to improve the Commons Upload Wizard, which the "upload" buttons... [23:58:43] 10Tools: Re-implement WikiShootMe as a customisable frontend JS app - https://phabricator.wikimedia.org/T364142#9882637 (10TuukkaH)