[00:00:56] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:11:51] RESOLVED: TfInfraTestApplyFailed: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[00:28:01] (merge) legoktm: Update pagelinks access [toolforge-repos/poty-stuff] - https://gitlab.wikimedia.org/toolforge-repos/poty-stuff/-/merge_requests/2 (https://phabricator.wikimedia.org/T299947) (owner: tacsipacsi)
[00:29:09] FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[00:29:09] FIRING: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[00:29:39] FIRING: TektonUpMetricUnknown: Tekton might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonUpMetricUnknown
[00:29:39] FIRING: ToolforgeKubernetesNodeNotReady: Multiple Kubernetes nodes are not ready #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[00:29:44] FIRING: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown
[00:29:50] FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[00:31:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:34:09] RESOLVED: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[00:34:09] RESOLVED: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[00:34:39] RESOLVED: TektonUpMetricUnknown: Tekton might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonUpMetricUnknown
[00:34:39] RESOLVED: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[00:34:44] RESOLVED: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown
[00:34:50] RESOLVED: ToolforgeKubernetesNodeNotReady: Multiple Kubernetes nodes are not ready #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[01:08:29] FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:50:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:00:56] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[03:13:29] RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:17:14] Toolforge-standards-committee: Adoption request for Yapperbot - https://phabricator.wikimedia.org/T361426#10094727 (DavidTornheim) FYI. It's still not working: https://en.wikipedia.org/w/index.php?title=User_talk%3AYapperbot&diff=1242503263&oldid=1241904699 I am having trouble getting into Toolforge SSH to...
[05:08:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:13:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:13:29] FIRING: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown
[05:13:34] FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[05:13:59] FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[05:18:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:18:29] RESOLVED: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown
[05:18:34] RESOLVED: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[05:18:59] RESOLVED: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[05:23:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:48:47] FIRING: ToolforgeKubernetesNodeNotReady: Multiple Kubernetes nodes are not ready #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[05:48:47] FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[05:48:50] FIRING: Toolforge Kyverno no policy resources: Toolforge Kyverno has no policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_no_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+no+policy+resources
[05:53:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:53:47] RESOLVED: ToolforgeKubernetesNodeNotReady: Multiple Kubernetes nodes are not ready #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[05:53:47] RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[05:53:50] RESOLVED: Toolforge Kyverno no policy resources: Toolforge Kyverno has no policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_no_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+no+policy+resources
[06:13:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[06:29:28] FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[06:49:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:02:47] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10094808 (dcaro) p:Unbreak!→Medium Currently cleaning up the old nodes, but everything seems stable
[07:02:57] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10094814 (dcaro) >>! In T373243#10091656, @MBH wrote: > When I'm trying to build an image from my github repo, I got this strange issue: > > `unable to access 'https://git...
[07:04:56] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:07:58] (open) sstefanova: fix upgrade [repos/cloud/toolforge/calico] - https://gitlab.wikimedia.org/repos/cloud/toolforge/calico/-/merge_requests/9
[07:16:28] (approved) dcaro: fix upgrade [repos/cloud/toolforge/calico] - https://gitlab.wikimedia.org/repos/cloud/toolforge/calico/-/merge_requests/9 (owner: sstefanova)
[07:20:22] (merge) sstefanova: fix upgrade [repos/cloud/toolforge/calico] - https://gitlab.wikimedia.org/repos/cloud/toolforge/calico/-/merge_requests/9
[07:21:55] (update) project_1317_bot_df3177307bed93c3f34e421e26c86e38: calico: bump to 0.0.8-20240731084636-9937ff2a [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/462 (https://phabricator.wikimedia.org/T370046)
[07:21:58] (update) project_1317_bot_df3177307bed93c3f34e421e26c86e38: calico: bump to 0.0.8-20240731084636-9937ff2a [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/462 (https://phabricator.wikimedia.org/T370046)
[07:36:23] (PS1) David Caro: openstack: security and server group list [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1067227
[07:36:59] RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:43:28] FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:43:40] (open) countcount: Fix user does not exist error for users without groups [toolforge-repos/stimmberechtigung] - https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung/-/merge_requests/1
[07:48:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:50:15] (open) countcount: fix relative urls to stylesheet, image and toolforge [toolforge-repos/stimmberechtigung] - https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung/-/merge_requests/2
[07:53:28] RESOLVED: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:01:07] (open) countcount: Remove the unnecessary link to the "new" rules [toolforge-repos/stimmberechtigung] - https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung/-/merge_requests/3
[08:01:50] (update) countcount: Remove the unnecessary link to the "new" rules [toolforge-repos/stimmberechtigung] - https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung/-/merge_requests/3
[08:04:58] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:19:05] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster
[08:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:19:32] !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker role in the tools cluster
[08:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:20:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:23:41] (open) countcount: Remove ptwiki support. [toolforge-repos/stimmberechtigung] - https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung/-/merge_requests/4
[08:24:38] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-4 (T373243)
[08:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:24:42] T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243
[08:26:28] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-4 (T373243)
[08:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:26:54] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-15 (T373243)
[08:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:27:37] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373233#10094943 (Curb_Safe_Charmer) Stalled→Resolved Appears to be working again now, but no idea why.
[08:27:57] !log sstefanova@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component calico
[08:28:13] !log sstefanova@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component calico
[08:28:40] (open) countcount: Update repo location and authors [toolforge-repos/stimmberechtigung] - https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung/-/merge_requests/5
[08:29:13] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-15 (T373243)
[08:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:29:23] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-18 (T373243)
[08:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:30:32] Tool-stimmberechtigung: Migrate tool-stimmberechtigung from GitHub to Wikimedia Gitlab - https://phabricator.wikimedia.org/T373242#10094965 (Count_Count) Open→Resolved Now at https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung
[08:30:56] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:31:12] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-18 (T373243)
[08:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:31:16] T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243
[08:31:19] (update) countcount: fix relative urls to stylesheet, image and toolforge [toolforge-repos/stimmberechtigung] - https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung/-/merge_requests/2
[08:31:21] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-25 (T373243)
[08:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:31:46] (update) countcount: Fix user does not exist error for users without groups [toolforge-repos/stimmberechtigung] - https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung/-/merge_requests/1
[08:32:44] (merge) countcount: Fix user does not exist error for users without groups [toolforge-repos/stimmberechtigung] - https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung/-/merge_requests/1
[08:32:56] (merge) countcount: fix relative urls to stylesheet, image and toolforge [toolforge-repos/stimmberechtigung] - https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung/-/merge_requests/2
[08:33:06] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-25 (T373243)
[08:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:33:24] (merge) countcount: Remove the unnecessary link to the "new" rules [toolforge-repos/stimmberechtigung] - https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung/-/merge_requests/3
[08:33:39] (merge) countcount: Remove ptwiki support. [toolforge-repos/stimmberechtigung] - https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung/-/merge_requests/4
[08:33:55] !log dcaro@cloudcumin1001 usdtest START - Cookbook wmcs.vps.create_project for project usdtest in eqiad1 (T373386)
[08:33:55] dcaro@cloudcumin1001: Unknown project "usdtest"
[08:33:55] T373386: Request creation of usdtest VPS project - https://phabricator.wikimedia.org/T373386
[08:34:00] (merge) countcount: Update repo location and authors [toolforge-repos/stimmberechtigung] - https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung/-/merge_requests/5
[08:34:06] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-51 (T373243)
[08:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:34:32] !log dcaro@cloudcumin1001 usdtest END (PASS) - Cookbook wmcs.vps.create_project (exit_code=0) for project usdtest in eqiad1 (T373386)
[08:34:32] dcaro@cloudcumin1001: Unknown project "usdtest"
[08:35:11] !log dcaro@cloudcumin1001 usdtest START - Cookbook wmcs.vps.add_user_to_project for user 'sbassett' in role 'member' (T373386)
[08:35:11] dcaro@cloudcumin1001: Unknown project "usdtest"
[08:35:17] !log dcaro@cloudcumin1001 usdtest END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'sbassett' in role 'member' (T373386)
[08:35:17] dcaro@cloudcumin1001: Unknown project "usdtest"
[08:35:23] !log dcaro@cloudcumin1001 usdtest START - Cookbook wmcs.vps.add_user_to_project for user 'mmartorana' in role 'member' (T373386)
[08:35:23] dcaro@cloudcumin1001: Unknown project "usdtest"
[08:35:29] !log dcaro@cloudcumin1001 usdtest END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'mmartorana' in role 'member' (T373386)
[08:35:29] dcaro@cloudcumin1001: Unknown project "usdtest"
[08:35:41] !log dcaro@cloudcumin1001 usdtest START - Cookbook wmcs.vps.add_user_to_project for user 'acooper' in role 'member' (T373386)
[08:35:41] dcaro@cloudcumin1001: Unknown project "usdtest"
[08:35:47] !log dcaro@cloudcumin1001 usdtest END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'acooper' in role 'member' (T373386)
[08:35:48] dcaro@cloudcumin1001: Unknown project "usdtest"
[08:35:51] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-51 (T373243)
[08:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:37:07] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-52 (T373243)
[08:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:37:12] T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243
[08:38:58] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-52 (T373243)
[08:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:39:58] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-25 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:40:44] cloud-services-team, Cloud-VPS (Project-requests): Request creation of usdtest VPS project - https://phabricator.wikimedia.org/T373386#10094990 (dcaro) Open→Resolved a:dcaro This is done! Added you all as admins with the default quota, let me know if you need anything else, enjoy!
[08:41:01] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component calico
[08:44:58] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-25 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:46:27] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component calico
[08:50:46] (update) sstefanova: calico: bump to 0.0.9-20240827072036-84ea5a22 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/462 (https://phabricator.wikimedia.org/T370046) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[08:51:42] FIRING: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown
[08:53:36] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-104 (T373243)
[08:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:53:40] T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243
[08:55:28] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-104 (T373243)
[08:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:56:27] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10095002 (MBH) Yes, problem is fixed, thanks.
[08:56:58] !log sstefanova@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component calico
[08:57:13] !log sstefanova@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component calico
[08:59:37] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component calico
[09:05:23] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component calico
[09:06:41] (update) sstefanova: calico: bump to 0.0.9-20240827072036-84ea5a22 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/462 (https://phabricator.wikimedia.org/T370046) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[09:07:50] (open) countcount: Update replica database hostname [toolforge-repos/stimmberechtigung] - https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung/-/merge_requests/6
[09:07:58] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:08:11] (update) countcount: Update replica database hostname [toolforge-repos/stimmberechtigung] - https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung/-/merge_requests/6
[09:08:32] (merge) countcount: Update replica database hostname [toolforge-repos/stimmberechtigung] - https://gitlab.wikimedia.org/toolforge-repos/stimmberechtigung/-/merge_requests/6
[09:09:42] (CR) David Caro: [C:+2] toolforge.component.deploy: remove the k8s prefix [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059890 (owner: David Caro)
[09:09:53] (CR) David Caro: [C:+2] toolforge.component.deploy: use bump_ as default branch [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059905 (owner: David Caro)
[09:12:22] (CR) David Caro: "Technically you don't need the wmcs.yaml file at all, that's only needed in cloudcumin as we can't change the spicerack config directly." [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059921 (owner: David Caro)
[09:12:29] (CR) David Caro: "Acknowledged" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059921 (owner: David Caro)
[09:12:43] (CR) David Caro: [C:+2] wmcs_libs.common: add run_script [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059906 (owner: David Caro)
[09:12:49] (CR) David Caro: [C:+2] toolforge.run_tests: use the functional tests [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059907 (owner: David Caro)
[09:13:37] (update) sstefanova: calico: bump to 0.0.9-20240827072036-84ea5a22 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/462 (https://phabricator.wikimedia.org/T370046) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[09:13:41] (approved) sstefanova: calico: bump to 0.0.9-20240827072036-84ea5a22 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/462 (https://phabricator.wikimedia.org/T370046) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[09:13:46] (merge) sstefanova: calico: bump to 0.0.9-20240827072036-84ea5a22 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/462 (https://phabricator.wikimedia.org/T370046) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[09:13:48] Tool-stimmberechtigung: Solve bug in Tool-stimmberechtigung as reported - https://phabricator.wikimedia.org/T373241#10095050 (Count_Count) Open→Resolved
[09:13:58] Tool-stimmberechtigung: Merge Hgzh Github Pull Request to Tool-stimmberechtigung - https://phabricator.wikimedia.org/T373240#10095053 (Count_Count) Open→Resolved
[09:14:07] (CR) David Caro: [C:+2] openstack.tofu: use run_script instead of reimplementing it [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059919 (owner: David Caro)
[09:14:08] (Merged) jenkins-bot: toolforge.component.deploy: remove the k8s prefix [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059890 (owner: David Caro)
[09:14:14] (CR) David Caro: [C:+2] toolforge.deploy: run tests and add note to MR [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059921 (owner: David Caro)
[09:14:14] cloud-services-team, Toolforge (Toolforge iteration 14), Patch-For-Review: toolforge: upgrade all Kubernetes components to versions supporting Kubernetes 1.26 - https://phabricator.wikimedia.org/T370046#10095029 (Slst2020) Stalled→In progress
[09:14:15] (Merged) jenkins-bot: toolforge.component.deploy: use bump_ as default branch [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059905 (owner: David Caro)
[09:14:44] (CR) David Caro: [C:+2] deploy: wait by default for the k8s components to finish deploying [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1061957 (owner: David Caro)
[09:15:56] (Merged) jenkins-bot: wmcs_libs.common: add run_script [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059906 (owner: David Caro)
[09:16:39] (Merged) jenkins-bot: toolforge.run_tests: use the functional tests [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059907 (owner: David Caro)
[09:17:58] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:19:08] (Merged) jenkins-bot: openstack.tofu: use run_script instead of reimplementing it [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059919 (owner: David Caro)
[09:19:09] (Merged) jenkins-bot: toolforge.deploy: run tests and add note to MR [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059921 (owner: David Caro)
[09:19:25] (Merged) jenkins-bot: deploy: wait by default for the k8s components to finish deploying [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1061957 (owner: David Caro)
[09:36:25] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster
[09:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:46:20] !log dcaro@urcuchillay tools Added a new k8s worker tools-k8s-worker-108.tools.eqiad1.wikimedia.cloud to the cluster
[09:46:20] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker role in the tools cluster
[09:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:53:46] Toolforge (Toolforge iteration 14): DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10095142 (dcaro) Open→Resolved a:dcaro I'll close this as it's been stable for a while and all the misbehaving nodes have been dele...
[09:54:32] Toolforge (Toolforge iteration 14): DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10095146 (dcaro)
[09:54:35] Toolforge (Toolforge iteration 14): DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10095147 (Stuartyeates) The issues I was seeing previously appear to have all resolved themselves, thank you.
[10:15:07] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 14), Patch-For-Review: toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143#10095241 (dcaro) a:dcaro
[10:15:37] Data-Services, Data-Engineering, Data-Platform-SRE, DBA: Prepare and check storage layer for cswikivoyage - https://phabricator.wikimedia.org/T370912#10095249 (Ladsgroup) How can I escalate this? It's been a month now and community is waiting.
[10:27:38] Tools: Flickr2 Commons is currently down - https://phabricator.wikimedia.org/T372451#10095284 (Magnus) Open→Resolved a:Magnus tl;dr: I was on vacation with no ssh access. Webservice restarted (which could/should be done by Toolforge automatically but isn't), seems to work again.
[10:37:52] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 14), Patch-For-Review: toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143#10095313 (Slst2020) In progress→Resolved
[10:45:30] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 14), Patch-For-Review: toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143#10095362 (dcaro) Resolved→Open So far, the only ways to reduce memory usage that I've seen are: * Increase scraping int...
[10:54:30] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate coibot.linkwatcher.eqiad.wmflabs is about to expire in 11d 13h 26m 37s - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetCertificateAboutToExpire - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetCertificateAboutToExpire
[10:54:58] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:56:38] Data-Services, Data-Engineering, DBA, Data-Platform-SRE (2024.08.17 - 2024.09.06): Prepare and check storage layer for cswikivoyage - https://phabricator.wikimedia.org/T370912#10095405 (BTullis) a:BTullis
[11:01:36] FIRING: Toolforge Kyverno unknown state: Toolforge Kyverno has unknown state. Kyverno might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_unknown_state - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+unknown+state
[11:01:36] FIRING: Toolforge Kyverno no policy resources: Toolforge Kyverno has no policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_no_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+no+policy+resources
[11:04:22] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 14), Patch-For-Review: toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143#10095418 (dcaro) I have manually deactivated three scrape targets one by one: * kyverno (was in the top list of series) - did no...
[11:23:40] Data-Services, Data-Engineering, DBA, Data-Platform-SRE (2024.08.17 - 2024.09.06): Prepare and check storage layer for cswikivoyage - https://phabricator.wikimedia.org/T370912#10095469 (BTullis) >>! In T370912#10095249, @Ladsgroup wrote: > How can I escalate this? It's been a month now and commun...
[11:36:19] Data-Services, Data-Engineering, DBA, Data-Platform-SRE (2024.08.17 - 2024.09.06): Prepare and check storage layer for cswikivoyage - https://phabricator.wikimedia.org/T370912#10095502 (BTullis) Open→Resolved p:Triage→High This is done now. The `sre.wikireplicas.add-wiki` cook...
[11:55:25] cloud-services-team, Toolforge (Toolforge iteration 14), Patch-For-Review: toolforge: upgrade all Kubernetes components to versions supporting Kubernetes 1.26 - https://phabricator.wikimedia.org/T370046#10095567 (Slst2020) In progress→Resolved
[12:06:12] !log sstefanova@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.image.copy_to_registry
[12:06:13] !log sstefanova@cloudcumin1001 tools Updating container image docker-registry.tools.wmflabs.org/nginx-ingress-controller:v1.11.2
[12:06:35] !log sstefanova@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.image.copy_to_registry (exit_code=0)
[12:14:58] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 14), Patch-For-Review: toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143#10095610 (dcaro) Added kyverno back, seems stable still (<20G ram). I can't find any place where we use any of the statistics f...
[12:45:24] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 14), Patch-For-Review: toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143#10095730 (dcaro) We do, we use `nginx_ingress_controller_requests` and `nginx_ingress_controller_nginx_process_connections` only...
[12:46:27] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 14), Patch-For-Review: toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143#10095725 (dcaro) Interestingly enough, I've re-enabled the stats one by one, the culprit seems to be ingress: {F57303742} You...
[13:11:33] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373233#10095824 (Novem_Linguae) Was this bug intermittent? Maybe the root cause was {T373243}, which affected a lot of toolforge tools over the last few days. The timeline lines up almost perfectly (this...
[13:27:05] (approved) sstefanova: toolforge: add calico to deployment list [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/185 (owner: dcaro)
[13:27:25] (merge) dcaro: toolforge: add calico to deployment list [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/185
[13:27:26] (update) dcaro: toolforge: add calico to deployment list [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/185
[13:43:22] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373233#10095954 (KylieTastic) It wasn't intermittent for me and it started earlier than the ticket (can't remember when). I had no success for days.
[13:50:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:00:56] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:03:42] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "wikicommunityhealth" project Buster deprecation - https://phabricator.wikimedia.org/T367560#10096133 (Andrew) Hello @CristianCantoro, did you make any progress on this? Is there anything I can do to help things along?
[14:07:43] Tool-bub2, Internet-Archive: Proposal: Integrate Wikimedia Ecosystem within BUB2 tool - https://phabricator.wikimedia.org/T352150#10096147 (debt) Open→Declined Closing this ticket as declined - Outreachy Round 27 closed about a year ago. Outreachy Round 29 is currently seeking projects and me...
[14:07:52] Tool-bub2, Outreach-Programs-Projects, Patch-For-Review: Integrate Wikimedia Ecosystem within BUB2 tool - https://phabricator.wikimedia.org/T346386#10096163 (debt) Open→Resolved Closing this ticket as resolved as it looks like the work that was done on this ticket has been pushed to produ...
[14:38:42] (open) sstefanova: bump ingress-nginx to 4.11.2 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/491 (https://phabricator.wikimedia.org/T373043)
[14:39:28] (update) sstefanova: bump ingress-nginx to v1.11.2 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/491 (https://phabricator.wikimedia.org/T373043)
[15:09:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:13:20] Striker: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554#10096519 (Andrew) > Hi Andrew, > > I’ve deploy the new 2FA implementation to idm-test.wikimedia.org. If you sign in, you should see a button labeled "Manage two factor authentication”. If you click that you’ll be guid...
[15:14:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:14:53] Striker: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554#10096539 (Andrew) I think there are two ways to approach this: 1) Leave striker's auth setup mostly intact, and just replace the 2fa validation call to wikitech with a call to idm 2) Replace auth entirely with an oau...
[15:38:03] Striker: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554#10096607 (bd808) >>! In T359554#10096539, @Andrew wrote: > 2) Replace auth entirely with an oauth request to idp This seems reasonable to me, but should also be blocked from deployment until IDP and IDM implement {T35...
[15:42:40] cloud-services-team, Cloud-VPS (Project-requests): Request creation of usdtest VPS project - https://phabricator.wikimedia.org/T373386#10096616 (sbassett) Thanks!
[15:49:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:00:38] Striker: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554#10096692 (Andrew) >>! In T359554#10096607, @bd808 wrote: >>>! In T359554#10096539, @Andrew wrote: >> 2) Replace auth entirely with an oauth request to idp > > This seems reasonable to me, but should also be blocked fr...
[16:04:08] Striker: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554#10096720 (taavi) >>! In T359554#10096539, @Andrew wrote: > I think there are two ways to approach this: As the task description states, this task is about #2 as we (or at least I) want that to eventually happen. If you...
[16:04:22] (update) sstefanova: fix upgrade [repos/cloud/toolforge/calico] - https://gitlab.wikimedia.org/repos/cloud/toolforge/calico/-/merge_requests/9
[16:04:56] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:05:00] Striker: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554#10096724 (Andrew)
[16:05:30] Striker: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554#10096725 (Andrew) Too late, I hijacked! I will try to roll things back
[16:32:06] Striker: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554#10096949 (Andrew) I will make a separate task for case #1 but I am also going to orphan this ticket as it is not a requirement for the great wikitech migration.
[16:33:53] Striker: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554#10096954 (Andrew)
[16:33:56] cloud-services-team, wikitech.wikimedia.org, LDAP, Patch-For-Review: Replace wikitech as source of two-factor auth protection for developer accounts - https://phabricator.wikimedia.org/T359551#10096955 (Andrew)
[16:34:22] Horizon: Use IDP for authentication in Horizon - https://phabricator.wikimedia.org/T359590#10096962 (Andrew)
[16:34:31] cloud-services-team, wikitech.wikimedia.org, LDAP, Patch-For-Review: Replace wikitech as source of two-factor auth protection for developer accounts - https://phabricator.wikimedia.org/T359551#10096963 (Andrew)
[16:36:43] cloud-services-team, wikitech.wikimedia.org, LDAP: Striker: use idm for 2fa validation instead of wikitech - https://phabricator.wikimedia.org/T373461 (Andrew) NEW
[16:37:33] Striker: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554#10096981 (Andrew)
[16:38:03] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudcephosd1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:38:21] cloud-services-team, wikitech.wikimedia.org, LDAP: Horizon: use idm for 2fa validation instead of wikitech - https://phabricator.wikimedia.org/T373462 (Andrew) NEW
[16:38:52] cloud-services-team, wikitech.wikimedia.org, LDAP: Striker: use idm for 2fa validation instead of wikitech - https://phabricator.wikimedia.org/T373461#10097002 (Andrew) Simon writes: > I’ve deploy the new 2FA implementation to idm-test.wikimedia.org. If you sign in, you should see a button labeled "...
[16:40:10] Striker: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554#10096984 (Andrew) OK, this task is back to the case #2, log in entirely via idm. I've created T373461 for case #1 which we will do sooner.
[17:15:50] cloud-services-team, wikitech.wikimedia.org, LDAP: Striker: use idm for 2fa validation instead of wikitech - https://phabricator.wikimedia.org/T373461#10097123 (Andrew) I'm definitely going in circles here, but @bd808 suggests that we just skip ahead to https://phabricator.wikimedia.org/T359554 and l...
[17:23:03] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on cloudcephosd1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[17:25:52] Toolforge-standards-committee: Adoption request for Yapperbot - https://phabricator.wikimedia.org/T361426#10097169 (bd808) >>! In T361426#10094727, @DavidTornheim wrote: > FYI. It's still not working: > https://en.wikipedia.org/w/index.php?title=User_talk%3AYapperbot&diff=1242503263&oldid=1241904699 ` pani...
[17:30:25] (open) lucaswerkmeister: Use shlex.quote() to quote URLs for shell [toolforge-repos/logos-purge] - https://gitlab.wikimedia.org/toolforge-repos/logos-purge/-/merge_requests/1
[17:34:15] (merge) samtar: Use shlex.quote() to quote URLs for shell [toolforge-repos/logos-purge] - https://gitlab.wikimedia.org/toolforge-repos/logos-purge/-/merge_requests/1 (owner: lucaswerkmeister)
[17:39:42] (update) samtar: Use shlex.quote() to quote URLs for shell [toolforge-repos/logos-purge] - https://gitlab.wikimedia.org/toolforge-repos/logos-purge/-/merge_requests/1 (owner: lucaswerkmeister)
[17:42:04] Tool-logos-purge: Entering an invalid gerrit change ID breaks the tool - https://phabricator.wikimedia.org/T373469 (TheresNoTime) NEW
[17:46:02] (update) lucaswerkmeister: Draft: Use printf to purge images in a single command [toolforge-repos/logos-purge] - https://gitlab.wikimedia.org/toolforge-repos/logos-purge/-/merge_requests/2
[18:43:51] cloud-services-team, Cloud-VPS (Project-requests): Request creation of usdtest VPS project - https://phabricator.wikimedia.org/T373386#10097462 (sbassett) Resolved→Open Hey @dcaro - I think I needed to explicitly request a floating IP for this project. If that's not possible, could we reallocate...
[18:50:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:00:56] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:12:00] cloud-services-team, Cloud-VPS (Project-requests): Request creation of usdtest VPS project - https://phabricator.wikimedia.org/T373386#10097531 (bd808) >>! In T373386#10097462, @sbassett wrote: > I think I needed to explicitly request a floating IP for this project. If that's not possible, could we real...
[19:17:34] Cloud-VPS (Quota-requests): Request a CloudVPS floating IP for the usdtest account - https://phabricator.wikimedia.org/T373477 (sbassett) NEW
[19:17:56] Cloud-VPS (Quota-requests): Request a CloudVPS floating IP for the usdtest account - https://phabricator.wikimedia.org/T373477#10097556 (sbassett)
[19:17:58] cloud-services-team, Cloud-VPS (Project-requests): Request creation of usdtest VPS project - https://phabricator.wikimedia.org/T373386#10097557 (sbassett)
[19:25:17] (CR) Jean-Frédéric: [C:+2] "Done" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1064471 (https://phabricator.wikimedia.org/T174633) (owner: Lokal Profil)
[19:27:08] (Merged) jenkins-bot: Add LICENSE to repo [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1064471 (https://phabricator.wikimedia.org/T174633) (owner: Lokal Profil)
[19:27:58] (PS6) Jean-Frédéric: Use toolforge-jobs to install requirements during deployment [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065124 (https://phabricator.wikimedia.org/T319787)
[19:28:10] (PS6) Jean-Frédéric: Remove `composer update` step from build-php script [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065125
[19:45:19] Cloud-VPS (Quota-requests): Request a CloudVPS floating IP for the usdtest account - https://phabricator.wikimedia.org/T373477#10097647 (bd808) > This project will need to serve publicly-accessible web applications for various development, testing and staging work. Will https://wikitech.wikimedia.org/wiki/H...
[20:21:16] Cloud-VPS (Quota-requests): Request a CloudVPS floating IP for the usdtest account - https://phabricator.wikimedia.org/T373477#10097821 (sbassett) Open→Invalid p:Triage→Low >>! In T373477#10097647, @bd808 wrote: > Will https://wikitech.wikimedia.org/wiki/Help:Using_a_web_proxy_to_reach_Cl...
[20:21:19] Horizon: Use IDP for authentication in Horizon - https://phabricator.wikimedia.org/T359590#10097826 (Andrew) Some doc links: https://docs.openstack.org/keystone/pike/advanced-topics/federation/openidc.html https://platform9.com/blog/openstack-keystone-single-sign-on/ This all seems reasonably possible, but...
[20:40:57] cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure, Patch-For-Review: Remove or replace deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation) - https://phabricator.wikimedia.org/T370460#10097897 (Jdlrobson) Do we have a rough estima...
[21:10:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[21:18:58] (PS1) JHathaway: puppet8: add phd_pass [labs/private] - https://gerrit.wikimedia.org/r/1067430 (https://phabricator.wikimedia.org/T372664)
[21:19:37] (CR) JHathaway: [C:+2] puppet8: add phd_pass [labs/private] - https://gerrit.wikimedia.org/r/1067430 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[21:19:40] (CR) JHathaway: [V:+2 C:+2] puppet8: add phd_pass [labs/private] - https://gerrit.wikimedia.org/r/1067430 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[21:20:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[21:30:56] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[22:10:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[23:41:35] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[23:42:01] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 22.31 ms
[23:48:11] FIRING: [2x] SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[23:53:11] RESOLVED: [2x] SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown